
Dealing with Missing and Synthetic Data


September 14, 2023

David Berenstein (Argilla)

At Argilla, we are creating an open-source data platform for LLMs, through which everyone can build robust small and large language models with human and machine feedback. Given my expertise in Natural Language Processing (NLP), I'll delve into the crucial aspects of missing and synthetic data, particularly in the context of textual data.

Potential biases

Within the realm of textual data, we can think of data as the raw text itself, but since the introduction of fine-tuning through transfer learning, core models like BERT have come to make up part of what I would consider knowledge or data. In that sense, missing data can be the root cause of significant biases and inequalities. We have seen this with new models that are first, and sometimes only, available within their own domain, language or cultural space, and, despite their more general nature, the rise of LLMs makes this clearer than ever. Luckily, we have seen some ways in which these biases can be bridged.

Dealing with biases

For feature data, we often rely on concepts like deletion, imputation and prediction of individual missing values to bridge these biases (sketched below); for NLP, however, the solution space is often more model-focused. Don't get me wrong: the individual data values that we do have still need to be properly curated and of the highest possible quality, because this same data is required for the model-focused solutions to work.
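To make the classic feature-data strategies concrete, here is a minimal sketch using pandas and scikit-learn. The column names and the mean strategy are illustrative assumptions, not a recommendation.

```python
# A minimal sketch of classic missing-value handling for tabular feature data.
# The column names and the mean strategy are illustrative assumptions.
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    "age": [25, np.nan, 41, 33],
    "income": [52000, 61000, np.nan, 48000],
})

# Deletion: simply drop rows that contain missing values.
dropped = df.dropna()

# Imputation: replace missing values with a column statistic (here the mean).
imputer = SimpleImputer(strategy="mean")
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

print(dropped)
print(imputed)
```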

Besides the general push towards creating core models for lower-resource languages and domains, we have also seen solutions that impute knowledge based on smart heuristics, like ensemble model workflows, cross-lingual model capabilities and (even) more general models.

An example of an ensemble model workflow that I worked with during my graduation thesis is the use of paraphrasing methods to diversify data. There are several direct paraphrase models available, but another great approach is back-translation, where n-hop translations between different languages cause the text to deform slightly. Using Google Translate, we can even do this with a span-preserving setting, which can be useful for span-based tasks like entity extraction. We can also apply a 1-hop translation to use lower-resource-language data with an English transformer model.
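As a rough illustration, a 2-hop back-translation (English to French and back) can be sketched with off-the-shelf translation models from the Hugging Face Hub. The specific Helsinki-NLP checkpoints are an assumption here, and the span-preserving setting mentioned above is not shown.

```python
# A minimal back-translation sketch: English -> French -> English.
# The Helsinki-NLP checkpoints are an assumption; any translation model pair works.
from transformers import pipeline

to_fr = pipeline("translation", model="Helsinki-NLP/opus-mt-en-fr")
to_en = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")

text = "The customer asked for a refund because the package arrived damaged."

# 1-hop: translate to the pivot language.
french = to_fr(text)[0]["translation_text"]

# 2-hop: translate back, yielding a slightly deformed paraphrase of the original.
paraphrase = to_en(french)[0]["translation_text"]

print(paraphrase)
```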

Due to lexical and semantic overlap between languages, it has proven possible to train models with cross-lingual properties. These models can make predictions across multiple languages, so fine-tuning a model on language A also makes it possible to produce decent predictions for language B. This approach cannot easily be benchmarked, but it does offer a way to bootstrap higher-quality data, which in turn can be curated for training a language-specific model.
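To make the idea concrete, here is a hedged sketch of zero-shot cross-lingual transfer with a multilingual NLI model from the Hugging Face Hub. The joeddav/xlm-roberta-large-xnli checkpoint, the Dutch example and the candidate labels are assumptions chosen for illustration.

```python
# A sketch of cross-lingual transfer: a multilingual model fine-tuned largely on
# English NLI data (language A) classifying Dutch text (language B).
# The checkpoint and labels are illustrative assumptions.
from transformers import pipeline

classifier = pipeline(
    "zero-shot-classification",
    model="joeddav/xlm-roberta-large-xnli",
)

dutch_text = "De levering was te laat en de verpakking was beschadigd."
labels = ["complaint", "compliment", "question"]

result = classifier(dutch_text, candidate_labels=labels)
print(result["labels"][0], result["scores"][0])
```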

Lastly, we’ve come to the section on (even) more general models, like LLMs. Simply put, this is achieved with more data and parameters. Everyone reading this has heard stories about the vast volumes of data and numbers of parameters of the recent OpenAI models, but the researchers behind the previously mentioned cross-lingual models, and even “Attention Is All You Need”, found that this direction was a viable way to significantly improve performance.

Synthetic data

Their generative nature and the fact that LLMs can be tuned to respond to instructions allow them to perform across a virtually infinite number of tasks, including the ones above. This also makes them the perfect candidate for diverse synthetic text generation, because we can more directly influence the way this data is generated through proper prompt formulation. The art of formulating prompts for LLMs has proved to be so important that we have even seen job postings for so-called prompt engineers.
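As a sketch of what prompt-driven synthetic data generation can look like, the snippet below asks an instruction-tuned LLM for a handful of examples. The openai client, the model name and the prompt wording are all assumptions; any instruction-following model could be swapped in.

```python
# A hedged sketch of prompt-driven synthetic data generation with an
# instruction-tuned LLM. The client, model name, and prompt are assumptions;
# any instruction-following model could be used instead.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

prompt = (
    "Generate 5 short customer-support messages about a delayed delivery, "
    "each on its own line, written in informal Dutch."
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # illustrative model name
    messages=[{"role": "user", "content": prompt}],
)

synthetic_examples = response.choices[0].message.content.splitlines()
for example in synthetic_examples:
    print(example)
```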

This way of dealing with missing data through synthetic data generation still comes with several risks and difficulties. One of them is that LLMs, too, are mostly available as English-first models, even though they have empirically proven to work for other languages as well. Additionally, the closed-source nature of many LLMs leads to vendor lock-in, privacy risks and licensing issues, and even the LLMs that are released as open source sometimes have licenses that prohibit their use for synthetic data generation. This once again shows that we still need to gather and curate high-quality data so that we can also fine-tune our own models, including LLMs.

Get started

Do you want to see an actual use case of synthetic data as a solution to missing data? Take a look at this tutorial, or watch this video about creating your own LLM dataset.