

It’s an open secret that the data sets used to train AI models are deeply flawed.
Image corpora tend to focus on the United States and the West, in part because Western images dominated the Internet when the data sets were compiled. And as a study from the Allen Institute for AI most recently highlighted, the data used to train large language models like Meta’s Llama 2 contains toxic language and biases.
The models amplify these defects in harmful ways. Now, OpenAI says it wants to combat them by partnering with external institutions to create new and hopefully improved data sets.
OpenAI today announced Data Partnerships, an effort to collaborate with third-party organizations to create public and private data sets for training AI models. In a blog post, OpenAI says Data Partnerships aim to “enable more organizations to help lead the future of AI” and “benefit from models that are more useful.”
“To ultimately make [AI] that is safe and beneficial for all humanity, we would like AI models to deeply understand all topics, industries, cultures and languages, which requires as broad a training data set as possible,” writes OpenAI. “Including your content can make AI models more useful to you by increasing their understanding of your domain.”
As part of the Data Partnerships program, OpenAI says it will collect “large-scale” data sets that “reflect human society” and are not easily accessible online today. While the company plans to work across a wide range of modalities, including images, audio and video, it is particularly looking for data that “expresses human intent” (for example, long-form writing or conversations) across different languages, topics and formats.
OpenAI says it will work with organizations to digitize training data where needed, using a combination of optical character recognition and automatic speech recognition tools, and will remove personal or sensitive information if necessary.
To start, OpenAI is looking to create two types of data sets: an open source data set that would be public for anyone to use in training AI models, and a set of private data sets for training proprietary AI models. The private sets are intended for organizations that want to keep their data private but want OpenAI models to better understand their domain, OpenAI says. So far, OpenAI has worked with the Icelandic government and Miðeind ehf to improve GPT-4’s ability to speak Icelandic and with the Free Law Project to improve its models’ understanding of legal documents.
“In general, we are looking for partners who want to help us teach AI to understand our world so that it is of maximum use to everyone,” writes OpenAI.
So can OpenAI perform better than the many dataset-building efforts that preceded it? I’m not so sure: minimizing bias in data sets is a problem that has stumped many of the world’s experts. At the very least, I hope the company is transparent about the process and the challenges it inevitably encounters when creating these data sets.
Despite the grandiose language of the blog post, there also appears to be a clear commercial motivation here: to improve the performance of OpenAI’s models at the expense of others’, and without any compensation to the data owners. I guess that’s within OpenAI’s right. But it seems a bit tone-deaf in light of open letters and lawsuits from creatives who allege that OpenAI trained many of its models on their work without their permission or payment.