AI technologies are only as good as their datasets, and San Francisco-based startup OpenAI aims to improve its own, particularly in non-English languages.
OpenAI today announced its Data Partnerships program, which seeks to partner with businesses and organizations around the world to create open-source and private datasets for AI training. Those intrigued by the prospect can express interest on the OpenAI website.
“Modern AI technology learns skills and aspects of our world — of people, our motivations, interactions, and the way we communicate — by making sense of the data on which it’s trained,” OpenAI states on its website. “To ultimately make AGI that is safe and beneficial to all of humanity, we’d like AI models to deeply understand all subject matters, industries, cultures, and languages, which requires as broad a training dataset as possible.”
Especially valuable are datasets that “reflect human society and are not already easily accessible online to the public today.” The data can include text, images, audio, or video.
“We’re particularly looking for data that expresses human intention (e.g. long-form writing or conversations rather than disconnected snippets), across any language, topic, and format,” the website states.
The company will work with partners to appropriately digitize and structure the data using tools like optical character recognition, which digitizes PDFs and similar files, and automatic speech recognition to capture the spoken word.
“If the data needs cleaning (e.g. has lots of auto-generated artifacts or transcription errors), we can work with your team to process it into the most useful form,” OpenAI states. “We are not seeking datasets with sensitive or personal information, or information that belongs to a third party; we can work with you to remove this information if you need help.”
The Data Partnerships program is structured into two categories, with more possibly to follow as the program expands. The first is an open-source archive for training language models. It would be available for anyone to use in training AI models, and OpenAI itself could use it to safely train additional open-source models.
“We believe open-source plays an important role in the ecosystem,” the organization states.
The second category is private datasets for training proprietary AI models, including “foundation models and fine-tuned and custom models.” These datasets are best suited to companies and organizations that wish to keep their data private but want AI models to better understand their countries, cultures, and languages of origin.
“We’ll treat your data with the level of sensitivity and access controls that you prefer,” OpenAI states.
Controversy has arisen over the past several months regarding the ethics and legality of training AI on copyrighted material, with The New York Times successfully removing its work from one training dataset. As AI researchers and developers navigate uncharted legal waters, OpenAI sees voluntarily given datasets as essential to its path forward.
“Overall, we are seeking partners who want to help us teach AI to understand our world in order to be maximally helpful to everyone,” OpenAI states on its website. “Together, we can move towards (artificial general intelligence) that benefits all of humanity.”