In response to the "depletion" of AI training data, tech giants are trying a variety of approaches
While major technology companies pour effort into training large models, they find themselves facing a thorny problem: the "depletion" of training corpora and data.
Since OpenAI set off the artificial intelligence race, major technology companies have been working hard to train large models. But as those models grow larger, the tech giants are running into a new problem: the "depletion" of training corpora and data.
"Depletion" of training data
GPT-3, launched by OpenAI in 2020, was trained on 300 billion tokens of data. By 2023, Google's new-generation language model PaLM 2 had been trained on more than 3.6 trillion tokens.
As training scales up, technology companies are finding that little suitable corpus data remains available.
Artificial intelligence research firm Epoch, extrapolating from the size of training datasets and the growth of data stocks, predicts that high-quality language data will be exhausted by 2026. It stated: "If our assumptions are correct, data will become the main bottleneck for scaling ML (machine learning) models, and we may see a slowdown in the development of artificial intelligence as a result."
Last year, Sam Altman, CEO of OpenAI, admitted in a speech at a technical conference that AI companies would exhaust all available data on the Internet.
In order to find more new data sources for continuous model training, major technology companies are racking their brains.
As one of Silicon Valley's AI leaders, Google is considering tapping the consumer data stored in Google Docs, Sheets, and Slides.
According to reports, in June last year Google began asking employees to broaden how consumer data is used. Some employees were told that the company wants to use data from the free consumer versions of Google Docs, Sheets, and Slides, and even restaurant reviews on Google Maps.
Google updated its privacy policy in July 2023, but the company said it had not expanded the types of data used to train its artificial intelligence models.
In addition, Google has also targeted its video platform YouTube.
Reports say Google uses text transcripts of YouTube videos to train its artificial intelligence models. Because this could infringe the copyrights of YouTube creators, Google kept the decision confidential internally.
Google's current rules allow it to use YouTube user data to develop new features for the platform itself. But it is unclear whether Google may use YouTube data for commercial services outside the video platform, such as chatbots.
Geoffrey Lottenberg, an intellectual property lawyer at the law firm Berger Singerman, said Google is ambiguous about what can and cannot be done with YouTube video transcripts. "Whether the data can be used for new commercial services remains open to interpretation and could lead to litigation," he said.
OpenAI
Coincidentally, OpenAI has also turned to YouTube data for training.
OpenAI researchers reportedly built a speech recognition tool called Whisper, which can transcribe the audio of YouTube videos into new text. The OpenAI team ultimately transcribed more than one million hours of YouTube videos, and the resulting text was fed into the training of GPT-4, currently one of the most powerful AI models in the world.
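As a rough illustration of what such a transcription pipeline can look like, the sketch below uses the open-source `whisper` Python package to turn locally stored audio files into text. The folder names are hypothetical, and this is only a minimal sketch, not OpenAI's actual internal tooling.

```python
# Minimal sketch: bulk audio-to-text transcription with the open-source
# "whisper" package (pip install openai-whisper). Paths are hypothetical;
# this is not OpenAI's internal pipeline.
from pathlib import Path

import whisper

# Load a pretrained speech-recognition model (sizes include tiny/base/small/medium/large).
model = whisper.load_model("base")

audio_dir = Path("downloaded_audio")   # assumed folder of already-downloaded audio files
output_dir = Path("transcripts")
output_dir.mkdir(exist_ok=True)

for audio_path in sorted(audio_dir.glob("*.mp3")):
    # transcribe() runs the full pipeline: audio loading, chunking, and decoding.
    result = model.transcribe(str(audio_path))
    # The plain-text transcript could then be cleaned and added to a training corpus.
    (output_dir / f"{audio_path.stem}.txt").write_text(result["text"], encoding="utf-8")
    print(f"Transcribed {audio_path.name}: {len(result['text'])} characters")
```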
Reportedly, OpenAI President Greg Brockman was part of the team transcribing YouTube videos; insiders say Brockman personally helped collect the videos.
Currently, YouTube prohibits the use of its videos for applications that are "independent" of the video platform. OpenAI's move may have violated YouTube's rules.
Insiders say some Google employees were aware that OpenAI had been collecting YouTube videos as data but chose not to stop it. The reason is straightforward: Google is also using YouTube data to train its own artificial intelligence models, and if it pursued OpenAI, its own practices might be exposed as well, triggering a strong public backlash.
OpenAI also has another way to cope with the shrinking pool of training data: using synthetic data to train large models.
Synthetic data is data generated by mathematical models, algorithms, or random processes that resembles real data in certain respects but is not collected directly from the real environment. It is commonly used in scenarios such as data preprocessing, model training, and dataset augmentation.
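As a toy illustration of that definition, the snippet below fits a simple statistical model (an empirical mean and covariance) to a small "real" dataset and then samples new synthetic rows from it. All the numbers are invented purely for illustration.

```python
# Toy example: generating synthetic tabular data from a statistical model
# fit to real data. The "real" observations here are invented for illustration.
import numpy as np

rng = np.random.default_rng(seed=0)

# Pretend these are real observations with two correlated features.
real_data = rng.multivariate_normal(
    mean=[10.0, 5.0], cov=[[4.0, 1.5], [1.5, 2.0]], size=1_000
)

# Fit a simple model of the real data: its empirical mean and covariance.
mean = real_data.mean(axis=0)
cov = np.cov(real_data, rowvar=False)

# Sample new rows from the fitted model: statistically similar to the real
# data, but never collected from the real environment.
synthetic_data = rng.multivariate_normal(mean=mean, cov=cov, size=5_000)

print("real mean:", mean.round(2), "synthetic mean:", synthetic_data.mean(axis=0).round(2))
```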
Last May, Altman said: "As long as you can get over the synthetic data event horizon, where the model is smart enough to produce good synthetic data, everything should be fine."
The problem with training artificial intelligence systems on synthetic data is that it can reinforce the models' existing errors and limitations. OpenAI is reportedly researching a process to address this, in which one AI model generates the data and a second AI model judges its quality.
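The reported generate-and-judge setup can be sketched generically as below. Here `generator_model` and `judge_model` are hypothetical placeholders standing in for two separate language models; the details of OpenAI's actual process are not public.

```python
# Generic sketch of a "one model generates, another model judges" loop for
# filtering synthetic training data. generator_model and judge_model are
# hypothetical stand-ins for two separate models; the real process is not public.
from typing import Callable, List


def build_synthetic_corpus(
    prompts: List[str],
    generator_model: Callable[[str], str],
    judge_model: Callable[[str], float],
    min_score: float = 0.8,
) -> List[str]:
    """Generate candidate texts and keep only those the judge scores highly."""
    kept: List[str] = []
    for prompt in prompts:
        candidate = generator_model(prompt)   # model A produces synthetic text
        score = judge_model(candidate)        # model B rates its quality in [0, 1]
        if score >= min_score:                # discard low-quality generations
            kept.append(candidate)
    return kept


# Example usage with trivial stand-in "models":
if __name__ == "__main__":
    corpus = build_synthetic_corpus(
        prompts=["Explain photosynthesis.", "Summarize the water cycle."],
        generator_model=lambda p: f"Synthetic answer to: {p}",
        judge_model=lambda text: 0.9 if len(text) > 20 else 0.1,
    )
    print(len(corpus), "examples kept")
```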
Some artificial intelligence experts warn that synthetic data should be used with caution, since it is a distorted version of real data. But the trend seems hard to reverse: consulting firm Gartner estimates that by 2030, the synthetic data used in artificial intelligence models will exceed real data.
Meta
Reportedly, since the second quarter of last year, senior executives at Meta have been deeply concerned about the shrinking supply of trainable data and have held multiple discussions to find alternatives.
According to a recording of an internal meeting, Ahmad Al-Dahle, Meta's vice president in charge of generative AI, said his team had used almost every English-language book, essay, poem, and news article available on the Internet.
Al-Dahle further said that unless more data could be obtained, Meta could not compete with ChatGPT. In March and April 2023, the company's business development leaders, engineers, and lawyers met almost daily to address the issue.
Although Meta operates a vast social network, it does not have much high-quality data to draw on, because its social platforms are not where people write high-quality, long-form content.
Meta's discussions included the possibility of acquiring the renowned publishing house Simon & Schuster, which has worked with numerous authors and was bought by private equity firm KKR for $1.62 billion last year. Alternatively, Meta could pay a per-book fee to obtain full licensing rights to a book's content.
Databases become a new "vault"
As artificial intelligence companies thirst for training data, "AI data deals" have begun to appear in the market: some companies make money by selling their data to technology companies.
Photobucket is a well-known image-hosting website that once had roughly 70 million users and accounted for nearly half of the US online photo market. According to third-party data, only about 2 million people use Photobucket today, but the AI boom has given the company a "second spring".
Ted Leonard, CEO of Photobucket, said in an interview that he is negotiating with multiple technology companies to license Photobucket's 13 billion photos and videos for training generative AI models.
According to Leonard, photos are priced at 5 cents to 1 dollar each, and videos are priced higher, at more than 1 dollar apiece, though the exact price varies with the buyer and the type of imagery sought.
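Taken at face value, those figures imply an enormous nominal catalog value. The back-of-envelope calculation below simply multiplies the reported photo count by the quoted per-photo price range; it ignores videos and any bulk discounting, so it is illustrative only.

```python
# Back-of-envelope check of the reported figures: 13 billion items priced at
# $0.05 to $1.00 each. Ignores videos and bulk discounts; illustrative only.
num_items = 13_000_000_000
low_price, high_price = 0.05, 1.00

low_total = num_items * low_price    # $650 million
high_total = num_items * high_price  # $13 billion

print(f"Nominal catalog value: ${low_total:,.0f} to ${high_total:,.0f}")
```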
Another image provider, Shutterstock, also said that in the months following ChatGPT's debut, companies such as Meta, Google, Amazon, and Apple struck deals with it to use hundreds of millions of images, videos, and music files from its library for training.
Shutterstock CFO Jarrod Yahes said deals with large technology companies initially ran from $25 million to $50 million each, and most companies later expanded them. Yahes added that smaller technology companies have followed suit, sparking a new wave of buying over the past two months. Shutterstock's early customers reportedly included OpenAI.
Researchers at Business Research Insights estimate that the artificial intelligence data market is currently worth about $2.5 billion and could grow to $30 billion within a decade.