Premium publishers had their data scraped more than we thought

Mojahid

November 10, 2024

A serious issue in AI is how AI companies gather data to train their models. Publishers like The New York Times are suing OpenAI and Microsoft for scraping their content to train ChatGPT. While AI companies extract the majority of their data from publicly available sources, it seems that they gather data from more premium publishers than we’d think.

AI companies using paywalled content to train their models is still a legal gray area. It’s debated whether this is technically copyright infringement. If the chatbot in question reproduces entire sections of the paid content, then that could be grounds for a lawsuit. This is one reason for the New York Times lawsuit. It’s also why AI companies want to cut deals with so many publishers: to avoid legal troubles, among other reasons. The only issue is that these AI companies were most likely scraping paywalled data long before the publications knew about it.

AI companies scrape more data from premium publishers than many think

A new report from Ziff Davis (via Axios) has just shed some light on how much premium content AI companies have scraped. For the report, co-authors George Wukoson and Joey Fortuna analyzed a number of LLMs and the content used to train them. What they found was that a large amount of the data used to train some of the largest models came from 15 premium publications.

One major example was GPT-2, which was trained by OpenAI. The researchers took OpenWebText, an open-source replica of the WebText dataset that OpenAI used to train the model. They found that about 10% of the data in that dataset came from premium websites. Other datasets used to train older models also drew heavily on data from premium sites.

This means that some of the older LLMs (probably models that never powered user-facing chatbots) were trained on a large amount of data from premium sites. Even so, the report found that some of these older datasets are still being used to train newer models. This means that newer models could still be using paywalled material.

So, while a number of publications have been making deals with AI companies, the AI models powering many of the most powerful chatbots on the market are still using data taken from paywalled content.
