
The GeForce GPU giant scraped 80 years' worth of video each day for AI training in order to "unlock multiple downstream applications critical to Nvidia."

Leaked documents, including spreadsheets, emails, and chat messages, show that Nvidia used millions of videos from YouTube, Netflix, and other sources to create an AI model for its Omniverse, autonomous vehicle, and digital avatar platforms.

404 Media, which investigated the documents, reported the astonishing, if perhaps unsurprising, extent of the data scraping. Staff on an internal project codenamed Cosmos (which shares a name with Nvidia's Cosmos deep learning service but is a separate effort) used dozens of virtual machines on Amazon Web Services to download so many videos each day that Nvidia accumulated more than 30 million URLs within a month.

Employees repeatedly discussed copyright law and usage rights, and found creative ways to avoid any direct violation. Nvidia, for example, used Google's cloud services to download the YouTube-8M dataset, since downloading videos directly from YouTube is not allowed by the platform's terms of service.

In a leaked Slack discussion, one person said that "we cleared downloads with Google/YouTube in advance and dangled the carrot that we would do so using Google Cloud." Google should make some money either way; after all, 8 million videos would normally generate plenty of ad views.

404 Media asked Nvidia for comment on the legal and ethical aspects of using copyrighted material for AI training. The company responded that it was "in full compliance with both the letter and spirit of copyright laws."

Some datasets may only be used for academic purposes, and Nvidia does conduct a lot of research, both internally and with other institutions. The leaked materials, however, show that the scraped data is intended for commercial use.

Nvidia is not the only company doing this: OpenAI and Runway have both been accused of using copyrighted material to train their AI models. You might think Nvidia would have no problem using video content from its own GeForce Now service, but the leaked documents prove otherwise.

A senior researcher at Nvidia explained why to other employees: "We do not yet have statistics or videos because the infrastructure is not yet setup to capture lots of game videos and actions." There are both engineering and regulatory hurdles to clear.

There is no avoiding the fact that AI models must be trained on billions of data points. Some datasets are subject to very strict rules, while others have laxer restrictions. Laws governing the use of copyrighted material, however, are very clear.

Video content can also contain personal data. The US has no single federal privacy law, but there are many regulations governing the collection and use of personal data. In the EU, the General Data Protection Regulation (GDPR) clearly defines how personal data can be used, even by companies based outside the EU.

You might also wonder what happens if Nvidia, for example, is found to have violated certain regulations while training its AI models. If that system is used globally, would it be blocked in certain countries? Would Nvidia be willing to create a new model, trained only on fully licensed data, just for those locations? Is it even possible to retrain such a system with legal data?

No matter how you feel about AI, there is no doubt that transparency should be a priority, especially when personal and copyrighted data is used for commercial purposes. Data scraping will continue unchecked if tech companies don't face accountability.
