Tech Firms Including Apple Caught Using YouTube Data to Train AI Models
Apple, Nvidia, Anthropic, and Salesforce have been found using YouTube data to develop their AI models. An investigation by Proof News, co-published with Wired, revealed that subtitles from YouTube videos were extracted without permission and used to train large language models (LLMs), the type of model that powers chatbots such as ChatGPT. Although video imagery was not involved, the use of YouTube data raises significant concerns about the unauthorized exploitation of creators’ content by tech companies.
YouTube has explicitly stated that utilizing its content for AI training violates its terms of service (ToS). Despite this, YouTube remains a crucial resource for generative AI, especially as the race for text-to-video models accelerates. Marques Brownlee (MKBHD) highlighted that while Apple sourced data from companies that scraped YouTube transcripts, Apple itself technically avoided direct involvement in the scraping process.
Approximately 180,000 YouTube videos were identified in the dataset used by these tech firms. The dataset, known as The Pile and compiled by a nonprofit, draws on a variety of sources, including Wikipedia articles, books, and Enron emails. Jennifer Martinez, a spokesperson for Anthropic, said that The Pile contains a small subset of YouTube subtitles. She asserted that Anthropic’s use of the dataset does not directly conflict with YouTube’s ToS, and that any question of a violation is a matter for the dataset’s creators.
Apple, Nvidia, and YouTube did not comment, which further underscores how sensitive the question of training data sources has become. After previous controversies, tech firms generally prefer not to disclose the origins of their training data. For instance, OpenAI’s CTO, Mira Murati, has declined to detail the data sources for the company’s upcoming video generator, Sora, stating only that the data was publicly available or licensed.
Google CEO Sundar Pichai reaffirmed in an interview with The Verge that using video content, including subtitles, without adhering to YouTube’s terms constitutes a violation of its ToS. He emphasized the expectation that companies comply with these conditions when building products.