
Amazon Has Secret Workaround to Scrape GitHub for AI Training Data

Amazon’s GitHub Data Collection Strategy

Amazon is working to circumvent data collection limits on Microsoft’s GitHub in order to gather metadata for training its in-house AI models, according to an internal memo. The company has directed its employees to create and share GitHub accounts to expedite the data scraping process.

The memo, from Amazon’s Artificial General Intelligence Group, cited the need for both quantitative and qualitative metadata from GitHub. Because GitHub limits each account to 5,000 data-collection requests per hour, Amazon is looking to speed up collection by running multiple accounts simultaneously.
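To see why the per-account limit matters, the arithmetic can be sketched in a few lines of Python. Only the 5,000-requests-per-hour figure comes from the memo as reported; the total request count and the number of accounts below are hypothetical values chosen for illustration, and the function name is an assumption of this sketch.

```python
# Limit reported in the memo: 5,000 data-collection requests per hour per account.
REQUESTS_PER_HOUR_PER_ACCOUNT = 5_000


def hours_to_collect(total_requests: int, accounts: int) -> float:
    """Return the hours needed to issue total_requests when every
    account runs in parallel at the full hourly limit."""
    if accounts < 1:
        raise ValueError("accounts must be at least 1")
    return total_requests / (REQUESTS_PER_HOUR_PER_ACCOUNT * accounts)


# Hypothetical workload: one million requests.
print(hours_to_collect(1_000_000, accounts=1))   # 200.0
print(hours_to_collect(1_000_000, accounts=20))  # 10.0
```

Under these assumed numbers, a single account would need more than eight days of continuous collection, while twenty accounts would finish in ten hours.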

Rohit Prasad, head scientist and senior VP of Amazon’s AI group, has encouraged employees to participate in this data collection effort. Amazon aims to use the data to train its most ambitious AI project to date, one intended to help it compete with rivals like Microsoft, Google, and Meta in the generative AI space.

The practice raises ethical concerns because it involves accessing GitHub data without explicit permission, potentially violating license agreements. Microsoft itself faces a lawsuit on similar grounds over its Copilot AI service.

Amazon has stated its commitment to protecting rightsholders and adhering to industry best practices in data collection. The company also said it has systems in place to properly credit open-source developers.

According to the internal memo, Amazon’s legal and security teams approved the GitHub workaround. The memo also reassured employees that the strategy stays within GitHub’s rate limits to avoid account blocks.

Employees were advised to sign up with their Amazon work emails and create classic personal access tokens to aid in the data collection process. These tokens provide access to a wider range of repositories but might be less secure.

Source: Amazon Has Secret Workaround to Scrape GitHub for AI Training Data.