Common Crawl is a massive archive of web crawl data created by a small nonprofit that has become a central building block for generative AI (or more specifically LLMs) due to its size and free availability. Yet so far, its role and influence on generative AI have not received much attention. To fill this gap, I studied Common Crawl in depth and considered both the positive and negative implications of its popularity among LLM builders. You can read the full report here. I'm sharing it because I think it's interesting for this sub, and I'm curious what you think.
Some key takeaways:
- Common Crawl has existed since 2007, and providing data for AI training has never been its primary goal. Its mission is to level the playing field for technology development by giving free access to data that only companies like Google used to have
- Using Common Crawl's data does not easily align with trustworthy and responsible AI development because Common Crawl deliberately does not curate its data. It doesn't remove hate speech, for example, because it wants its data to be useful for researchers studying hate speech
- Common Crawl's archive is massive, but far from being a “copy of the internet.” Its crawls are automated to prioritize pages on domains that are frequently linked to, making digitally marginalized communities less likely to be included. Moreover, most captured content is English
- In addition, major domains like Facebook and the New York Times block Common Crawl from crawling most (or all) of their pages. These blocks are increasing, creating new biases in the crawled data
- Due to Common Crawl’s deliberate lack of curation, AI builders need to filter it with care, but such care is often lacking. Filtered versions of Common Crawl that are popular for training LLMs, such as C4, are especially problematic, as the filtering techniques used to create them are simplistic and leave lots of harmful content untouched (see the sketch after this list)
- Both Common Crawl and AI builders can help make generative AI less harmful. Common Crawl should highlight the limitations and biases of its data, be more transparent and inclusive about its governance, and enforce more transparency by requiring AI builders to attribute their use of Common Crawl
- AI builders should put more effort into filtering Common Crawl and establish industry standards and best practices to reduce potential harms in end-user products built on Common Crawl or similar sources of training data
- A key issue is that filtered Common Crawl versions are not updated after their original publication to take feedback and criticism into account. Therefore, we need dedicated intermediaries tasked with filtering Common Crawl in transparent and accountable ways and with keeping those filtered versions continuously updated
- Long term, there should be less reliance on sources like Common Crawl and a bigger emphasis on training generative AI on datasets created and curated by people in equitable and transparent ways
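To make the "simplistic filtering" point concrete, here is a minimal sketch of a blocklist-style document filter, similar in spirit to the word-list filtering used to build datasets like C4. This is not the actual C4 pipeline; the blocklist and documents are hypothetical stand-ins, just to show why this kind of filtering both misses harmful content and discards benign content.

```python
# Minimal sketch of a simplistic blocklist filter (hypothetical, not the real C4 code).
# A document is dropped if it contains any blocklisted word, kept otherwise.

BLOCKLIST = {"badword1", "badword2"}  # placeholder terms; real lists contain hundreds of words

def keep_document(text: str) -> bool:
    """Return True if the document contains no blocklisted word."""
    words = set(text.lower().split())
    return words.isdisjoint(BLOCKLIST)

# Two failure modes of this approach:
# 1. Harmful content phrased without blocklisted words passes straight through.
# 2. Benign documents that merely mention a blocklisted word (e.g. research or
#    medical text) are thrown away, skewing what remains in the dataset.
docs = [
    "A neutral article about web crawling.",
    "Hateful content written entirely in polite words.",   # kept, despite being harmful
    "A study that quotes badword1 to analyze hate speech."  # dropped, despite being benign
]
print([d for d in docs if keep_document(d)])
```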