Download small sample of AWS Common Crawl to local machine via http

I'm interested in downloading the raw text of a tiny subset of the AWS Common Crawl, tens of megabytes at most, as a corpus for information retrieval tests.

The Common Crawl pages suggest I need an S3 account and/or a Java program to access it, and then I'd be sifting through hundreds of gigabytes of data when all I need is a few dozen megabytes.

There's some code here, but it requires an S3 account and access (although I do like Python).

Is there a way I can form an http(s) URL that will let me get a tiny cross-section of a crawl for my purposes? I believe I saw a page that suggested a way to structure the directory path with day, hour, and minute, but I can't seem to find that page again.

Thanks!

It's quite easy: just choose a single WARC (or WAT or WET) file at random from any monthly crawl. The crawls are announced here: https://commoncrawl.org/connect/blog/

  1. Take the latest crawl (e.g. April 2019).
  2. Navigate to the WARC file list and download it (same for WAT or WET).
  3. Randomly select one file.
  4. Prefix the path with https://commoncrawl.s3.amazonaws.com/ (there is a description in the blog post) and download it.

You're done, because every WARC/WAT/WET file is itself a random sample. Need more data? Just pick more files at random.
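For what it's worth, here is a minimal Python sketch of those four steps. It assumes the crawl id CC-MAIN-2019-18 for the April 2019 crawl and the https://commoncrawl.s3.amazonaws.com/ prefix mentioned in the blog post; swap in whichever crawl and file type (WARC/WAT/WET) you want.

```python
import gzip
import random
import urllib.request

PREFIX = "https://commoncrawl.s3.amazonaws.com/"
CRAWL = "CC-MAIN-2019-18"  # assumed id for the April 2019 crawl

# Steps 1-2: download the WET file list (use warc.paths.gz / wat.paths.gz for WARC/WAT)
list_url = PREFIX + "crawl-data/" + CRAWL + "/wet.paths.gz"
with urllib.request.urlopen(list_url) as resp:
    paths = gzip.decompress(resp.read()).decode().splitlines()

# Step 3: randomly select one file path
path = random.choice(paths)

# Step 4: prefix the path and download it over plain HTTPS
with urllib.request.urlopen(PREFIX + path) as resp, open("sample.warc.wet.gz", "wb") as out:
    out.write(resp.read())
```

A single WET file is still a few hundred megabytes compressed, but since these files are concatenations of independent gzip members (one per record), you could also send an HTTP Range request for just the first few dozen megabytes and read the complete records up to the truncation point with a WARC-aware reader such as warcio.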
