Download small sample of AWS Common Crawl to local machine via http

I'm interested in downloading the raw text of a tiny subset of the AWS Common Crawl, tens of megabytes at most, as a corpus for information retrieval tests.

The Common Crawl pages suggest I need an S3 account and/or a Java program to access it, and then I'd be sifting through hundreds of gigabytes of data when all I need is a few dozen megabytes.

There's some code here, but it requires an S3 account and access (although I do like Python).

Is there a way I can form an http(s) URL that will let me get a tiny cross-section of a crawl for my purposes? I believe I saw a page that suggested a way to structure the directory path with day, hour, and minute, but I can't seem to find that page again.

Thanks!

It's quite easy: just choose a single WARC (or WAT or WET) file at random from any monthly crawl. The crawls are announced here: https://commoncrawl.org/connect/blog/

  1. Take the latest crawl (e.g. April 2019).
  2. Navigate to the WARC file list and download it (same for WAT or WET).
  3. Randomly select one file.
  4. Prefix the path with https://commoncrawl.s3.amazonaws.com/ (there is a description in the blog post) and download it.

You're done, because every WARC/WAT/WET file is itself a random sample. Need more data? Just pick more files at random.
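For what it's worth, here is a minimal Python sketch of those four steps. It assumes the crawl id CC-MAIN-2019-18 for the April 2019 crawl and the https://commoncrawl.s3.amazonaws.com/ prefix mentioned in the blog post; swap in whichever crawl and file type (WARC/WAT/WET) you want.

```python
import gzip
import random
import urllib.request

PREFIX = "https://commoncrawl.s3.amazonaws.com/"
CRAWL = "CC-MAIN-2019-18"  # assumed id for the April 2019 crawl

# Steps 1-2: download the WET file list (use warc.paths.gz / wat.paths.gz for WARC/WAT)
list_url = PREFIX + "crawl-data/" + CRAWL + "/wet.paths.gz"
with urllib.request.urlopen(list_url) as resp:
    paths = gzip.decompress(resp.read()).decode().splitlines()

# Step 3: randomly select one file path
path = random.choice(paths)

# Step 4: prefix the path and download it over plain HTTPS
with urllib.request.urlopen(PREFIX + path) as resp, open("sample.warc.wet.gz", "wb") as out:
    out.write(resp.read())
```

A single WET file is still a few hundred megabytes compressed, but since these files are concatenations of independent gzip members (one per record), you could also send an HTTP Range request for just the first few dozen megabytes and read the complete records up to the truncation point with a WARC-aware reader such as warcio.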
