Download a small sample of AWS Common Crawl to a local machine via HTTP
I'm interested in downloading the raw text of a tiny subset of the AWS Common Crawl, tens of megabytes at most, as a corpus for information retrieval tests.
The Common Crawl pages suggest I need an S3 account and/or a Java program to access it, and then I'd be sifting through hundreds of gigabytes of data when all I need is a few dozen megabytes.
There's some code here, but it requires an S3 account and access (although I do like Python).
Is there a way I can form an http(s) URL that will let me get a tiny cross-section of a crawl for my purposes? I believe I looked at a page that suggested a way to structure the directory by day, hour, and minute, but I cannot seem to find that page again.
Thanks!
It's quite easy: just pick a single WARC (or WAT, or WET) file at random from any monthly crawl. The crawls are announced here: https://commoncrawl.org/connect/blog/
You're done, because every WARC/WAT/WET file is a random sample in its own right. Need more data? Just pick more files at random.
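Since you mentioned Python, here is a minimal sketch of that approach using only the standard library. It assumes the crawl data is served over plain HTTPS from `https://data.commoncrawl.org/` and that each crawl publishes a gzipped file listing at `crawl-data/<crawl-id>/wet.paths.gz`; check both against the announcement for the crawl you pick. The function names and the `crawl_id` parameter are made up for illustration. WET files are used because they hold the extracted plain text.

```python
import gzip
import random
import urllib.request

# Assumption: public HTTPS endpoint for Common Crawl data.
# Verify it against the announcement of the crawl you choose.
BASE_URL = "https://data.commoncrawl.org/"


def parse_paths(listing_gz: bytes) -> list[str]:
    """Decompress a *.paths.gz listing into a list of relative file paths."""
    text = gzip.decompress(listing_gz).decode("utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]


def sample_urls(paths: list[str], k: int = 1, seed=None) -> list[str]:
    """Pick k paths at random (without replacement) and turn them into URLs."""
    rng = random.Random(seed)
    chosen = rng.sample(paths, min(k, len(paths)))
    return [BASE_URL + path for path in chosen]


def fetch(url: str) -> bytes:
    """Download a URL fully into memory (fine for the small listing file)."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def download_sample(crawl_id: str, k: int = 1, kind: str = "wet") -> list[str]:
    """Download k randomly chosen files of one crawl into the current directory.

    crawl_id is a hypothetical parameter: the crawl identifier taken from
    the announcement. The listing location below is an assumed layout.
    """
    listing = fetch(f"{BASE_URL}crawl-data/{crawl_id}/{kind}.paths.gz")
    saved = []
    for url in sample_urls(parse_paths(listing), k):
        filename = url.rsplit("/", 1)[-1]
        urllib.request.urlretrieve(url, filename)  # streams to disk
        saved.append(filename)
    return saved
```

Calling `download_sample(crawl_id)` with the identifier of the crawl you chose should leave one gzipped WET file in the working directory; raise `k` if you need more data.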