Mapping a range of warc.gz files, EMR

I have been running a streaming step in AWS EMR, with a mapper and reducer written in Python, to map some of the archives in Common Crawl for sentiment analysis.

I am moving from the older Common Crawl textData format to the newer warc.gz format, and I need to know how to specify a range of warc.gz files as my EMR input.

For example:

In the older format I could specify an input range of textData files like this:

s3://aws-publicdatasets/common-crawl/parse-output/segment/1341690165636/textData-000[0-9][0-9]

but the new format looks like this:

First file:

s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2016-07/segments/1454702039825.90/warc/CC-MAIN-20160205195359-00000-ip-10-236-182-209.ec2.internal.warc.gz

Second file:

s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2016-07/segments/1454702039825.90/warc/CC-MAIN-20160205195359-00001-ip-10-236-182-209.ec2.internal.warc.gz

How would I specify a range of these warc.gz files to map?

I'm pretty sure you can use the same method you were using previously. To read just those two files you would use:

s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2016-07/segments/1454702039825.90/warc/CC-MAIN-20160205195359-0000[0-1]-ip-10-236-182-209.ec2.internal.warc.gz
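To sanity-check which file names a character-class pattern like this would select before launching a step, here is a minimal sketch using Python's `fnmatch`. It assumes a simple `[0-1]` class behaves the same way in the input path globbing as it does in `fnmatch`; the helper name `select` and the third candidate name (sequence `00002`) are made up for illustration.

```python
from fnmatch import fnmatchcase

PREFIX = ("s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2016-07/"
          "segments/1454702039825.90/warc/")
PATTERN = (PREFIX +
           "CC-MAIN-20160205195359-0000[0-1]-ip-10-236-182-209.ec2.internal.warc.gz")


def select(paths, pattern=PATTERN):
    """Return the paths that the glob-style pattern matches (case-sensitive)."""
    return [p for p in paths if fnmatchcase(p, pattern)]


# Three candidate names; the one ending in -00002- is hypothetical.
candidates = [
    PREFIX + "CC-MAIN-20160205195359-%05d-ip-10-236-182-209.ec2.internal.warc.gz" % i
    for i in range(3)
]
matched = select(candidates)
for path in matched:
    print(path)
```

Only the names with sequence numbers `00000` and `00001` are kept; widening the class (for example `0000[0-9]`) widens the range in the same way as in the old textData pattern.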

Also, since these paths are richer than the previous ones, you have additional ways to specify sets of data to process.

CC-MAIN-2016-07 is CC-MAIN-YYYY-ww, so you can specify a set of years or weeks to process.

CC-MAIN-20160205195359 is CC-MAIN-YYYYMMDDHHmmss, so you can choose a date or time range.
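As a rough illustration of the naming scheme described above, the sketch below pulls the year/week and the timestamp out of one of the paths from the question. The function name `describe`, the regular expressions, and the reading of the five-digit field as a file sequence number are assumptions made for this example, not part of any Common Crawl tooling.

```python
import re
from datetime import datetime

# CC-MAIN-YYYY-ww appears as a directory name in the path.
CRAWL_RE = re.compile(r"CC-MAIN-(\d{4})-(\d{2})/")
# CC-MAIN-YYYYMMDDHHmmss-NNNNN- appears at the start of the file name.
NAME_RE = re.compile(r"CC-MAIN-(\d{14})-(\d{5})-")


def describe(path):
    """Extract crawl year/week, timestamp and sequence number from a WARC path."""
    crawl = CRAWL_RE.search(path)
    name = NAME_RE.search(path)
    if crawl is None or name is None:
        raise ValueError("path does not follow the expected naming scheme")
    return {
        "year": int(crawl.group(1)),
        "week": int(crawl.group(2)),
        "timestamp": datetime.strptime(name.group(1), "%Y%m%d%H%M%S"),
        "sequence": int(name.group(2)),  # assumed meaning of the 5-digit field
    }


path = ("s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2016-07/"
        "segments/1454702039825.90/warc/"
        "CC-MAIN-20160205195359-00000-ip-10-236-182-209.ec2.internal.warc.gz")
info = describe(path)
print(info)
```

With the fields separated like this, a date or time range can be selected by comparing `timestamp` values instead of writing a glob by hand.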

You can download the lists of WARC, WAT, and WET files for the July 2016 crawl via:

https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-30/warc.paths.gz
https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-30/wat.paths.gz
https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-30/wet.paths.gz

To access a file via a browser, prepend this to a path listed in the file:

commoncrawl.s3.amazonaws.com/

In your case, to access it via S3, try prepending this to the path:

s3://commoncrawl/
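Putting the two steps together, a minimal sketch might download the gzipped path list and prepend the S3 prefix to every entry. It assumes the list contains one relative path per line; the function names `to_s3_uris` and `fetch_s3_uris` are hypothetical, and the download itself is not run here.

```python
import gzip
import urllib.request

PATHS_URL = ("https://commoncrawl.s3.amazonaws.com/"
             "crawl-data/CC-MAIN-2016-30/warc.paths.gz")


def to_s3_uris(gz_bytes, prefix="s3://commoncrawl/"):
    """Decompress a gzipped path list and prepend `prefix` to each non-empty line."""
    text = gzip.decompress(gz_bytes).decode("utf-8")
    return [prefix + line.strip() for line in text.splitlines() if line.strip()]


def fetch_s3_uris(url=PATHS_URL):
    """Download the path list from `url` and return full S3 URIs (needs network access)."""
    with urllib.request.urlopen(url) as resp:
        return to_s3_uris(resp.read())


if __name__ == "__main__":
    uris = fetch_s3_uris()
    print(len(uris), "files listed; first:", uris[0])
```

Passing `prefix="https://commoncrawl.s3.amazonaws.com/"` instead gives the browser-accessible form, and slicing the returned list (for example `uris[:100]`) gives an explicit range of files.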
