How to write a streaming MapReduce job for WARC files in Python
I am trying to write a MapReduce job for WARC files using the Python warc library. The following code works for me, but I need this code to run as a Hadoop MapReduce job.
import warc
f = warc.open("test.warc.gz")
for record in f:
    print record['WARC-Target-URI'], record['Content-Length']
I want this code to read streaming input from WARC files, i.e.
zcat test.warc.gz | warc_reader.py
Kindly tell me how I can modify this code for streaming inputs. Thanks.
warc.open() is a shorthand for warc.WARCFile(), and warc.WARCFile() can receive a fileobj argument, where sys.stdin is exactly a file object. So what you need to do is simply something like this:
import sys
import warc
f = warc.open(fileobj=sys.stdin)
for record in f:
    print record['WARC-Target-URI'], record['Content-Length']
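If you want to run this as an actual Hadoop streaming mapper, the usual convention is to emit tab-separated key/value pairs on stdout. A minimal sketch along those lines follows; the choice of URI as key and Content-Length as value is just an assumption for illustration:
#!/usr/bin/env python
# Minimal streaming-mapper sketch: read WARC records from stdin and emit
# tab-separated (URI, Content-Length) pairs, the usual Hadoop streaming output format.
import sys
import warc

f = warc.open(fileobj=sys.stdin)
for record in f:
    print '%s\t%s' % (record['WARC-Target-URI'], record['Content-Length'])
You can test it locally with zcat test.warc.gz | python warc_reader.py before submitting it as a streaming job.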
But things are a little bit more difficult under Hadoop streaming when your input file is .gz, as Hadoop will replace every \r\n in the WARC file with \n, which breaks the WARC format (refer to this question: hadoop converting \r\n to \n and breaking ARC format). As the warc package uses the regular expression "WARC/(\d+.\d+)\r\n" to match headers (matching exactly \r\n), you will probably get this error:
IOError: Bad version line: 'WARC/1.0\n'
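You can see why from the regex alone. A quick standalone check (an illustration, not the library's own code path) shows the version pattern rejecting the rewritten line:
import re

# The pattern quoted above from the warc package, which matches \r\n exactly.
version_pattern = re.compile(r"WARC/(\d+.\d+)\r\n")
print version_pattern.match("WARC/1.0\r\n")  # matches: original WARC line ending
print version_pattern.match("WARC/1.0\n")    # None: Hadoop has stripped the \r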
So you will either modify your PipeMapper.java file as recommended in the referred question, or write your own parsing script that parses the WARC file line by line (see the sketch at the end of this answer).
BTW, simply modifying warc.py to match \n instead of \r\n in the headers won't work, because it reads the record body as exactly Content-Length bytes and expects two empty lines after that. Since Hadoop has rewritten \r\n as \n, the body it reads no longer matches the recorded Content-Length, which causes another error like:
IOError: Expected '\n', found 'abc\n'
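For completeness, here is a minimal sketch of the second option: a line-by-line parser that only looks at each record's header block and ignores Content-Length entirely, so Hadoop's \r\n-to-\n rewriting does not matter. It assumes that no line inside a record body looks like a bare WARC version line, which is usually, but not always, true:
#!/usr/bin/env python
# Line-by-line WARC header parser for Hadoop streaming (a sketch, not the warc library).
# It never reads bodies by Content-Length, so it survives \r\n being rewritten as \n.
import re
import sys

version_re = re.compile(r'^WARC/\d+\.\d+\s*$')

headers = None  # None while we are inside a record body rather than a header block
for line in sys.stdin:
    line = line.rstrip('\r\n')
    if version_re.match(line):
        headers = {}  # a new record header block starts here
    elif headers is not None:
        if line == '':
            # a blank line ends the header block; emit the fields we care about
            if 'warc-target-uri' in headers:
                print '%s\t%s' % (headers['warc-target-uri'],
                                  headers.get('content-length', ''))
            headers = None
        elif ':' in line:
            key, value = line.split(':', 1)
            headers[key.strip().lower()] = value.strip()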