
How to write a streaming MapReduce job for WARC files in Python

I am trying to write a MapReduce job for WARC files using the warc library for Python. The following code works for me, but I need it to run as a Hadoop MapReduce job.

import warc

# iterate over the archive and print each record's target URI and size
f = warc.open("test.warc.gz")
for record in f:
    print record['WARC-Target-URI'], record['Content-Length']

I want this code to read streaming input from WARC files, i.e.

zcat test.warc.gz | warc_reader.py

Kindly tell me how I can modify this code for streaming inputs. Thanks.

warc.open() is a shorthand for warc.WARCFile(), and warc.WARCFile() can receive a fileobj argument, where sys.stdin is exactly a file object. So what you need to do is simply something like this:

import sys
import warc

# read WARC records from the stream on stdin instead of a file on disk
f = warc.open(fileobj=sys.stdin)
for record in f:
    print record['WARC-Target-URI'], record['Content-Length']
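
For completeness, under Hadoop streaming this script would be wired in as the mapper, roughly like the command below; the streaming jar path and the HDFS input/output paths are placeholders that vary by installation:

hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -input /path/to/warc/input \
    -output /path/to/output \
    -mapper warc_reader.py \
    -file warc_reader.py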

But things are a little bit more difficult under Hadoop streaming when your input file is .gz, as Hadoop will replace every \r\n in the WARC file with \n, which breaks the WARC format (refer to this question: hadoop converting \r\n to \n and breaking ARC format). As the warc package uses a regular expression "WARC/(\d+.\d+)\r\n" to match headers (matching exactly \r\n), you will probably get this error:

IOError: Bad version line: 'WARC/1.0\n'
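
To see the mismatch concretely, here is a small illustration; the pattern below mirrors the one quoted above and is a simplified stand-in, not the exact source of warc.py:

import re

# simplified stand-in for the version-line pattern quoted above
version_re = re.compile(r"WARC/(\d+\.\d+)\r\n")

print version_re.match("WARC/1.0\r\n")   # matches: a well-formed header line
print version_re.match("WARC/1.0\n")     # None: the line hadoop hands to the mapper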

So you will either modify your PipeMapper.java file as recommended in the referenced question, or write your own parsing script which parses the WARC file line by line; a sketch of the latter follows.
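
Here is a minimal sketch of such a line-by-line parser. It only extracts the two header fields the original code prints, tolerates the \n line endings Hadoop leaves behind, and ignores payload edge cases (e.g. a payload line that happens to look like a header), so treat it as a starting point rather than a complete WARC reader:

import sys

uri = length = None
for line in sys.stdin:
    line = line.rstrip('\r\n')                # accept both \r\n and hadoop's rewritten \n
    if line.startswith('WARC/'):              # a new record header block begins
        uri = length = None
    elif line.startswith('WARC-Target-URI:'):
        uri = line.split(':', 1)[1].strip()
    elif line.startswith('Content-Length:'):
        length = line.split(':', 1)[1].strip()
    if uri is not None and length is not None:
        print uri, length                     # same output as the original script
        uri = length = None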

BTW, simply modifying warc.py to use \n instead of \r\n when matching headers won't work, because it reads exactly Content-Length bytes of content and then expects two empty lines after that. Therefore what Hadoop does will definitely make the length of the content mismatch the Content-Length attribute and cause another error like:

IOError: Expected '\n', found 'abc\n'
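
A quick illustration of why the byte count drifts (the payload below is made up):

payload = "abc\r\ndef\r\n"                   # Content-Length: 10 as recorded in the header
mangled = payload.replace('\r\n', '\n')      # what hadoop streaming hands to the mapper
print len(payload), len(mangled)             # 10 8 -- every \r\n lost shifts the read window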
