
How to write a streaming MapReduce job for WARC files in Python

I am trying to write a MapReduce job for WARC files using the warc library for Python. The following code works for me, but I need it to run as a Hadoop MapReduce job.

import warc

# iterate over the archive and print each record's target URI and size
f = warc.open("test.warc.gz")
for record in f:
    print record['WARC-Target-URI'], record['Content-Length']

I want this code to read streaming input from WARC files, i.e.

zcat test.warc.gz | warc_reader.py

Kindly tell me how I can modify this code for streaming inputs. Thanks.

warc.open() is a shorthand for warc.WARCFile(), and warc.WARCFile() can receive a fileobj argument, where sys.stdin is exactly a file object. So what you need to do is simply something like this:

import sys
import warc

# read WARC records from the stream on stdin instead of a file on disk
f = warc.open(fileobj=sys.stdin)
for record in f:
    print record['WARC-Target-URI'], record['Content-Length']
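
For completeness, under Hadoop streaming this script would be wired in as the mapper, roughly like the command below; the streaming jar path and the HDFS input/output paths are placeholders that vary by installation:

hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -input /path/to/warc/input \
    -output /path/to/output \
    -mapper warc_reader.py \
    -file warc_reader.py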

But things are a little bit more difficult under Hadoop streaming when your input file is .gz, as Hadoop will replace every \r\n in the WARC file with \n, which breaks the WARC format (refer to this question: hadoop converting \r\n to \n and breaking ARC format). As the warc package uses a regular expression "WARC/(\d+.\d+)\r\n" to match headers (matching exactly \r\n), you will probably get this error:

IOError: Bad version line: 'WARC/1.0\n'
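
To see the mismatch concretely, here is a small illustration; the pattern below mirrors the one quoted above and is a simplified stand-in, not the exact source of warc.py:

import re

# simplified stand-in for the version-line pattern quoted above
version_re = re.compile(r"WARC/(\d+\.\d+)\r\n")

print version_re.match("WARC/1.0\r\n")   # matches: a well-formed header line
print version_re.match("WARC/1.0\n")     # None: the line hadoop hands to the mapper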

So you will either modify your PipeMapper.java file as recommended in the referenced question, or write your own parsing script which parses the WARC file line by line; a sketch of the latter follows.
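
Here is a minimal sketch of such a line-by-line parser. It only extracts the two header fields the original code prints, tolerates the \n line endings Hadoop leaves behind, and ignores payload edge cases (e.g. a payload line that happens to look like a header), so treat it as a starting point rather than a complete WARC reader:

import sys

uri = length = None
for line in sys.stdin:
    line = line.rstrip('\r\n')                # accept both \r\n and hadoop's rewritten \n
    if line.startswith('WARC/'):              # a new record header block begins
        uri = length = None
    elif line.startswith('WARC-Target-URI:'):
        uri = line.split(':', 1)[1].strip()
    elif line.startswith('Content-Length:'):
        length = line.split(':', 1)[1].strip()
    if uri is not None and length is not None:
        print uri, length                     # same output as the original script
        uri = length = None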

BTW, simply modifying warc.py to use \n instead of \r\n when matching headers won't work, because it reads exactly Content-Length bytes of content and then expects two empty lines after that. Therefore what Hadoop does will definitely make the length of the content mismatch the Content-Length attribute and cause another error like:

IOError: Expected '\n', found 'abc\n'
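
A quick illustration of why the byte count drifts (the payload below is made up):

payload = "abc\r\ndef\r\n"                   # Content-Length: 10 as recorded in the header
mangled = payload.replace('\r\n', '\n')      # what hadoop streaming hands to the mapper
print len(payload), len(mangled)             # 10 8 -- every \r\n lost shifts the read window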
