
MapReduce carriage return

I want to process CommonCrawl WARC files in MapReduce, reading the input over s3a.

The problem is that the carriage return character at the end of each input line is removed and a tab is written instead (tab being the default delimiter).

Why does this happen?

This is the command I use to start the MapReduce job:

time yarn jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar \
  -D mapred.compress.map.output=true \
  -D mapred.reduce.tasks=0 \
  -D mapred.job.name=cc \
  -D fs.s3a.aws.credentials.provider=org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider \
  -files mapper.py \
  -archives wasbs://cluster@ccscsg.blob.core.windows.net/user/ubuntu/virtualenv/.venv2.zip#venv \
  -mapper mapper.py \
  -input s3a://commoncrawl/crawl-data/CC-MAIN-2018-39/segments/1537267155413.17/warc/CC-MAIN-20180918130631-20180918150631-00000.warc.gz \
  -output /output_warc

mapper.py:

#!./venv/bin/python
import sys
for line in sys.stdin:
    sys.stdout.write(line)

You could set -D mapreduce.output.textoutputformat.separator=$'\r'. But this will add a \r to every line, even if there wasn't one in the input.

A MapReduce job expects the mapper output to be a key/value pair, and the separator written between key and value in the output is set by `mapreduce.output.textoutputformat.separator` (the tab character is the default). In Hadoop streaming, a mapper output line that contains no tab is taken entirely as the key with an empty value, so the separator ends up at the end of each line.
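The key/value handling described above can be imitated with a few lines of Python. This is only a simplified model and not Hadoop's actual code: the function names are made up for illustration, and it assumes that the framework strips the line terminator (including a trailing carriage return) before splitting on the first tab.

```python
def split_key_value(line, field_separator="\t"):
    """Model how one mapper output line becomes a (key, value) pair."""
    # Assumption: trailing CR/LF characters are dropped with the line terminator.
    stripped = line.rstrip("\r\n")
    # Everything before the first tab is the key, the rest is the value.
    # Without a tab, the whole line is the key and the value is empty.
    key, _, value = stripped.partition(field_separator)
    return key, value


def format_record(key, value, output_separator="\t"):
    """Model the text output: key, separator, value, newline."""
    return key + output_separator + value + "\n"


def simulate(line, output_separator="\t"):
    """Pass one line through the modelled split and output steps."""
    key, value = split_key_value(line)
    return format_record(key, value, output_separator)
```

Under these assumptions, `simulate("WARC/1.0\r\n")` gives `"WARC/1.0\t\n"`, which matches the behaviour in the question, while `simulate("WARC/1.0\r\n", "\r")` gives back `"WARC/1.0\r\n"`. A line that never had a carriage return, such as `"abc\n"`, also comes out as `"abc\r\n"` with that separator.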

Btw., WARC files are not text files: there is binary payload (PDFs, images) and the HTML has no fixed content encoding. You may want to use a WARC parsing library (e.g., warcio) or simply use cc-mrjob or cc-pyspark to do the processing.
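As a rough sketch of the library route, the snippet below iterates over WARC records with warcio and keeps the payload as bytes. It assumes warcio is installed (it is a third-party package) and that the WARC file has already been downloaded to a local path; `summarize_responses` and `summarize_file` are hypothetical helper names. It is not a drop-in streaming mapper, because it reads the file itself rather than lines from stdin.

```python
def summarize_responses(records):
    """Yield (target URI, payload size in bytes) for each WARC response record."""
    for record in records:
        if record.rec_type != "response":
            continue  # skip request, metadata and warcinfo records
        uri = record.rec_headers.get_header("WARC-Target-URI")
        payload = record.content_stream().read()  # bytes, not decoded to text
        yield uri, len(payload)


def summarize_file(path):
    """Read a local WARC file (the path is an assumption) and list its responses."""
    # Imported here so the helper above stays usable without warcio installed.
    from warcio.archiveiterator import ArchiveIterator

    with open(path, "rb") as stream:
        return list(summarize_responses(ArchiveIterator(stream)))
```

Calling `summarize_file` on a local copy of the `.warc.gz` file from the question would return one `(uri, size)` tuple per response record; warcio handles the gzip compression of the records itself.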
