
Hadoop streaming: Chinese text garbled when using -files with Python

I am not sure "garbled" is the correct word for my issue, but my issue is like this: I use hadoop-streaming-0.20.2-cdh3u6.jar and Python to write a MapReduce job. The command is as follows:

hadoop jar hadoop-streaming-0.20.2-cdh3u6.jar \
 -D mapred.reduce.tasks=0 \
 -D mapred.job.priority=VERY_HIGH \
 -files probability,rule_dict,dict \
 -file "naive_bayes.py" \
 -mapper "python naive_bayes.py" \
 -input xxx \
 -output xxx

probability, rule_dict, and dict are directories in my local storage, and the files in them contain Chinese words. When I use Python to read these files (a simplified sketch of the read loop is shown after the fragment below), I get garbled output. A small fragment of the files:

 石块 0.000025
 最后 0.000321
 老面孔 0.000012
 GP17 0.000012
 圈 0.000136
 看个够 0.000062
 布兰 0.000062
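With -files, Hadoop makes each listed path available in the task's working directory, so the mapper opens the dictionaries by relative path. The read loop is roughly this (a simplified sketch, not my exact naive_bayes.py):

 import os

 def load_table(dir_path="probability"):
     """Load "word probability" pairs from every file in a shipped directory."""
     table = {}
     for name in os.listdir(dir_path):
         with open(os.path.join(dir_path, name), encoding="utf-8") as f:
             for line in f:
                 parts = line.split()
                 if len(parts) == 2:
                     table[parts[0]] = float(parts[1])
     return table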

And this is what was read from those files:

PB�;��W�mVKZm����e�2��U�؎��"/�1_�u�8)i~�J�N86�l}[��y8�;%(K��/P��<��F/+����=��X�n�

Is there a way to solve my issue?

I have uploaded the same Python script and the same directories to the mapper machine and run it from the command line; there it runs fine without this issue.

The words in the files are UTF-8 encoded.

I have found the issue's root cause. When a directory is uploaded with -files, Hadoop creates a hidden .file_name.crc checksum file alongside each file in that directory, so when my code iterates over the directory it also processes .file_name.crc, and decoding those binary bytes as UTF-8 words crashes.
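The fix is to skip those hidden checksum files when iterating the directory, for example (same simplified sketch as above):

 import os

 def load_table(dir_path="probability"):
     table = {}
     for name in os.listdir(dir_path):
         # Hadoop creates a binary .<file_name>.crc checksum next to each
         # shipped file; decoding it as UTF-8 fails, so skip dotfiles.
         if name.startswith("."):
             continue
         with open(os.path.join(dir_path, name), encoding="utf-8") as f:
             for line in f:
                 parts = line.split()
                 if len(parts) == 2:
                     table[parts[0]] = float(parts[1])
     return table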
