Hadoop streaming: Chinese text garbled when passing directories with -files (Python)
I am not sure "garbled" is the right word for my issue, but here it is. I use hadoop-streaming-0.20.2-cdh3u6.jar and Python to write a MapReduce job. The command is like below:
hadoop jar hadoop-streaming-0.20.2-cdh3u6.jar \
-D mapred.reduce.tasks=0 \
-D mapred.job.priority=VERY_HIGH \
-files probability,rule_dict,dict \
-file "naive_bayes.py" \
-mapper "python naive_bayes.py" \
-input xxx \
-output xxx
probability, rule_dict, and dict are directories in my local storage, and the files inside them contain Chinese words. When I read these files with Python, I get badly garbled output. A small fragment of one file:
石块 0.000025
最后 0.000321
老面孔 0.000012
GP17 0.000012
圈 0.000136
看个够 0.000062
布兰 0.000062
and what was read from these files was:
PB�;��W�mVKZm����e�2��U�؎��"/�1_�u�8)i~�J�N86�l}[��y8�;%(K��/P��<��F/+����=��X�n�
Is there a way to solve my issue?
I have uploaded the same Python script and the same directories to the mapper machine and run the script there from the command line; it runs fine without this issue.
The words in the files are UTF-8 encoded.
I have found the issue's root cause. When a directory is uploaded with -files, Hadoop creates a hidden .file_name.crc checksum file next to each file in that directory. My code iterates over every file in the directory, so it also processed the .file_name.crc files, and decoding those binary checksum files as text produced the garbage and crashed.
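The fix that follows from this is to skip the checksum companions when iterating the directory. A minimal sketch (the function names and the `load_probabilities` line format are my assumptions, not taken from the original naive_bayes.py):

```python
import os


def iter_dict_files(dirname):
    """Yield data files from a directory shipped via -files, skipping
    Hadoop's hidden .<name>.crc checksum files, which are binary and
    will not decode as UTF-8."""
    for name in os.listdir(dirname):
        if name.startswith('.') and name.endswith('.crc'):
            continue  # checksum companion written by Hadoop's LocalFileSystem
        yield os.path.join(dirname, name)


def load_probabilities(dirname):
    """Read 'word probability' lines (assumed format) from all data files."""
    probs = {}
    for path in iter_dict_files(dirname):
        with open(path, 'rb') as f:        # read bytes, decode explicitly
            for line in f:
                word, value = line.decode('utf-8').split()
                probs[word] = float(value)
    return probs
```

Opening the files in binary mode and decoding explicitly also makes the failure obvious: a stray binary file raises a `UnicodeDecodeError` at the decode call instead of silently emitting garbage.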