
Hadoop streaming: Chinese text garbled when using -files with Python

I am not sure "garbled" is the correct word for my issue, but my issue is like this: I use hadoop-streaming-0.20.2-cdh3u6.jar and Python to write a MapReduce job. The command is as follows:

hadoop jar hadoop-streaming-0.20.2-cdh3u6.jar \
 -D mapred.reduce.tasks=0 \
 -D mapred.job.priority=VERY_HIGH \
 -files probability,rule_dict,dict \
 -file "naive_bayes.py" \
 -mapper "python naive_bayes.py" \
 -input xxx \
 -output xxx

probability, rule_dict, and dict are directories in my local storage, and the files in them contain Chinese words. When I use Python to read these files (a simplified sketch of the read loop is shown after the fragment below), I get garbled output. A small fragment of the files:

 石块 0.000025
 最后 0.000321
 老面孔 0.000012
 GP17 0.000012
 圈 0.000136
 看个够 0.000062
 布兰 0.000062
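With -files, Hadoop makes each listed path available in the task's working directory, so the mapper opens the dictionaries by relative path. The read loop is roughly this (a simplified sketch, not my exact naive_bayes.py):

 import os

 def load_table(dir_path="probability"):
     """Load "word probability" pairs from every file in a shipped directory."""
     table = {}
     for name in os.listdir(dir_path):
         with open(os.path.join(dir_path, name), encoding="utf-8") as f:
             for line in f:
                 parts = line.split()
                 if len(parts) == 2:
                     table[parts[0]] = float(parts[1])
     return table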

And this is what was read from those files:

PB�;��W�mVKZm����e�2��U�؎��"/�1_�u�8)i~�J�N86�l}[��y8�;%(K��/P��<��F/+����=��X�n�

Is there a way to solve my issue?

I have uploaded the same Python script and the same directories to the mapper machine and run it from the command line; there it runs fine without this issue.

The words in the files are UTF-8 encoded.

I have found the issue's root cause. When a directory is uploaded with -files, Hadoop creates a hidden .file_name.crc checksum file alongside each file in that directory, so when my code iterates over the directory it also processes .file_name.crc, and decoding those binary bytes as UTF-8 words crashes.
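The fix is to skip those hidden checksum files when iterating the directory, for example (same simplified sketch as above):

 import os

 def load_table(dir_path="probability"):
     table = {}
     for name in os.listdir(dir_path):
         # Hadoop creates a binary .<file_name>.crc checksum next to each
         # shipped file; decoding it as UTF-8 fails, so skip dotfiles.
         if name.startswith("."):
             continue
         with open(os.path.join(dir_path, name), encoding="utf-8") as f:
             for line in f:
                 parts = line.split()
                 if len(parts) == 2:
                     table[parts[0]] = float(parts[1])
     return table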
