
Read Snappy Compressed data on HDFS from Hadoop Streaming

I have a folder in HDFS that contains text files compressed with the Snappy codec.

Normally, when reading GZIP-compressed files in a Hadoop Streaming job, decompression happens automatically. However, this does not happen with Snappy-compressed data, so I am unable to process it.

How can I read these files and process them in Hadoop Streaming?

Many thanks in advance.

UPDATE:

If I use the command hadoop fs -text file, it works. The problem only happens when using Hadoop Streaming: the data is not decompressed before being passed to my Python script.

Do you have the Snappy codec configured in core-site.xml, like:

<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.SnappyCodec,org.apache.hadoop.io.compress.BZip2Codec</value>
</property>
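
To check this without reading the file by eye, the property can be parsed with a few lines of Python. This is only a sketch: the helper names are made up for illustration, and it assumes the contents of core-site.xml have already been read into a string (the file's location varies between installations).

```python
import xml.etree.ElementTree as ET

SNAPPY_CODEC = "org.apache.hadoop.io.compress.SnappyCodec"


def configured_codecs(core_site_xml):
    """Return the codec class names listed under io.compression.codecs."""
    root = ET.fromstring(core_site_xml)
    for prop in root.iter("property"):
        name = (prop.findtext("name", default="") or "").strip()
        if name == "io.compression.codecs":
            value = prop.findtext("value", default="") or ""
            # The value is a comma-separated list of class names.
            return [c.strip() for c in value.split(",") if c.strip()]
    return []  # property not set in this file


def snappy_is_configured(core_site_xml):
    """True if SnappyCodec appears in the configured codec list."""
    return SNAPPY_CODEC in configured_codecs(core_site_xml)
```

An empty list only means the property is not set in that particular file; it does not by itself show that Snappy is unavailable on the cluster.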

I think I have an answer to the problem. It would be great if someone could confirm this.

While browsing the Cloudera blog, I found this article explaining the Snappy codec. It says:

One thing to note is that Snappy is intended to be used with a container format, like Sequence Files or Avro Data Files, rather than being used directly on plain text, for example, since the latter is not splittable and can't be processed in parallel using MapReduce.

Therefore, a file compressed in HDFS using the Snappy codec can be read using hadoop fs -text but not in a Hadoop Streaming job (MapReduce).
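
Since hadoop fs -text does decompress the file, one stopgap is to pipe its output into the Python script directly. The sketch below wraps that command with subprocess. The function names are hypothetical, and it runs in a single process on the client, so it is not a replacement for a parallel MapReduce job.

```python
import subprocess


def text_command(hdfs_path):
    """Build the `hadoop fs -text` invocation for one HDFS path."""
    return ["hadoop", "fs", "-text", hdfs_path]


def read_lines(command):
    """Run a command and yield its stdout line by line, newline stripped."""
    with subprocess.Popen(command, stdout=subprocess.PIPE, text=True) as proc:
        for line in proc.stdout:
            yield line.rstrip("\n")
    # Leaving the `with` block waits for the process, so returncode is set.
    if proc.returncode != 0:
        raise RuntimeError(f"command failed with exit code {proc.returncode}")
```

Usage would look like for line in read_lines(text_command(path)), where path is the HDFS location of one Snappy-compressed file.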
