
How to read a Hadoop map file using Python?

I have a map file that is block compressed using DefaultCodec. The map file is created by a Java application like this:

MapFile.Writer writer =
            new MapFile.Writer(conf, path,
                    MapFile.Writer.keyClass(IntWritable.class),
                    MapFile.Writer.valueClass(BytesWritable.class),
                    MapFile.Writer.compression(SequenceFile.CompressionType.BLOCK, new DefaultCodec()));

This file is stored in HDFS, and I need to read some key/value pairs from it in another application written in Python. I can't find any library that can do that. Do you have any suggestions or examples?

Thanks

I would suggest using Spark, which has a function called textFile() that can read files from HDFS and turn them into RDDs for further processing with other Spark libraries.

Here's the documentation: PySpark
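
One caveat worth keeping in mind: textFile() reads line-oriented text, whereas a MapFile is a directory holding two SequenceFiles named data and index. A minimal sketch of reading the data part with PySpark's sequenceFile() follows. It assumes an existing SparkContext passed in as sc, and the helper names mapfile_data_path and read_mapfile_pairs are hypothetical, introduced only for illustration. The key and value class names are taken from the writer in the question.

```python
import posixpath


def mapfile_data_path(mapfile_dir):
    """Return the path of the 'data' SequenceFile inside a MapFile directory."""
    return posixpath.join(mapfile_dir, "data")


def read_mapfile_pairs(sc, mapfile_dir):
    """Return an RDD of (key, value) pairs read from the MapFile's data file.

    `sc` is assumed to be an existing pyspark SparkContext. The class names
    match the writer in the question (IntWritable keys, BytesWritable values).
    """
    return sc.sequenceFile(
        mapfile_data_path(mapfile_dir),
        keyClass="org.apache.hadoop.io.IntWritable",
        valueClass="org.apache.hadoop.io.BytesWritable",
    )
```

With a live context, read_mapfile_pairs(sc, '/hdfs/path/to/file').lookup(42) would return the values stored under key 42 (the key is just an example). This approach scans the data file rather than using the MapFile index, so it suits bulk reads better than single-key lookups.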

Create a reader as follows:

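# Assumes the third-party python-hadoop package; import paths are assumed from its examples.
from hadoop.io import MapFile
from hadoop.io.IntWritable import LongWritable

path = '/hdfs/path/to/file'
# The Writable types must match the classes the file was written with.
key = LongWritable()
value = LongWritable()
reader = MapFile.Reader(path)
while reader.next(key, value):
    print("%s %s" % (key, value))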

Check out these hadoop.io.MapFile Python examples

And the available methods in MapFile.py
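
Whichever reader is used, the key and value types handed to it have to match what the file records. As a hedged illustration, the sketch below parses the first fields of a SequenceFile header (the data file inside a MapFile directory is a SequenceFile) to show the key class, value class, compression flags, and codec. It handles only header version 6 and class names shorter than 128 bytes. The function names are hypothetical and not part of any Hadoop library.

```python
def _read_short_string(stream):
    """Read a string written by Hadoop's Text.writeString (vint length + UTF-8).

    Only single-byte lengths (0..127) are handled, which is enough for the
    class names in a SequenceFile header.
    """
    length = stream.read(1)[0]
    if length > 127:
        raise ValueError("multi-byte vint length not handled in this sketch")
    return stream.read(length).decode("utf-8")


def read_sequencefile_header(stream):
    """Parse the leading fields of a version 6 SequenceFile header."""
    magic = stream.read(4)
    if len(magic) != 4 or magic[:3] != b"SEQ":
        raise ValueError("not a SequenceFile")
    if magic[3] != 6:
        raise ValueError("only header version 6 is handled in this sketch")
    key_class = _read_short_string(stream)
    value_class = _read_short_string(stream)
    # Two boolean bytes: record-or-block compression on, then block mode.
    compressed = stream.read(1) != b"\x00"
    block_compressed = stream.read(1) != b"\x00"
    # The codec class name is present only when compression is enabled.
    codec = _read_short_string(stream) if compressed else None
    return {
        "key_class": key_class,
        "value_class": value_class,
        "compressed": compressed,
        "block_compressed": block_compressed,
        "codec": codec,
    }
```

For a local copy of the data file, opening it with open(local_path, "rb") and passing the file object to read_sequencefile_header should, for the writer in the question, report IntWritable keys, BytesWritable values, block compression, and DefaultCodec.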
