

Create custom writable key/value type in python for Hadoop Map Reduce?

I have worked on Hadoop MR for quite some time, and I have created and used custom (extended) Writable classes, including MapWritable . Now I am required to translate an MR job that I wrote in Java to Python. I do not have experience with Python and am currently exploring the available libraries, looking into options such as Pydoop and Mrjob . However, I want to know whether these libraries offer a way to create similar custom Writable classes, and if so, how to create them. If not, what alternatives exist for doing the same thing?

In Pydoop, explicit support for custom Hadoop types is still WIP . In other words, right now we're not making things easy for the user, but it can be done with a bit of work. A couple of pointers:

  • Pydoop already includes custom Java code, auto-installed together with the Python package as pydoop.jar . We pass this extra jar to Hadoop as needed. Adding more Java code is a matter of placing the source under src/ and listing it in JavaLib.java_files in setup.py .

  • On the Python side, you need deserializers for the new types. See, for instance, LongWritableDeserializer in pydoop.mapreduce.pipes ; there is a sketch of this idea right after this list.
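
Here is a minimal, hedged sketch of such a deserializer for a hypothetical custom Writable that carries a pair of longs. The class name, the plain deserialize method, and the wire layout (two big-endian 64-bit values, i.e. two LongWritable-style fields) are illustrative assumptions only; check LongWritableDeserializer in pydoop.mapreduce.pipes for the interface your Pydoop release actually expects and for how deserializers are wired into the pipes runner.

    import struct

    class LongPairWritableDeserializer(object):
        """Hypothetical deserializer for a custom Java Writable whose
        write() emits two longs, each laid out like LongWritable
        (8 bytes, big-endian).  Modeled loosely on the built-in
        LongWritableDeserializer; the exact base class and registration
        hook may differ between Pydoop releases, so verify against
        your installed version."""

        def deserialize(self, raw_bytes):
            # ">qq" = two big-endian signed 64-bit integers
            first, second = struct.unpack(">qq", raw_bytes)
            return first, second

    # Example: the bytes a Java LongPairWritable-style type would emit for (1, 2)
    raw = struct.pack(">qq", 1, 2)
    print(LongPairWritableDeserializer().deserialize(raw))  # -> (1, 2)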

Hope this helps.
