
Create custom Writable key/value type in Python for Hadoop MapReduce?

I have worked with Hadoop MR for quite some time and have created and used custom (extended) Writable classes, including MapWritable. Now I need to translate an MR job I wrote in Java to Python. I have no experience with Python and am currently exploring the available libraries, looking at options such as Pydoop and mrjob. However, I want to know whether these libraries support creating similar custom Writable classes, and how to create them. If not, what alternatives exist for doing the same?

In Pydoop, explicit support for custom Hadoop types is still a work in progress. In other words, right now we're not making things easy for the user, but it can be done with a bit of work. A couple of pointers:

  • Pydoop already includes custom Java code, auto-installed together with the Python package as pydoop.jar. We pass this extra jar to Hadoop as needed. Adding more Java code is a matter of placing the source under src/ and listing it in JavaLib.java_files in setup.py (see the first sketch after this list).

  • On the Python side, you need deserializers for the new types. See, for instance, LongWritableDeserializer in pydoop.mapreduce.pipes (a minimal example of the same pattern is sketched after this list).
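To make the first pointer concrete, here is a rough sketch of what the setup.py change might look like. The name JavaLib and its java_files attribute come straight from the answer above; every path in the list, and the surrounding class structure, are illustrative placeholders, so check the actual Pydoop source tree for the real layout.

# Hypothetical fragment of Pydoop's setup.py. Only JavaLib.java_files
# is taken from the answer; the paths below are made-up examples.
class JavaLib(object):
    def __init__(self):
        self.java_files = [
            # ... the Java sources Pydoop already ships with ...
            "src/your/package/MyCustomWritable.java",  # your new type
        ]

With the source listed here, it gets compiled into pydoop.jar at install time and shipped to Hadoop along with the rest of Pydoop's Java code, per the first bullet.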
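For the second pointer, below is a minimal Python sketch of a deserializer for a hypothetical custom type, modeled on the pattern the answer attributes to LongWritableDeserializer. The class name, the deserialize method, and the wire format (two big-endian doubles) are all assumptions for illustration; the real interface to conform to is whatever pydoop.mapreduce.pipes defines.

import io
import struct

# Sketch of a deserializer for a hypothetical PointWritable whose Java
# write() emits two big-endian doubles (Hadoop Writables use network
# byte order). Method name and stream interface are assumptions; mirror
# what LongWritableDeserializer does in pydoop.mapreduce.pipes.
class PointWritableDeserializer(object):
    def deserialize(self, stream):
        raw = stream.read(16)  # two 8-byte IEEE 754 doubles
        x, y = struct.unpack(">dd", raw)  # ">" = big-endian
        return (x, y)

# Quick self-test with an in-memory stream standing in for the real input:
buf = io.BytesIO(struct.pack(">dd", 1.5, -2.0))
print(PointWritableDeserializer().deserialize(buf))  # (1.5, -2.0)

The key design point is that the Python side must read exactly the byte layout that the Java Writable's write() method produces, in the same order and byte order.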

Hope this helps.

