
How to create a UDF for Hive using Python with a 3rd-party package like sklearn?

I know how to create a Hive UDF with TRANSFORM and USING, but I can't use sklearn because not every node in the Hive cluster has sklearn installed.
I have an anaconda2.tar.gz that contains sklearn. What should I do?
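
For reference, the TRANSFORM/USING pattern works by streaming rows through an external script over stdin/stdout, one tab-separated row per line. Below is a minimal sketch; the script, table, and column names are hypothetical, and the ADD ARCHIVE line is only a hedged suggestion for shipping a bundled environment such as the anaconda2.tar.gz mentioned above.

```python
#!/usr/bin/env python
# streaming_udf.py -- a minimal Hive TRANSFORM script (hypothetical name).
# Hive pipes each input row to stdin as tab-separated text and reads
# tab-separated output rows back from stdout.
import sys

for line in sys.stdin:
    cols = line.rstrip('\n').split('\t')
    # Per-row logic goes here; as a placeholder, append the column count.
    cols.append(str(len(cols)))
    print('\t'.join(cols))

# Invoked from Hive roughly like this:
#   ADD FILE streaming_udf.py;
#   SELECT TRANSFORM (col1, col2)
#     USING 'python streaming_udf.py'
#     AS (col1, col2, n_cols)
#   FROM some_table;
#
# If a bundled environment is shipped with ADD ARCHIVE (e.g.
# ADD ARCHIVE anaconda2.tar.gz;), the USING clause can point at the
# interpreter inside the unpacked archive instead of the system python.
```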

I recently started looking into this approach, and I feel the problem is not really about getting all the Hive nodes to have sklearn on them (as you mentioned above); it is a compatibility issue rather than a 'sklearn node availability' one. I think sklearn is not (yet) designed to run as a parallel algorithm, so it cannot process a large amount of data across a cluster in a short time.


What I'm trying as an approach is to connect Python to Hive through PyHive (for example) and implement the necessary sklearn libraries/calls within that code. The rough assumption here is that this 'sklearn-hive-python' code will run on each node and deal with the data at the map-reduce level. I can't say (yet) that this is the right solution or correct approach, but it is what I can conclude after searching for some time.
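
A minimal sketch of that idea, assuming PyHive and pandas are installed; the host, table, and column names are hypothetical placeholders. Note that, as mentioned above, sklearn itself does not parallelize across the cluster, so the fit runs wherever this script is executed.

```python
# Pull data out of Hive via PyHive, then fit a scikit-learn model on it.
import pandas as pd
from pyhive import hive
from sklearn.linear_model import LogisticRegression

# Connect to HiveServer2 (hypothetical host/credentials).
conn = hive.Connection(host='hive-server.example.com', port=10000,
                       username='analyst')

# Run the query and load the result into a pandas DataFrame.
df = pd.read_sql('SELECT feature1, feature2, label FROM training_table', conn)

# Fit the model locally; the Hive cluster only serves the data.
model = LogisticRegression()
model.fit(df[['feature1', 'feature2']], df['label'])
print(model.score(df[['feature1', 'feature2']], df['label']))
```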
