I know how to create a hive udf with transform
and using
, but I can't use sklearn
because not all the node in hive cluster has sklearn
.
I have an anaconda2.tar.gz
with sklearn
, What should I do ?
I recently started looking into this approach and I feel like the problem is not about to get all the 'hive nodes' having sklearn on them (as you mentioned above), I feel like it is rather a compatibility issue than 'sklearn node availability' one. I think sklearn is not (yet) designed to run as a parallel algorithm such that large amount of data can be processed in a short time.
What I'm trying to do, as an approach, is to communicate python to 'hive' through 'pyhive' (for example) and implement the necessary sklearn libraries/calls within that code. The rough assumption here that this 'sklearn-hive-python' code will run in each node and deal with the data at the 'map-reduce' level. I cannot say this is the right solution or correct approach (yet) but this is what I can conclude after searching for sometime.
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.