
Python: Multiprocessing Hive queries

I am trying to execute multiple Hive queries in parallel by passing table names to `Pool.map`, i.e.

from pyhive import hive
from multiprocessing import Pool
from functools import partial
import pandas as pd

conn = hive.connect('hive_connection',99999,
                          username='user',
                          password='password',
                          auth='LDAP')

def hivetable(hive_table_name):
    query = 'select * from hive_db.{hive_table_name} limit 10'.format(table_name=hive_table_name)
    result = pd.read_sql(query,conn)
    return result

if __name__ == "__main__" :
    p = Pool(5)
    print p.map(((hivetable, ['hive_table1','hive_table2','hive_table3'])))

but getting:

TypeError: map() takes at least 3 arguments (2 given)

How can I achieve multiprocessing here and resolve this error? I tried other references but couldn't find one covering the SQL case.

Any help/ suggestion is highly appreciated.

The problem is the extra parentheses in your call to `map`.

Try this and it should work:

if __name__ == "__main__" :
    p = Pool(5)
    print(p.map(hivetable, ['hive_table1','hive_table2','hive_table3']))

I'm assuming the actual number of tables you want to process is greater than 3; otherwise it doesn't make sense to create a pool of 5 worker processes for only 3 tables.
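Note that there are two further bugs in your code that will surface once the `map` call is fixed: the keyword passed to `.format()` (`table_name=`) does not match the placeholder in the template (`{hive_table_name}`), which raises a `KeyError`; and a connection object created in the parent process generally cannot be shared safely with pool workers, so each worker should open its own connection. Here is a minimal sketch of the corrected structure; the connection parameters are the placeholders from your question, and the `pd.read_sql` call is left commented out since it requires a live Hive server:

```python
from multiprocessing import Pool

def build_query(hive_table_name):
    # The keyword passed to .format() must match the placeholder name
    # ({hive_table_name}), otherwise a KeyError is raised.
    return 'select * from hive_db.{hive_table_name} limit 10'.format(
        hive_table_name=hive_table_name)

def hivetable(hive_table_name):
    # Open the connection *inside* the worker process; a pyhive connection
    # created in the parent cannot be shared across processes.
    # Hypothetical parameters taken from the question:
    # conn = hive.connect('hive_connection', 10000, username='user',
    #                     password='password', auth='LDAP')
    # return pd.read_sql(build_query(hive_table_name), conn)
    return build_query(hive_table_name)

if __name__ == "__main__":
    with Pool(5) as p:
        # map(func, iterable): the function and the iterable are two
        # separate arguments, not one tuple.
        results = p.map(hivetable, ['hive_table1', 'hive_table2', 'hive_table3'])
        print(results)
```

Each worker builds (and, with the commented lines enabled, executes) its own query, and `Pool.map` returns the results in the same order as the input table names.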
