Hive queries via Python client

I have Hive 0.8 installed on a Hadoop cluster running in AWS EMR.

I am trying to do some data QA, which involves running a Hive query and fetching the results into Python, where some more logic is contained.

Currently, this is achieved by sending a Hive query as a jobflow step, dumping the results to local storage on the master node, SCP-ing them to my local machine, and then loading the file with Python and parsing the results. All in all, not a very fun process.

Ideally, I would be able to do this in a fashion similar to:

conn = hive.connect(ip, port, user, pw)
cursor = conn.cursor()
cursor.execute(query)
rs = cursor.fetchall()

It seems that this is supposedly possible. Hive says that it supports it here. There is also another SO question that looks like it's doing what I'd like to do.

However, I'm having trouble finding documentation. In particular, I haven't been able to figure out where to obtain the packages used in these examples. It would be immensely helpful if anyone could provide detailed instructions on how to get the Python client working; failing that, it would help just to know where to obtain these packages.

Looks like the hive_utils package has what you're looking for. Looking at the PyPI page, you can run queries in the following way:

import hive_utils

# `config` is assumed to be a dict holding your connection settings.
query = """
    SELECT country, count(1) AS cnt
    FROM User
    GROUP BY country
"""
hive_client = hive_utils.HiveClient(
    server=config['HOST'],
    port=config['PORT'],
    db=config['NAME'],
)
# Each row is accessed by column name.
for row in hive_client.execute(query):
    print('%s: %s' % (row['country'], row['cnt']))

Installing that should also install the needed thrift packages.

If you build Hive from source, the modules will be located here (relative to the hive-trunk directory):

./build/dist/lib/py

You should be able to access the modules if you include that path in your PYTHONPATH environment variable, or if you add it to the Python path from within your script using the sys module.
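As a rough sketch of the second option, something like the following should work. The helper name add_hive_py_path and the checkout location /path/to/hive-trunk are placeholders I made up for illustration; substitute wherever your hive-trunk directory actually lives.

```python
import os
import sys


def add_hive_py_path(hive_root):
    """Append <hive_root>/build/dist/lib/py to sys.path if it is not already there."""
    lib_path = os.path.join(hive_root, "build", "dist", "lib", "py")
    if lib_path not in sys.path:
        sys.path.append(lib_path)
    return lib_path


# Hypothetical checkout location; replace with your own hive-trunk directory.
add_hive_py_path("/path/to/hive-trunk")
```

Call this before importing any of the Hive modules. Setting PYTHONPATH in the shell has the same effect without touching the script.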

Also note that there is no longer a module named 'hive'. In the example code you linked, 'hive' should be replaced with 'hive_service'.
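For reference, here is a minimal sketch of what a query through hive_service might look like. It assumes the Thrift-based Hive server is listening on the host and port you pass in, that the hive_service and thrift modules are importable, and that fetchAll() returns each row as a single tab-delimited string (which is my understanding of the original Hive Thrift service; check it against your version). The function names run_hive_query and split_rows are my own, not part of Hive.

```python
def split_rows(raw_rows, delimiter="\t"):
    """Split each raw row string into a list of column values."""
    return [row.split(delimiter) for row in raw_rows]


def run_hive_query(host, port, query):
    """Run one query against a Hive Thrift server and return the rows as lists."""
    # Imported here so this file still loads on machines where the
    # Hive/Thrift Python modules are not installed.
    from thrift.transport import TSocket, TTransport
    from thrift.protocol import TBinaryProtocol
    from hive_service import ThriftHive

    transport = TTransport.TBufferedTransport(TSocket.TSocket(host, port))
    client = ThriftHive.Client(TBinaryProtocol.TBinaryProtocol(transport))
    transport.open()
    try:
        client.execute(query)
        # Assumption: each element is one row with tab-separated columns.
        return split_rows(client.fetchAll())
    finally:
        transport.close()
```

You would then call something like run_hive_query(master_ip, 10000, query) and get back a list of rows, where master_ip is your master node's address and 10000 is only a commonly used port, not something I can confirm for your cluster.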
