How to Connect to Hive via pyhive from Windows

I've been racking my brain for the past couple of days trying to connect to a Hive server from a Python client using pyhive on Windows. I'm new to Hive (and to pyhive, for that matter), but I'm a reasonably experienced Python dev. Invariably I get the following error:

(pyhive-test) C:\dev\sandbox\pyhive-test>python test.py
Traceback (most recent call last):
  File "test.py", line 3, in <module>
    conn = hive.Connection(host='192.168.1.196', port='10000', database='default', auth='NONE')
  File "C:\Users\harnerd\Anaconda3\envs\pyhive-test\lib\site-packages\pyhive\hive.py", line 192, in __init__
    self._transport.open()
  File "C:\Users\harnerd\Anaconda3\envs\pyhive-test\lib\site-packages\thrift_sasl\__init__.py", line 84, in open
    raise TTransportException(type=TTransportException.NOT_OPEN,
thrift.transport.TTransport.TTransportException: Could not start SASL: b'Error in sasl_client_start (-4) SASL(-4): no mechanism available: Unable to find a callback: 2'

when executing the following script:

from pyhive import hive

conn = hive.Connection(host='192.168.1.196', port='10000', database='default', auth='NONE')
cur = conn.cursor()
cur.execute('show tables')
data = cur.fetchall()
print(data)

The HiveServer2 instance is an out-of-the-box HDP Sandbox VM from Cloudera with HiveServer2 Authentication set to 'None'.

The client is an Anaconda virtual environment on Windows 10 with Python 3.8.5 and the following packages installed by conda:

  • pyhive 0.6.1
  • sasl 0.2.1
  • thrift 0.13.0
  • thrift-sasl 0.4.2

Right now I'm merely trying to connect to Hive with the script above, but ultimately I intend to use pyhive within SQLAlchemy in a Flask application. In other words: Flask -> Flask-SQLAlchemy -> SQLAlchemy -> pyhive. In production the Flask app will be hosted by Cloudera Data Science Workbench (i.e. some flavor of Linux), but it will be developed on Windows systems and therefore must also run there.

Of course I've looked at the many questions here, on Cloudera's site, and on GitHub relating to Hive connection problems. If someone put a gun to my head, I would have to say that trying this from a Windows client may be part of the problem, as that doesn't seem to be a very common thing to do.

No mechanism available

What does that error even mean? It sure would be nice if there were some documentation on how to configure and use SASL from Python. If there is, I would like to know about it.

FWIW, the line causing the error is in thrift_sasl/__init__.py:

ret, chosen_mech, initial_response = self.sasl.start(self.mechanism)

self.mechanism is 'PLAIN'; chosen_mech and initial_response are empty strings (''). ret is False, which causes the exception to be thrown.
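For context, here is a rough sketch of what a non-empty initial_response for the PLAIN mechanism generally looks like. It follows the message layout in RFC 4616 (authorization id, NUL, username, NUL, password) and is not thrift_sasl's or the sasl library's own code; the function name plain_initial_response and the credentials are made up for illustration.

```python
def plain_initial_response(username: str, password: str, authzid: str = "") -> bytes:
    """Build a SASL PLAIN initial response as laid out in RFC 4616.

    Layout: authzid NUL authcid NUL passwd, all UTF-8 encoded.
    An empty authzid means "act as the authenticated user".
    """
    parts = (authzid, username, password)
    return b"\x00".join(part.encode("utf-8") for part in parts)


# Hypothetical credentials, for illustration only.
print(plain_initial_response("a_user", "a_password"))
```

An empty initial_response together with ret being False suggests the client-side SASL library never got as far as building such a message.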

I know I'm not the only guy trying to connect to Hive with pyhive on Windows. The poster of "SASL error when trying to connect to hive(hue) by python from my PC - Windows10" was too, but his 'solution' (installing Ubuntu as a VM on his Windows box) isn't going to work for me.

Long story short, the answer to this problem is that PyHive simply is not supported on Windows. This is because PyHive uses the sasl library for Hive connections, and sasl is not only difficult to compile from source on Windows, it seems that it simply may not work on Windows at all.

The key to working around this is to provide your own thrift_transport instead of relying on PyHive to create it. Devin Stevenson has provided an alternative transport (https://github.com/devinstevenson/pure-transport) that works well on Windows and should work on other OSes (I haven't tested this, however). His repo provides examples of using pure-transport directly with Hive as well as with SQLAlchemy.
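As a minimal sketch of the direct (non-SQLAlchemy) case, the original test script could be adapted roughly as follows. The transport_factory call uses the same keyword arguments as the Flask example in this answer; the helper name fetch_tables and the credentials in the usage note are assumptions for illustration, and this has not been run against a real HiveServer2.

```python
def fetch_tables(host, port, username, password, database="default"):
    """Run SHOW TABLES over a caller-supplied Thrift transport (sketch)."""
    # Imported here so the module still loads where pyhive or
    # pure-transport are not installed.
    from pyhive import hive
    import puretransport

    transport = puretransport.transport_factory(host=host,
                                                port=port,
                                                username=username,
                                                password=password)

    # host, port, auth and password must not be passed to hive.connect
    # when a thrift_transport is supplied; the transport carries them.
    conn = hive.connect(username=username,
                        database=database,
                        thrift_transport=transport)
    try:
        cur = conn.cursor()
        cur.execute("SHOW TABLES")
        return cur.fetchall()
    finally:
        conn.close()
```

With the sandbox from the question this would be called as something like fetch_tables('192.168.1.196', 10000, 'a_user', 'a_password'), where the username and password are placeholders.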

In my use case I am using it with Flask-SQLAlchemy within a Flask application. The way I inject the thrift transport is like this:

from flask_sqlalchemy import SQLAlchemy
import puretransport

thrift_transport = puretransport.transport_factory(host='127.0.0.1',
                                                   port=10000,
                                                   username='a_user',
                                                   password='a_password')


class MySQLAlchemy(SQLAlchemy):
    '''
    Subclassing the standard SQLAlchemy class so we can inject our own thrift
    transport which is needed to get pyhive to work on Windows
    '''
    def apply_driver_hacks(self, app, sa_url, options):
        '''
        If the current driver is for Hive, add our thrift transport
        '''
        if sa_url.drivername.startswith('hive'):
            if 'connect_args' not in options:
                options['connect_args'] = {'thrift_transport': thrift_transport}
        return super(MySQLAlchemy, self).apply_driver_hacks(app, sa_url, options)

# later, in models.py...
db = MySQLAlchemy()

class AModelClass(db.Model):
    __tablename__ = 'some_table'
    id = db.Column(db.Integer, primary_key=True)
    # etc...

In my case the URL I use for my Hive connection is simply of the form hive:///{database_name}, e.g. hive:///customers, because all the other necessary information is passed using the thrift transport. One caveat though: when a thrift transport is injected, PyHive requires that host, port, auth, kerberos_service_name, and password all be None. Unfortunately the Hive dialect that SQLAlchemy uses assigns the default Hive port of 10000 to port if no port number is provided in the URL. The solution is to replace the HiveDialect.create_connect_args method, as shown here: https://github.com/devinstevenson/pure-transport/issues/7. Simply subclassing the HiveDialect class will not work here, as the name HiveDialect is in SQLAlchemy's dialect registry and can't simply be replaced.
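A rough sketch of that replacement follows. It is not copied from the linked issue. It assumes that HiveDialect.create_connect_args builds its keyword arguments from the URL's host, port, username, password, database and query attributes, and it simply leaves port as None when the URL has none. The names create_connect_args_no_default_port and install_patch are made up for illustration.

```python
from types import SimpleNamespace


def create_connect_args_no_default_port(self, url):
    """Stand-in for HiveDialect.create_connect_args without the port fallback."""
    kwargs = {
        "host": url.host,
        "port": url.port,  # stays None if the URL has no port
        "username": url.username,
        "password": url.password,
        "database": url.database or "default",
    }
    kwargs.update(url.query)
    return [], kwargs


def install_patch():
    """Swap the method in on the registered dialect class (needs pyhive)."""
    from pyhive.sqlalchemy_hive import HiveDialect
    HiveDialect.create_connect_args = create_connect_args_no_default_port


# Quick check with a fake URL object shaped like hive:///customers.
fake_url = SimpleNamespace(host=None, port=None, username=None,
                           password=None, database="customers", query={})
_, connect_kwargs = create_connect_args_no_default_port(None, fake_url)
print(connect_kwargs)
```

install_patch() would need to be called once, before the engine is created, so that the patched method is the one in effect when the connection arguments are built.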
