
How to Connect to Hive via pyhive from Windows

I've been racking my brain for the past couple of days trying to connect to a Hive server from a Python client using pyhive on Windows. I'm new to Hive (and to pyhive, for that matter), but I'm a reasonably experienced Python dev. Every attempt fails with the following error:

(pyhive-test) C:\dev\sandbox\pyhive-test>python test.py
Traceback (most recent call last):
  File "test.py", line 3, in <module>
    conn = hive.Connection(host='192.168.1.196', port='10000', database='default', auth='NONE')
  File "C:\Users\harnerd\Anaconda3\envs\pyhive-test\lib\site-packages\pyhive\hive.py", line 192, in __init__
    self._transport.open()
  File "C:\Users\harnerd\Anaconda3\envs\pyhive-test\lib\site-packages\thrift_sasl\__init__.py", line 84, in open
    raise TTransportException(type=TTransportException.NOT_OPEN,
thrift.transport.TTransport.TTransportException: Could not start SASL: b'Error in sasl_client_start (-4) SASL(-4): no mechanism available: Unable to find a callback: 2'

when executing the following script:

from pyhive import hive

conn = hive.Connection(host='192.168.1.196', port='10000', database='default', auth='NONE')
cur = conn.cursor()
cur.execute('show tables')
data = cur.fetchall()
print(data)

The HiveServer2 instance is an out-of-the-box HDP Sandbox VM from Cloudera with HiveServer2 Authentication set to 'None'.

Client is an Anaconda virtual environment on Windows 10 with Python 3.8.5 and the following packages installed by conda:

  • pyhive 0.6.1
  • sasl 0.2.1
  • thrift 0.13.0
  • thrift-sasl 0.4.2

Right now I'm merely trying to connect to Hive with the script above, but ultimately I intend to use pyhive through SQLAlchemy in a Flask application. In other words: Flask -> Flask-SQLAlchemy -> SQLAlchemy -> pyhive. In production the Flask app will be hosted by Cloudera Data Science Workbench (i.e. some flavor of Linux), but it will be developed (and therefore must also run) on Windows systems.

Of course I've looked at the many questions here, on Cloudera's site, and on GitHub relating to Hive connection problems. If someone put a gun to my head, I would have to say that connecting from a Windows client may be part of the problem, as that doesn't seem to be a very common thing to do.

No mechanism available

What does that error even mean? It sure would be nice if there were some documentation on how to configure and use SASL from Python. If there is, I would like to know about it.

FWIW, the line causing the error is in thrift_sasl/__init__.py:

ret, chosen_mech, initial_response = self.sasl.start(self.mechanism)

self.mechanism is 'PLAIN'; chosen_mech and initial_response are empty strings (''). ret is False, which causes the exception to be thrown.

I know I'm not the only one trying to connect to Hive with pyhive on Windows. The author of "SASL error when trying to connect to hive(hue) by python from my PC - Windows10" was too, but his 'solution' (installing Ubuntu as a VM on his Windows box) isn't going to work for me.

Long story short, PyHive is not supported on Windows out of the box. PyHive uses the sasl library for Hive connections, and sasl is not only difficult to compile from source on Windows, it seems that it may simply not work there at all.

The key is to provide your own thrift_transport instead of relying on PyHive to create one. Devin Stevenson has published an alternative transport (https://github.com/devinstevenson/pure-transport) that works well on Windows and should work on other OSes too, although I haven't tested that. His repo provides examples of using pure-transport directly with Hive as well as with SQLAlchemy.
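For the plain script from the question, a minimal sketch of the direct (non-SQLAlchemy) route might look like the following. It assumes pyhive and pure-transport are installed, reuses the transport_factory call shown further down, and relies on PyHive's thrift_transport keyword. The helper name open_hive_connection and the credentials are placeholders of mine, not something from either library.

```python
def open_hive_connection(host, port, username, password, database="default"):
    """Open a PyHive connection over a pure-Python SASL transport (sketch)."""
    # Imported here so this module can be loaded even where the
    # third-party packages are not installed yet.
    from pyhive import hive
    import puretransport

    # Build the transport ourselves so PyHive never touches the sasl package.
    transport = puretransport.transport_factory(host=host,
                                                port=port,
                                                username=username,
                                                password=password)

    # host, port, auth and password are deliberately NOT passed here:
    # PyHive expects them to stay unset when a thrift_transport is supplied.
    return hive.connect(username=username,
                        thrift_transport=transport,
                        database=database)
```

Calling open_hive_connection('192.168.1.196', 10000, 'a_user', 'a_password') and then running cur.execute('show tables') on conn.cursor() would mirror the original test script.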

In my use case I am using it with Flask-SQLAlchemy within a Flask application. The way I inject the thrift transport is like this:

from flask_sqlalchemy import SQLAlchemy
import puretransport

thrift_transport = puretransport.transport_factory(host='127.0.0.1',
                                                   port=10000,
                                                   username='a_user',
                                                   password='a_password')


class MySQLAlchemy(SQLAlchemy):
    '''
    Subclassing the standard SQLAlchemy class so we can inject our own thrift
    transport which is needed to get pyhive to work on Windows
    '''
    def apply_driver_hacks(self, app, sa_url, options):
        '''
        If the current driver is for Hive, add our thrift transport
        '''
        if sa_url.drivername.startswith('hive'):
            # Add the transport without clobbering any connect_args already set
            connect_args = options.setdefault('connect_args', {})
            connect_args.setdefault('thrift_transport', thrift_transport)
        return super().apply_driver_hacks(app, sa_url, options)

# later, in models.py...
db = MySQLAlchemy()

class AModelClass(db.Model):
    __tablename__ = 'some_table'
    id = db.Column(db.Integer, primary_key=True)
    # etc...

In my case the URL I use for my Hive connection is simply of the form hive:///{database_name}, e.g. hive:///customers, because all the other necessary information is passed via the thrift transport.

One caveat, though: when a thrift transport is injected, PyHive asserts that host, port, auth, kerberos_service_name, and password all have no value other than None. Unfortunately, SQLAlchemy's Hive dialect assigns the default Hive port of 10000 to port if the URL provides no port number. The solution is to replace the HiveDialect.create_connect_args method, as shown here: https://github.com/devinstevenson/pure-transport/issues/7. Simply subclassing HiveDialect will not work, because the name HiveDialect is in SQLAlchemy's dialect registry and can't simply be replaced.
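As a rough sketch of that replacement (not a copy of the code in the linked issue): it assumes HiveDialect is importable from pyhive.sqlalchemy_hive and that the stock method builds its keyword arguments from the URL in roughly this shape. The function names are mine, and the SimpleNamespace object only stands in for a parsed SQLAlchemy URL so the behaviour can be seen without a Hive server.

```python
from types import SimpleNamespace


def create_connect_args_no_default_port(self, url):
    """Build PyHive connect kwargs from a URL without defaulting the port."""
    kwargs = {
        "host": url.host,
        "port": url.port,  # left as None instead of falling back to 10000
        "username": url.username,
        "password": url.password,
        "database": url.database or "default",
    }
    kwargs.update(url.query)
    return [], kwargs


def patch_hive_dialect():
    """Swap the method on the registered dialect class in place."""
    # Imported lazily so the sketch loads without pyhive installed.
    from pyhive.sqlalchemy_hive import HiveDialect
    HiveDialect.create_connect_args = create_connect_args_no_default_port


# Stand-in for the parsed URL hive:///customers (no host, port or credentials).
fake_url = SimpleNamespace(host=None, port=None, username=None,
                           password=None, database="customers", query={})
args, connect_kwargs = create_connect_args_no_default_port(None, fake_url)
print(connect_kwargs["port"])  # None
```

patch_hive_dialect() would need to run once, before the first engine or Flask-SQLAlchemy db object connects, so that the port is still None by the time PyHive checks its arguments.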
