
Insert Into Hive Using Pyhive invokes an error

I am using pyhive to interact with Hive.

The SELECT statement goes well using the code below.

# Import hive module and connect
from pyhive import hive
conn = hive.Connection(host="HOST")
cur = conn.cursor()

# Import pandas
import pandas as pd

# Store select query in dataframe 
all_tables = pd.read_sql("SELECT * FROM table LIMIT 5", conn)
print(all_tables)

# Using the cursor
cur = conn.cursor()
cur.execute('SELECT * FROM table LIMIT 5')
print(cur.fetchall())

Up to this point there is no problem. The problem appears when I want to INSERT into Hive.

Let's say I want to execute this query: INSERT INTO table2 SELECT Col1, Col2 FROM table1;

I tried:

cur.execute('INSERT INTO table2 SELECT Col1, Col2 FROM table1')

I receive this error:

pyhive.exc.OperationalError: TExecuteStatementResp(status=TStatus(errorCode=1, errorMessage=u'Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.tez.TezTask', sqlState=u'08S01', infoMessages=[u'*org.apache.hive.service.cli.HiveSQLException:Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.tez.TezTask:28:27', u'org.apache.hive.service.cli.operation.Operation:toSQLException:Operation.java:388', u'org.apache.hive.service.cli.operation.SQLOperation:runQuery:SQLOperation.java:244', u'org.apache.hive.service.cli.operation.SQLOperation:runInternal:SQLOperation.java:279', u'org.apache.hive.service.cli.operation.Operation:run:Operation.java:324', u'org.apache.hive.service.cli.session.HiveSessionImpl:executeStatementInternal:HiveSessionImpl.java:499', u'org.apache.hive.service.cli.session.HiveSessionImpl:executeStatement:HiveSessionImpl.java:475', u'sun.reflect.GeneratedMethodAccessor81:invoke::-1', u'sun.reflect.DelegatingMethodAccessorImpl:invoke:DelegatingMethodAccessorImpl.java:43', u'java.lang.reflect.Method:invoke:Method.java:498', u'org.apache.hive.service.cli.session.HiveSessionProxy:invoke:HiveSessionProxy.java:78', u'org.apache.hive.service.cli.session.HiveSessionProxy:access$000:HiveSessionProxy.java:36', u'org.apache.hive.service.cli.session.HiveSessionProxy$1:run:HiveSessionProxy.java:63', u'java.security.AccessController:doPrivileged:AccessController.java:-2', u'javax.security.auth.Subject:doAs:Subject.java:422', u'org.apache.hadoop.security.UserGroupInformation:doAs:UserGroupInformation.java:1698', u'org.apache.hive.service.cli.session.HiveSessionProxy:invoke:HiveSessionProxy.java:59', u'com.sun.proxy.$Proxy33:executeStatement::-1', u'org.apache.hive.service.cli.CLIService:executeStatement:CLIService.java:270', u'org.apache.hive.service.cli.thrift.ThriftCLIService:ExecuteStatement:ThriftCLIService.java:507', 
u'org.apache.hive.service.rpc.thrift.TCLIService$Processor$ExecuteStatement:getResult:TCLIService.java:1437', u'org.apache.hive.service.rpc.thrift.TCLIService$Processor$ExecuteStatement:getResult:TCLIService.java:1422', u'org.apache.thrift.ProcessFunction:process:ProcessFunction.java:39', u'org.apache.thrift.TBaseProcessor:process:TBaseProcessor.java:39', u'org.apache.hive.service.auth.TSetIpAddressProcessor:process:TSetIpAddressProcessor.java:56', u'org.apache.thrift.server.TThreadPoolServer$WorkerProcess:run:TThreadPoolServer.java:286', u'java.util.concurrent.ThreadPoolExecutor:runWorker:ThreadPoolExecutor.java:1149', u'java.util.concurrent.ThreadPoolExecutor$Worker:run:ThreadPoolExecutor.java:624', u'java.lang.Thread:run:Thread.java:748'], statusCode=3), operationHandle=None)

If I execute the same query in Hive directly, everything runs well. Any thoughts?

NB: All my tables are external.

CREATE EXTERNAL TABLE IF NOT EXISTS table ( col1 String, col2 String) stored as orc LOCATION 's3://somewhere' tblproperties ("orc.compress"="SNAPPY");

The solution was to add the username in the connection line: conn = hive.Connection(host="HOST", username="USER")

From what I understand, Hive queries are divided into many types of operations (jobs). When you perform a simple query (i.e. SELECT * FROM table), this reads data from the Hive metastore; no MapReduce job or temporary tables are needed to perform the query. But as soon as you switch to more complicated queries (i.e. using JOINs), you end up with the same error.

The file code looks like this:

# Import hive module and connect
from pyhive import hive
conn = hive.Connection(host="HOST", username="USER")
cur = conn.cursor()
query = "INSERT INTO table2 SELECT Col1, Col2 FROM table1"
cur.execute(query)

So maybe it needs permissions or something. I will research this behavior further and update the answer.

I'm not sure how to insert a pandas df using pyhive, but if you have pyspark installed, one option is to convert it to a Spark DataFrame and use pyspark to do it.

from pyspark.sql import SparkSession

# a SparkSession with Hive support is needed to write to Hive tables
spark = SparkSession.builder.enableHiveSupport().getOrCreate()

spark_df = spark.createDataFrame(pandas_df)
spark_df.write.mode('append').saveAsTable('database_name.table_name')
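If pyspark is not an option, one workaround is to build a plain INSERT ... VALUES statement from the DataFrame and run it through the pyhive cursor itself. This is only a minimal sketch: `insert_statement` is a hypothetical helper (not part of pyhive or pandas), it assumes the target table's columns line up with the DataFrame's, it quotes every value as a string, and it is only sensible for small DataFrames since each value ends up inline in the SQL text.

```python
import pandas as pd

def insert_statement(df, table):
    """Build an INSERT ... VALUES statement for a small pandas DataFrame.

    Values are quoted naively (single quotes doubled, NaN/None becomes NULL);
    fine for a sketch, not for untrusted input.
    """
    def fmt(value):
        if pd.isna(value):
            return "NULL"
        return "'{}'".format(str(value).replace("'", "''"))

    rows = ", ".join(
        "({})".format(", ".join(fmt(v) for v in row))
        for row in df.itertuples(index=False)
    )
    return "INSERT INTO {} VALUES {}".format(table, rows)

pandas_df = pd.DataFrame({"col1": ["a", "b"], "col2": ["x", "y"]})
query = insert_statement(pandas_df, "table2")
print(query)  # INSERT INTO table2 VALUES ('a', 'x'), ('b', 'y')
# cur.execute(query)  # would run against the open pyhive cursor
```

For anything beyond a handful of rows, writing the data to a file on HDFS/S3 and pointing an external table at it (or using the pyspark route above) will be far faster than row-by-row SQL.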

You can do the following using Spark.

from pyspark.sql import SparkSession

# a SparkSession with Hive support is needed to write to Hive tables
spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# convert the pandas data frame to a spark data frame
spark_df = spark.createDataFrame(pandas_df)

# register the spark data frame as a temp view
spark_df.createOrReplaceTempView("my_temp_table")

# execute the insert statement using spark sql
spark.sql("insert into hive_table select * from my_temp_table")

This will insert the data in your data frame into a Hive table.

Hope this helps.
