
Java heap space error when exporting a Spark DataFrame to a Hive database

I am using pyspark to do some text analysis on a table in Hive. I use the following code:

import re

from pyspark.sql import SQLContext, Row, HiveContext
from pyspark.sql.functions import col, udf
from pyspark.sql.types import *
from pyspark import SparkContext
from pyspark.ml.feature import Tokenizer

hc = HiveContext(sc)
df = hc.sql("select * from table1")

def cleaning_text(sentence):
    sentence = sentence.lower()
    sentence = re.sub("'", " ", sentence)
    # drop tokens of length <= 2
    cleaned = ' '.join([w for w in sentence.split() if not len(w) <= 2])
    return cleaned

org_val = udf(cleaning_text, StringType())
data = df.withColumn("cleaned", org_val(df.text))

data_1=data.select('uniqueid','cleaned','parsed')#2630789 #2022395
tokenizer = Tokenizer(inputCol="cleaned", outputCol="words")
wordsData = tokenizer.transform(data_1)

hc.sql("SET spark.sql.hive.convertMetastoreParquet=false")
hc.sql("create table table2 (uniqueid string, cleaned string, parsed string)")
wordsData.insertInto('table2')
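As a quick sanity check of the UDF logic outside Spark, the same cleaning steps can be run as plain Python (note the comprehension must split `sentence`, not `cleaned`, or the function raises `UnboundLocalError`):

```python
import re

def cleaning_text(sentence):
    # lower-case, replace apostrophes with spaces, drop tokens of length <= 2
    sentence = sentence.lower()
    sentence = re.sub("'", " ", sentence)
    return " ".join(w for w in sentence.split() if len(w) > 2)

print(cleaning_text("It's a Test of THE cleaner"))  # -> test the cleaner
```

Once this behaves as expected on sample strings, wrapping it with `udf(cleaning_text, StringType())` as above should produce the same output per row.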

I can do

wordsData.show(2)

However, when I try to export it, I get this error:

INFO FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
Exception in thread "stdout writer for python" 17/02/02 15:18:44 ERROR Utils: Uncaught exception in thread stdout writer for python
java.lang.OutOfMemoryError: Java heap space

I do not mind if this gets exported as a text file too.
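Since a plain text export is also acceptable, one sketch is to flatten each row into a delimited line and write it with `saveAsTextFile`. The tab separator and output path below are my own choices, not from the question; the Spark call is commented out because it needs the `wordsData` DataFrame and a running cluster:

```python
def row_to_line(uniqueid, cleaned, parsed):
    # join the three columns into one tab-separated line
    return "\t".join([uniqueid, cleaned, parsed])

# On the cluster (assumes the wordsData DataFrame from above):
# wordsData.select('uniqueid', 'cleaned', 'parsed') \
#     .rdd.map(lambda r: row_to_line(r.uniqueid, r.cleaned, r.parsed)) \
#     .saveAsTextFile('/tmp/wordsData_text')  # hypothetical HDFS path
```

`saveAsTextFile` writes one part file per partition, so the result is a directory of text files rather than a single file.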

I was running this script in the PySpark shell, which defaults to a driver memory of 1g.

I changed it by passing the flag below when starting the shell:

pyspark --driver-memory 10g

This solved my problem.
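An equivalent way to make this permanent (standard Spark configuration; 10g is just the value that worked here) is a line in conf/spark-defaults.conf:

```
# conf/spark-defaults.conf
spark.driver.memory    10g
```

Note the driver heap must be set before the JVM starts, so changing spark.driver.memory on an already-running session has no effect.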

When inserting into the table, issue the insert statement through the HiveContext, since you are writing to a Hive table:

hc.sql("SET spark.sql.hive.convertMetastoreParquet=false")
hc.sql("create table table2 (uniqueid string, cleaned string, parsed string)")
wordsData.registerTempTable("tb1")
df1 = hc.sql("insert into table table2 select * from tb1")

If the above does not work, or does not suit you, try the approach below, where you can use saveAsTable directly (make sure the table has already been created with your desired schema):

wordsData.write.mode("append").saveAsTable("sample_database.sample_tablename")

If you get any errors trying the above, please paste them here and I will help you further.
