Java heap space error when exporting a Spark DataFrame to a Hive database

Question

I am using pyspark to do some text analysis on a table in Hive. I use the following code:

import re

from pyspark.ml.feature import Tokenizer
from pyspark.sql import HiveContext
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

hc = HiveContext(sc)  # sc is the SparkContext provided by the pyspark shell
df = hc.sql("select * from table1")

def cleaning_text(sentence):
    # Lowercase, replace apostrophes with spaces, drop words of 2 characters or fewer
    sentence = sentence.lower()
    sentence = re.sub('\'', ' ', sentence)
    cleaned = ' '.join([w for w in sentence.split() if not len(w) <= 2])
    return cleaned

org_val = udf(cleaning_text, StringType())
data = df.withColumn("cleaned", org_val(df.text))

data_1 = data.select('uniqueid', 'cleaned', 'parsed')  #2630789 #2022395
tokenizer = Tokenizer(inputCol="cleaned", outputCol="words")
wordsData = tokenizer.transform(data_1)

hc.sql("SET spark.sql.hive.convertMetastoreParquet=false")
hc.sql("create table table2 (uniqueid string, cleaned string, parsed string)")
wordsData.insertInto('table2')

I can run

wordsData.show(2)

but when I try to export it, I get this error:

INFO FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
Exception in thread "stdout writer for python" 17/02/02 15:18:44 ERROR Utils: Uncaught exception in thread stdout writer for python
java.lang.OutOfMemoryError: Java heap space

I don't mind if it gets exported as a text file instead.
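
The `cleaning_text` step from the code above can be exercised in plain Python, without Spark, to confirm it behaves as intended before it is wrapped in a UDF. This is only a minimal sketch: it assumes the list comprehension is meant to iterate over `sentence.split()`, and the sample sentence is made up for illustration.

```python
import re


def cleaning_text(sentence):
    """Lowercase, turn apostrophes into spaces, drop words of 2 chars or fewer."""
    sentence = sentence.lower()
    sentence = re.sub("'", " ", sentence)
    return " ".join(w for w in sentence.split() if len(w) > 2)


# Hypothetical sample input, not taken from table1
print(cleaning_text("It's a GREAT day to be in Spark"))  # prints: great day spark
```

Note that the function calls `.lower()` on its argument directly, so a NULL in the `text` column (passed to the UDF as `None`) would raise an error inside the Python worker.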

Answer 1

I was running this script in the Spark shell, where driver memory defaults to 1g.

I changed it by launching the Spark shell with the following command:

pyspark --driver-memory 10g

This solved my problem.

Answer 2

When you insert into the table, you should issue the insert statement through the HiveContext, since it is writing to a Hive table.

hc.sql("SET spark.sql.hive.convertMetastoreParquet=false")
hc.sql("create table table2 (uniqueid string, cleaned string, parsed string)")
wordsData.registerTempTable("tb1")
df1 = hc.sql("insert into table table2 select * from tb1")

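If the above does not work or does not suit you, try the following, which saves the DataFrame directly as a table (make sure the table has already been created with the required schema):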

wordsData.write.mode("append").saveAsTable("sample_database.sample_tablename")

If you run into any errors while trying the above, paste the error here and I will help you further.

Answer 1
1, accepted, 2017-02-02 22:26:53

Answer 2
0, 2017-02-02 21:39:57
