Creating dictionary from large Pyspark dataframe showing OutOfMemoryError: Java heap space
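I have seen and tried many existing StackOverflow posts about this issue, but none of them work. I guess my Java heap space is not as large as my dataset needs; the dataset contains 6.5 million rows. My Linux instance has 64 GB of RAM and 4 cores. According to this suggestion I need to fix my code, but I would think that making a dictionary from a PySpark dataframe should not be very expensive. Please let me know if there is another way to compute it.
I just want to make a Python dictionary from my PySpark dataframe. This is the content of the dataframe:
property_sql_df.show()
shows,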
+--------------+------------+--------------------+--------------------+
| id|country_code| name| hash_of_cc_pn_li|
+--------------+------------+--------------------+--------------------+
| BOND-9129450| US|Scotron Home w/Ga...|90cb0946cf4139e12...|
| BOND-1742850| US|Sited in the Mead...|d5c301f00e9966483...|
| BOND-3211356| US|NEW LISTING - Com...|811fa26e240d726ec...|
| BOND-7630290| US|EC277- 9 Bedroom ...|d5c301f00e9966483...|
| BOND-7175508| US|East Hampton Retr...|90cb0946cf4139e12...|
+--------------+------------+--------------------+--------------------+
What I want is a dictionary with hash_of_cc_pn_li as the key and a list of the matching id values as the value.
Expected output
{
    "90cb0946cf4139e12": ["BOND-9129450", "BOND-7175508"],
    "d5c301f00e9966483": ["BOND-1742850", "BOND-7630290"]
}
What I have tried so far:
Way 1: causes java.lang.OutOfMemoryError: Java heap space
%%time
duplicate_property_list = {}
# collect() pulls every row of the dataframe into the driver at once
for ind in property_sql_df.collect():
    hashed_value = ind.hash_of_cc_pn_li
    property_id = ind.id
    if hashed_value in duplicate_property_list:
        duplicate_property_list[hashed_value].append(property_id)
    else:
        duplicate_property_list[hashed_value] = [property_id]
Way 2: does not work because PySpark has no native OFFSET
%%time
i = 0
limit = 1000000
for offset in range(0, total_record, limit):
    i = i + 1
    if i != 1:
        offset = offset + 1

    duplicate_property_list = {}
    duplicate_properties = {}

    # Preparing dataframe
    url = '''select id, hash_of_cc_pn_li from properties_df LIMIT {} OFFSET {}'''.format(limit, offset)
    properties_sql_df = spark.sql(url)

    # Grouping dataset
    rows = properties_sql_df.groupBy("hash_of_cc_pn_li").agg(F.collect_set("id").alias("ids")).collect()
    duplicate_property_list = { row.hash_of_cc_pn_li: row.ids for row in rows }

    # Filter the dictionary to keep only entries with a duplicate count of 2 or more
    duplicate_properties = filterTheDict(duplicate_property_list, lambda elem : len(elem[1]) >= 2)

    # Writing to file
    with open('duplicate_detected/duplicate_property_list_all_'+str(i)+'.json', 'w') as fp:
        json.dump(duplicate_property_list, fp)
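The filterTheDict helper called in Way 2 is not shown in the question. A minimal sketch of what such a helper might look like, assuming it takes a dictionary and a predicate that receives each (key, value) pair; the name filter_the_dict and its signature are guesses for illustration, not the asker's actual code:

```python
def filter_the_dict(source, predicate):
    """Return a new dict with only the items for which predicate((key, value)) is true.

    Hypothetical stand-in for the undefined filterTheDict helper in the question.
    """
    return {key: value for key, value in source.items() if predicate((key, value))}


# Keep only hashes that map to two or more ids, i.e. the actual duplicates.
sample = {
    "90cb": ["BOND-9129450", "BOND-7175508"],
    "811f": ["BOND-3211356"],
}
duplicates = filter_the_dict(sample, lambda elem: len(elem[1]) >= 2)
```

With the sample above, only the "90cb" entry survives, because it is the only hash with more than one id.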
What I am getting on the console now:
java.lang.OutOfMemoryError: Java heap space
and the Jupyter notebook output shows this error:
ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java server (127.0.0.1:33097)
This is a follow-up to a question I asked here: Creating dictionary from Pyspark dataframe showing OutOfMemoryError: Java heap space
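Why not keep as much of the data and processing as possible in the executors, rather than collecting to the driver? If I understand this correctly, you could use PySpark transformations and aggregations and save directly to JSON, therefore leveraging the executors, then load that JSON output (likely partitioned into several files) back into Python as a dictionary. Admittedly, you introduce IO overhead, but this should let you get around the OOM heap space errors. Step by step: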
import pyspark.sql.functions as f
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

data = [
    ("BOND-9129450", "90cb"),
    ("BOND-1742850", "d5c3"),
    ("BOND-3211356", "811f"),
    ("BOND-7630290", "d5c3"),
    ("BOND-7175508", "90cb"),
]
df = spark.createDataFrame(data, ["id", "hash_of_cc_pn_li"])

df.groupBy(
    f.col("hash_of_cc_pn_li"),
).agg(
    f.collect_set("id").alias("id")  # use f.collect_list() here if you're not interested in deduplication of BOND-XXXXX values
).write.json("./test.json")
Check the output path:
ls -l ./test.json
-rw-r--r-- 1 jovyan users 0 Jul 27 08:29 part-00000-1fb900a1-c624-4379-a652-8e5b9dee8651-c000.json
-rw-r--r-- 1 jovyan users 50 Jul 27 08:29 part-00039-1fb900a1-c624-4379-a652-8e5b9dee8651-c000.json
-rw-r--r-- 1 jovyan users 65 Jul 27 08:29 part-00043-1fb900a1-c624-4379-a652-8e5b9dee8651-c000.json
-rw-r--r-- 1 jovyan users 65 Jul 27 08:29 part-00159-1fb900a1-c624-4379-a652-8e5b9dee8651-c000.json
-rw-r--r-- 1 jovyan users 0 Jul 27 08:29 _SUCCESS
Load into Python as a dict:
import json
from glob import glob

data = []
for file_name in glob('./test.json/*.json'):
    with open(file_name) as fp:  # fp, so the pyspark.sql.functions alias f is not shadowed
        # Spark writes one JSON object per line, and empty partitions give empty files,
        # so parse line by line instead of calling json.load() on the whole file
        for line in fp:
            if line.strip():
                data.append(json.loads(line))
Finally
{item['hash_of_cc_pn_li']:item['id'] for item in data}
{'d5c3': ['BOND-7630290', 'BOND-1742850'],
'811f': ['BOND-3211356'],
'90cb': ['BOND-9129450', 'BOND-7175508']}
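The loading and dictionary-building steps can also be folded into one helper. This is a minimal sketch rather than a definitive implementation: the function name load_grouped_ids and its parameters are made up for illustration, and it assumes the output directory holds the part-*.json files written by the groupBy step, each with one JSON object per line.

```python
import json
from collections import defaultdict
from pathlib import Path


def load_grouped_ids(output_dir, key_field="hash_of_cc_pn_li", ids_field="id"):
    """Build {hash: [ids]} from the part files in output_dir.

    Hypothetical helper: the name and parameters are assumptions for this sketch.
    """
    grouped = defaultdict(list)
    # Only *.json files are read, so marker files such as _SUCCESS are skipped.
    for part_file in sorted(Path(output_dir).glob("*.json")):
        with open(part_file) as fp:
            for line in fp:  # one JSON object per line
                line = line.strip()
                if not line:  # empty partitions leave empty files
                    continue
                record = json.loads(line)
                # extend() merges the lists if a key ever shows up in more than one record
                grouped[record[key_field]].extend(record[ids_field])
    return dict(grouped)
```

Calling load_grouped_ids('./test.json') on the output above should give the same dictionary as the comprehension. Reading line by line keeps only one record in memory at a time besides the result itself, which matters more once the part files hold many groups each.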
I hope this helps! Thank you for the good question!