How to generate, then reduce, a massive set of DataFrames from each row of one DataFrame in PySpark?
I cannot share my actual code or data, unfortunately, as it is proprietary, but I can produce an MWE if the problem isn't clear to readers from the text.
I am working with a dataframe containing ~50 million rows, each of which contains a large XML document. From each XML document, I extract a list of statistics relating to the number of occurrences and hierarchical relationships between tags (nothing like undocumented XML formats to brighten one's day).

I can express these statistics in dataframes, and I can combine these dataframes over multiple documents using standard operations like GROUP BY/SUM and DISTINCT. The goal is to extract the statistics for all 50 million documents and express them in a single dataframe.
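To illustrate what I mean by combining per-document stats dataframes, here is a minimal sketch with a hypothetical (tag_path, count) schema; none of these names come from my real data:

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Statistics extracted from two documents, one small dataframe each
stats_doc1 = spark.createDataFrame([("a", 3), ("a/b", 1)], ["tag_path", "count"])
stats_doc2 = spark.createDataFrame([("a", 2), ("a/c", 5)], ["tag_path", "count"])

# Combining two documents' stats is just a union followed by GROUP BY/SUM
combined = (stats_doc1.union(stats_doc2)
            .groupBy("tag_path")
            .agg(F.sum("count").alias("count")))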
The problem is that I don't know how to efficiently generate 50 million dataframes from each row of one dataframe in Spark, or how to tell Spark to reduce a list of 50 million dataframes to one dataframe using binary operators. Are there standard functions that do these things?
So far, the only workaround I have found is massively inefficient (storing the data as a string, parsing it, doing the computations, and then converting it back into a string). It would take weeks to finish using this method, so it isn't practical.
The extractions and statistical data from each XML response can be stored in additional columns of the row itself. That way Spark should be able to do the processing across its multiple executors, improving performance. Here is some pseudocode:
from pyspark.sql import Row
from pyspark.sql.types import (StructType, StructField, IntegerType,
                               StringType, DateType, FloatType, ArrayType)

def extract_metrics_from_xml(row):
    j = row['xmlResponse']  # assuming your xml column name is xmlResponse
    # perform your xml extractions and computations for the xmlResponse in python
    ...
    load_date = ...
    stats_data1 = ...
    stats_data2 = ...
    stats_group = ...
    return Row(load_date, stats_data1, stats_data2, stats_group)

# schema describing the Row returned above
schema = StructType([StructField('load_date', DateType()),
                     StructField('stats_data1', FloatType()),
                     StructField('stats_data2', ArrayType(IntegerType())),
                     StructField('stats_group', StringType())])

df_with_xml_stats = original_df.rdd \
    .map(extract_metrics_from_xml) \
    .toDF(schema=schema, sampleRatio=1) \
    .cache()
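Once the per-row statistics are in columns, getting the single combined dataframe you asked for is a plain aggregation. A minimal sketch, assuming the hypothetical columns above and that you want stats_data1 summed per stats_group:

import pyspark.sql.functions as F

combined_stats = (df_with_xml_stats
                  .groupBy('stats_group')
                  .agg(F.sum('stats_data1').alias('total_stats_data1')))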