How to generate, then reduce, a massive set of DataFrames from each row of one DataFrame in PySpark?

I cannot share my actual code or data, unfortunately, as it is proprietary, but I can produce an MWE if the problem isn't clear to readers from the text.

I am working with a dataframe containing ~50 million rows, each of which contains a large XML document. From each XML document, I extract a list of statistics relating to the number of occurrences and hierarchical relationships between tags (nothing like undocumented XML formats to brighten one's day). I can express these statistics in dataframes, and I can combine these dataframes over multiple documents using standard operations like GROUP BY/SUM and DISTINCT. The goal is to extract the statistics for all 50 million documents and express them in a single dataframe.
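For illustration, here is a toy sketch of the kind of per-document statistics frame I mean (the tags and the parent/tag/count columns are made up, since the real format is proprietary):

import xml.etree.ElementTree as ET
from collections import Counter
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Toy stand-in for one of the real XML documents.
doc = "<order><item><sku/></item><item><sku/></item></order>"
root = ET.fromstring(doc)

# Count (parent, child) tag pairs as a stand-in for the real statistics.
pair_counts = Counter((parent.tag, child.tag)
                      for parent in root.iter() for child in parent)
doc_stats = spark.createDataFrame(
    [(p, c, n) for (p, c), n in pair_counts.items()],
    ["parent", "tag", "count"])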

The problem is that I don't know how to efficiently generate 50 million dataframes from each row of one dataframe in Spark, or how to tell Spark to reduce a list of 50 million dataframes to one dataframe using binary operators. Are there standard functions that do these things?
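To be concrete, the literal version of what I mean looks something like the sketch below (reusing the toy parent/tag/count columns from above), but the union-based query plan grows with every fold, so I doubt it scales to 50 million frames:

from functools import reduce
from pyspark.sql import DataFrame, SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical per-document statistics frames (only two here; the real case
# would have 50 million, which is exactly what makes this approach infeasible).
per_doc_frames = [
    spark.createDataFrame([("order", "item", 2), ("item", "sku", 2)],
                          ["parent", "tag", "count"]),
    spark.createDataFrame([("order", "item", 1)],
                          ["parent", "tag", "count"]),
]

# Fold with a binary operator, then aggregate duplicate keys away.
all_stats = reduce(DataFrame.unionByName, per_doc_frames)
corpus_stats = all_stats.groupBy("parent", "tag").sum("count")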

So far, the only workaround I have found is massively inefficient (storing the data as a string, parsing it, doing the computations, and then converting it back into a string). It would take weeks to finish using this method, so it isn't practical.

The extractions and statistical data from each row's XML response can be stored in additional columns of the row itself. That way Spark can distribute the processing across its executors, improving performance. Here is some pseudocode.

from pyspark.sql import Row
from pyspark.sql.types import (StructType, StructField, IntegerType,
                               StringType, DateType, FloatType, ArrayType)

def extract_metrics_from_xml(row):
    xml_doc = row['xmlResponse']  # assuming your XML column name is xmlResponse
    # Perform your XML extractions and computations on xml_doc in plain Python.
    ...
    load_date = ...
    stats_data1 = ...
    stats_data2 = ...
    stats_group = ...

    return Row(load_date, stats_data1, stats_data2, stats_group)


schema = StructType([StructField('load_date', DateType()),
                     StructField('stats_data1', FloatType()),
                     StructField('stats_data2', ArrayType(IntegerType())),
                     StructField('stats_group', StringType())
                     ])

df_with_xml_stats = original_df.rdd\
                            .map(extract_metrics_from_xml)\
                            .toDF(schema=schema, sampleRatio=1)\
                            .cache()
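From there, reducing all the per-row statistics to the single corpus-level dataframe you want is an ordinary aggregation. For example (assuming stats_group is the right grouping key, which is a guess since your real columns aren't shown):

from pyspark.sql import functions as F

# Collapse the per-document statistics into one corpus-wide DataFrame.
corpus_stats = (df_with_xml_stats
                .groupBy('stats_group')
                .agg(F.sum('stats_data1').alias('stats_data1_total')))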
