How to generate, then reduce, a massive set of DataFrames from each row of one DataFrame in PySpark?

I cannot share my actual code or data, unfortunately, as it is proprietary, but I can produce a MWE if the problem isn't clear to readers from the text.

I am working with a dataframe containing ~50 million rows, each of which contains a large XML document. From each XML document, I extract a list of statistics relating to the number of occurrences and hierarchical relationships between tags (nothing like undocumented XML formats to brighten one's day). I can express these statistics in dataframes, and I can combine these dataframes over multiple documents using standard operations like GROUP BY/SUM and DISTINCT. The goal is to extract the statistics for all 50 million documents and express them in a single dataframe.
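Since the real code is proprietary, here is a hedged toy sketch of the kind of per-document statistics described: tag occurrence counts and parent/child pair counts. The helper name `tag_stats` and the sample documents are invented for illustration. The key property it demonstrates is that these statistics combine with plain addition, which is exactly the shape a GROUP BY/SUM over all documents needs.

```python
import xml.etree.ElementTree as ET
from collections import Counter

def tag_stats(xml_string):
    """Count tag occurrences and parent->child tag pairs in one document."""
    root = ET.fromstring(xml_string)
    counts = Counter()   # tag -> number of occurrences
    pairs = Counter()    # (parent_tag, child_tag) -> number of occurrences

    def walk(node):
        counts[node.tag] += 1
        for child in node:
            pairs[(node.tag, child.tag)] += 1
            walk(child)

    walk(root)
    return counts, pairs

# Two invented sample documents.
doc1 = "<a><b/><b><c/></b></a>"
doc2 = "<a><c/></a>"

# Per-document stats reduce by simple addition (a monoid), so they can
# be aggregated across any number of documents in any order.
total_counts = Counter()
total_pairs = Counter()
for doc in (doc1, doc2):
    c, p = tag_stats(doc)
    total_counts += c
    total_pairs += p
```

Because the reduction is just addition per key, it maps directly onto a distributed `groupBy().sum()` once the per-document stats live in columns.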

The problem is that I don't know how to efficiently generate 50 million dataframes from each row of one dataframe in Spark, or how to tell Spark to reduce a list of 50 million dataframes to one dataframe using binary operators. Are there standard functions that do these things?

So far, the only workaround I have found is massively inefficient (storing the data as a string, parsing it, doing the computations, and then converting it back into a string). It would take weeks to finish using this method, so it isn't practical.

The extraction results and statistics for each XML document can be stored in additional columns of the same row. That way, Spark can distribute the work across its executors, improving performance. Here is some pseudocode.

from pyspark.sql import Row
from pyspark.sql.types import (StructType, StructField, IntegerType,
                               StringType, DateType, FloatType, ArrayType)

def extract_metrics_from_xml(row):
    xml = row['xmlResponse']  # assuming your XML column is named xmlResponse
    # perform your XML extractions and computations in plain Python
    ...
    load_date = ...
    stats_data1 = ...

    return Row(load_date, stats_data1, stats_data2, stats_group)


schema = StructType([StructField('load_date', DateType()),
                     StructField('stats_data1', FloatType()),
                     StructField('stats_data2', ArrayType(IntegerType())),
                     StructField('stats_group', StringType())
                     ])

df_with_xml_stats = original_df.rdd\
                            .map(extract_metrics_from_xml)\
                            .toDF(schema=schema, sampleRatio=1)\
                            .cache()
