
How to split a PySpark DataFrame based on the content of its rows

I want to split the rows of a PySpark DataFrame into groups according to how each row begins.

The original data has a single column that holds two kinds of rows:

  1. the file name (such as 'Date20191009')

  2. the file content (such as '1', '2', '3')

Sample input (PySpark DataFrame):

column1
Date20191009
1
2
3
Date20191010
1
4
5

I want to get a PySpark DataFrame in which each file name marks the start of a new group of rows.

The file name should go in column1 of the DataFrame, and the contents of that file should go in column2.

Expected output (PySpark DataFrame):

column1  column2
Date20191009 [1,2,3]
Date20191010 [1,4,5]

I tried converting to a pandas DataFrame and calling DataFrame.collect() in PySpark, but both failed because the data is too large (more than 9 million rows).

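>>> from pyspark.sql.window import Window
>>> from pyspark.sql.functions import when, last, collect_list, concat_ws
>>> # Running window from the first row up to the current row. It has no
>>> # partitionBy or orderBy, so it relies on the DataFrame keeping its input
>>> # row order, and Spark moves all rows to a single partition to evaluate it.
>>> w = Window.rowsBetween(Window.unboundedPreceding, 0)

>>> # Input DataFrame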

>>> df.show()
+------------+
|     column1|
+------------+
|Date20191009|
|           1|
|           2|
|           3|
|Date20191010|
|           1|
|           4|
|           5|
+------------+

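>>> # Keep the value only on 'Date...' rows (null elsewhere), then carry the
>>> # last non-null value forward over the running window.
>>> df1 = (df
...     .withColumn('tmp', when(df.column1.startswith('Date'), df.column1).otherwise(None))
...     .withColumn('temp', last('tmp', True).over(w))
...     .drop('tmp'))
>>> df1.show()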

+------------+------------+
|     column1|        temp|
+------------+------------+
|Date20191009|Date20191009|
|           1|Date20191009|
|           2|Date20191009|
|           3|Date20191009|
|Date20191010|Date20191010|
|           1|Date20191010|
|           4|Date20191010|
|           5|Date20191010|
+------------+------------+
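
To make the forward-fill step above easier to follow, here is a minimal plain-Python sketch of the same idea, run on the sample rows only. It is not Spark code and will not scale to 9 million rows; the function name `forward_fill_headers` and its `prefix` parameter are made up for this illustration.

```python
from typing import Iterable, List, Optional, Tuple


def forward_fill_headers(
    rows: Iterable[str], prefix: str = "Date"
) -> List[Tuple[str, Optional[str]]]:
    """Pair every row with the most recent row that starts with `prefix`.

    Hypothetical helper: mirrors what last('tmp', True) over the running
    window produces, assuming the rows arrive in their original order.
    """
    current: Optional[str] = None  # stays None until the first header row
    paired = []
    for value in rows:
        if value.startswith(prefix):
            current = value  # a header row becomes the new running label
        paired.append((value, current))
    return paired


sample = ["Date20191009", "1", "2", "3", "Date20191010", "1", "4", "5"]
paired = forward_fill_headers(sample)
for value, label in paired:
    print(f"{value:>12} | {label}")
```

Each row ends up labelled with the file name that precedes it, which is what the temp column holds in the table above.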

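>>> # Drop the file-name rows themselves, then gather each file's values under its name.
>>> (df1.filter(df1.column1 != df1.temp)
...     .groupBy(df1.temp)
...     .agg(concat_ws(',', collect_list(df1.column1)).alias('column2'))
...     .withColumnRenamed('temp', 'column1')
...     .show())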

+------------+-------+
|     column1|column2|
+------------+-------+
|Date20191009|  1,2,3|
|Date20191010|  1,4,5|
+------------+-------+
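
Note that column2 here is a comma-joined string, while the question asked for a list such as [1,2,3]. collect_list on its own already returns an array column, so leaving out concat_ws should give that shape. The filter-and-group step can be sketched in plain Python as below; again this is only an illustration on the sample data, and `group_by_header` is a hypothetical name, not a Spark API.

```python
from typing import Dict, Iterable, List, Optional, Tuple


def group_by_header(
    pairs: Iterable[Tuple[str, Optional[str]]]
) -> Dict[str, List[str]]:
    """Collect values under their file name, skipping the file-name rows.

    Hypothetical helper: `pairs` are (column1, temp) tuples like the rows
    of df1 above.
    """
    groups: Dict[str, List[str]] = {}
    for value, label in pairs:
        # Same effect as filter(column1 != temp): header rows are dropped,
        # and rows with no label yet (null temp) are dropped as well.
        if label is None or value == label:
            continue
        groups.setdefault(label, []).append(value)
    return groups


pairs = [
    ("Date20191009", "Date20191009"), ("1", "Date20191009"),
    ("2", "Date20191009"), ("3", "Date20191009"),
    ("Date20191010", "Date20191010"), ("1", "Date20191010"),
    ("4", "Date20191010"), ("5", "Date20191010"),
]
grouped = group_by_header(pairs)
for name, values in grouped.items():
    print(name, values)
```

The values stay as lists in this sketch, matching the shape of the expected output in the question.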
