I want to split a PySpark DataFrame according to the first characters of each line.
The original data has a single column whose rows contain either
a file name (such as 'Date20191009'), or
file contents (such as '1', '2', '3').
Input Sample File(Pyspark DataFrame):
column1
Date20191009
1
2
3
Date20191010
1
4
5
I want a PySpark DataFrame in which the data is split by file name:
the file name goes in column1, and the corresponding file contents go in column2.
Expected Output(Pyspark DataFrame)
column1 column2
Date20191009 [1,2,3]
Date20191010 [1,4,5]
I tried converting to a Pandas DataFrame and using PySpark's DataFrame.collect(), but both failed due to the data volume (more than 9 million rows).
>>> from pyspark.sql.window import Window
>>> from pyspark.sql.functions import *
>>> w = Window.rowsBetween(Window.unboundedPreceding, 0)
#Input DataFrame
>>> df.show()
+------------+
| column1|
+------------+
|Date20191009|
| 1|
| 2|
| 3|
|Date20191010|
| 1|
| 4|
| 5|
+------------+
#Tag the rows that start with 'Date', then forward-fill the file name with last(..., ignorenulls=True)
>>> df1 = df.withColumn('tmp', when(df.column1.startswith('Date'), df.column1).otherwise(None)).withColumn('temp', last('tmp', True).over(w)).drop('tmp')
>>> df1.show()
+------------+------------+
| column1| temp|
+------------+------------+
|Date20191009|Date20191009|
| 1|Date20191009|
| 2|Date20191009|
| 3|Date20191009|
|Date20191010|Date20191010|
| 1|Date20191010|
| 4|Date20191010|
| 5|Date20191010|
+------------+------------+
#Drop the file-name rows, then group by file name and join the contents
>>> df1.filter(df1.column1 != df1.temp).groupBy(df1.temp).agg(concat_ws(',',collect_list(df1.column1)).alias('column2')).withColumnRenamed("temp", "column1").show()
+------------+-------+
| column1|column2|
+------------+-------+
|Date20191009| 1,2,3|
|Date20191010| 1,4,5|
+------------+-------+