
How to iterate over multiple dataframe columns in pyspark?

So, let's say I have a dataframe df with a single column, where df.show() looks like |a,b,c,d,....| on every row, and I want to get a df1 where df1.show() looks like |a|b|c|...|. To put it shortly, I want to break a dataframe with one single column into a dataframe with multiple columns. I get the idea of:

import pyspark.sql.functions

split_col = pyspark.sql.functions.split(df['x'], ' ')
df = df.withColumn('0', split_col.getItem(0))
df = df.withColumn('1', split_col.getItem(1))

and so on, but what if I have tons of columns? Is there any way to do this in pyspark without writing out a huge iteration like this by hand? Thanks
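For reference, a minimal sketch of the kind of dataframe I mean (the column name x and the space delimiter are assumptions matching the snippet above):

from pyspark.sql import SparkSession

# Hypothetical sample data: one string column "x" holding space-separated values.
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a b c d",), ("e f g h",)], ["x"])
df.show()
# +-------+
# |      x|
# +-------+
# |a b c d|
# |e f g h|
# +-------+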

You could iterate and set the column names using a select clause, as shown below.

In this version, split is computed separately for every output column (note the one split call per field in the plan below), so it will be less efficient.

from pyspark.sql import functions as F

df.select(*[(F.split("x",' ')[i]).alias(str(i)) for i in range(100)]).explain()

#== Physical Plan ==
#*(1) Project [split(x#200,  )[0] AS 0#1708, split(x#200,  )[1] AS 1#1709,
#   split(x#200,  )[2] AS 2#1710, split(x#200,  )[3] AS 3#1711,
#   split(x#200,  )[4] AS 4#1712, split(x#200,  )[5] AS 5#1713,
#   split(x#200,  )[6] AS 6#1714, split(x#200,  )[7] AS 7#1715,
#   split(x#200,  )[8] AS 8#1716, split(x#200,  )[9] AS 9#1717,
#   split(x#200,  )[10] AS 10#1718, split(x#200,  )[11] AS 11#1719,
#   split(x#200,  )[12] AS 12#1720, split(x#200,  )[13] AS 13#1721,
#   split(x#200,  )[14] AS 14#1722, split(x#200,  )[15] AS 15#1723,
#   split(x#200,  )[16] AS 16#1724, split(x#200,  )[17] AS 17#1725,
#   split(x#200,  )[18] AS 18#1726, split(x#200,  )[19] AS 19#1727,
#   split(x#200,  )[20] AS 20#1728, split(x#200,  )[21] AS 21#1729,
#   split(x#200,  )[22] AS 22#1730, split(x#200,  )[23] AS 23#1731,
#   ... 76 more fields]
#+- *(1) Scan ExistingRDD[x#200]

Instead, you can split the column once and let Spark project a single split operation rather than many.

from pyspark.sql import functions as F


df\
  .withColumn("x", F.split('x',' '))\
  .select(*[(F.col("x")[i]).alias(str(i)) for i in range(100)]).drop("x").explain()

#== Physical Plan ==
#*(1) Project [x#1908[0] AS 0#1910, x#1908[1] AS 1#1911,
#   x#1908[2] AS 2#1912, x#1908[3] AS 3#1913,
#   x#1908[4] AS 4#1914, x#1908[5] AS 5#1915,
#   x#1908[6] AS 6#1916, x#1908[7] AS 7#1917,
#   x#1908[8] AS 8#1918, x#1908[9] AS 9#1919,
#   x#1908[10] AS 10#1920, x#1908[11] AS 11#1921,
#   x#1908[12] AS 12#1922, x#1908[13] AS 13#1923,
#   x#1908[14] AS 14#1924, x#1908[15] AS 15#1925,
#   x#1908[16] AS 16#1926, x#1908[17] AS 17#1927,
#   x#1908[18] AS 18#1928, x#1908[19] AS 19#1929,
#   x#1908[20] AS 20#1930, x#1908[21] AS 21#1931,
#   x#1908[22] AS 22#1932, x#1908[23] AS 23#1933,
#   ... 76 more fields]
#+- *(1) Project [split(x#200,  ) AS x#1908]
#   +- *(1) Scan ExistingRDD[x#200]
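If the number of tokens is not known up front, you could also derive the range from the data instead of hard-coding 100. A sketch under that assumption (the variable names are illustrative):

from pyspark.sql import functions as F

# Split once, then find the largest array size to decide how many columns to create.
arr_df = df.withColumn("x", F.split("x", " "))
n_cols = arr_df.agg(F.max(F.size("x"))).collect()[0][0]

# Project each array element into its own column, named by index.
df1 = arr_df.select(*[arr_df["x"][i].alias(str(i)) for i in range(n_cols)])
df1.show()

Note that computing the maximum size triggers an extra job over the data, so hard-coding the count is cheaper when it is already known.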
