Say I have a DataFrame df with a single column x, where df.show() gives
|a,b,c,d,....|
|a,b,c,d,....|
and I want a DataFrame df1 where df1.show() gives
|a|b|c.....|
In short, I want to break a DataFrame with one column into a DataFrame with multiple columns. I have the idea of
import pyspark.sql.functions
split_col = pyspark.sql.functions.split(df['x'], ' ')
df = df.withColumn('0', split_col.getItem(0))
df = df.withColumn('1', split_col.getItem(1))
and so on, but what if I have a lot of columns? Is there a way to do this in PySpark without writing out a long series of withColumn calls by hand? Thanks
You can build all the columns in a single select, generating each expression and its name in a loop, as shown below.
In this version split is called once per generated column, so the plan contains a separate split for every output column, which is less efficient.
from pyspark.sql import functions as F
df.select(*[(F.split("x",' ')[i]).alias(str(i)) for i in range(100)]).explain()
#== Physical Plan ==
#*(1) Project [split(x#200, )[0] AS 0#1708, split(x#200, )[1] AS 1#1709,
#  split(x#200, )[2] AS 2#1710, split(x#200, )[3] AS 3#1711,
#  split(x#200, )[4] AS 4#1712, split(x#200, )[5] AS 5#1713,
#  split(x#200, )[6] AS 6#1714, split(x#200, )[7] AS 7#1715,
#  split(x#200, )[8] AS 8#1716, split(x#200, )[9] AS 9#1717,
#  split(x#200, )[10] AS 10#1718, split(x#200, )[11] AS 11#1719,
#  split(x#200, )[12] AS 12#1720, split(x#200, )[13] AS 13#1721,
#  split(x#200, )[14] AS 14#1722, split(x#200, )[15] AS 15#1723,
#  split(x#200, )[16] AS 16#1724, split(x#200, )[17] AS 17#1725,
#  split(x#200, )[18] AS 18#1726, split(x#200, )[19] AS 19#1727,
#  split(x#200, )[20] AS 20#1728, split(x#200, )[21] AS 21#1729,
#  split(x#200, )[22] AS 22#1730, split(x#200, )[23] AS 23#1731, ... 76 more fields]
#+- *(1) Scan ExistingRDD[x#200]
Instead, you can split the column once into an array and then index into that array.
Spark then projects a single split operation rather than one per column.
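from pyspark.sql import functions as F
# Split once into an array column, then index into that array for each output column.
# The select keeps only the new columns, so no drop("x") is needed afterwards.
df\
.withColumn("x", F.split('x',' '))\
.select(*[(F.col("x")[i]).alias(str(i)) for i in range(100)]).explain()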
#== Physical Plan ==
#*(1) Project [x#1908[0] AS 0#1910, x#1908[1] AS 1#1911, x#1908[2] AS 2#1912,
#  x#1908[3] AS 3#1913, x#1908[4] AS 4#1914, x#1908[5] AS 5#1915,
#  x#1908[6] AS 6#1916, x#1908[7] AS 7#1917, x#1908[8] AS 8#1918,
#  x#1908[9] AS 9#1919, x#1908[10] AS 10#1920, x#1908[11] AS 11#1921,
#  x#1908[12] AS 12#1922, x#1908[13] AS 13#1923, x#1908[14] AS 14#1924,
#  x#1908[15] AS 15#1925, x#1908[16] AS 16#1926, x#1908[17] AS 17#1927,
#  x#1908[18] AS 18#1928, x#1908[19] AS 19#1929, x#1908[20] AS 20#1930,
#  x#1908[21] AS 21#1931, x#1908[22] AS 22#1932, x#1908[23] AS 23#1933, ... 76 more fields]
#+- *(1) Project [split(x#200, ) AS x#1908]
#   +- *(1) Scan ExistingRDD[x#200]