Access Pyspark dataframe's (n+1)th column when nth column value is 'x'

I have a dataframe with 1 million rows and 200 columns. I need only a few columns in my final dataframe. For every row, if one of the column values is 3300, I need to put the next column's value into my final dataframe.

For example:

(screenshot of a sample dataframe)

Here col3's value is 3300, so I need to put col4 in my final dataframe. Using column names won't be a good solution because I have 200 columns.

One way to do this is:

df = spark.createDataFrame(
    [(1100, 1200, 3300, 4400, 5500),
     (3300, 1200, 3200, 4400, 5500),
     (1100, 1200, 3200, 4400, 3300),
     (1100, 3300, 3300, 4400, 5500)],
    ['col1', 'col2', 'col3', 'col4', 'col5'])

+----+----+----+----+----+
|col1|col2|col3|col4|col5|
+----+----+----+----+----+
|1100|1200|3300|4400|5500|
|3300|1200|3200|4400|5500|
|1100|1200|3200|4400|3300|
|1100|3300|3300|4400|5500|
+----+----+----+----+----+

from itertools import chain
from pyspark.sql.functions import array, array_position, col, create_map, lit

# Map each 0-based column index to its name. Since array_position below is
# 1-based, looking up the matched position in this map directly yields the
# *next* column's name.
column_map = create_map([lit(i) for i in chain(*enumerate(df.columns))])

df.withColumn('data',array(df.columns)).\
   withColumn('index',array_position(array(df.columns),3300)).\
   withColumn('value',col('data').getItem(col('index'))).\
   withColumn('columnName',column_map[col('index')]).\
   select('columnName','value').show()

+----------+-----+
|columnName|value|
+----------+-----+
|      col4| 4400|
|      col2| 1200|
|      null| null|
|      col3| 3300|
+----------+-----+
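Why `column_map[col('index')]` gives the *next* column's name: `enumerate` numbers the columns from 0, while `array_position` is 1-based, so the two are off by one in exactly the direction we want. A plain-Python illustration of the mapping that `create_map` builds:

```python
from itertools import chain

columns = ['col1', 'col2', 'col3', 'col4', 'col5']

# chain(*enumerate(columns)) flattens (0,'col1'), (1,'col2'), ... into
# 0, 'col1', 1, 'col2', ... — the alternating key/value list create_map expects
flat = list(chain(*enumerate(columns)))
column_map = dict(zip(flat[::2], flat[1::2]))

print(column_map)
# {0: 'col1', 1: 'col2', 2: 'col3', 3: 'col4', 4: 'col5'}
```

So when 3300 sits in col3, `array_position` returns 3 (1-based), and `column_map[3]` is 'col4' — the column after the match.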

Update: to fetch fixed columns plus the value of the column right after 3300, use this:

df.withColumn('data',array(df.columns)).\
   withColumn('index',array_position(array(df.columns),3300)).\
   withColumn('value',col('data').getItem(col('index'))).\
   withColumn('columnName',column_map[col('index')]).\
   select('col1','col2','value')
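The same off-by-one trick can be mirrored in plain Python, which may make the edge cases easier to see (no match, or a match in the last column both yield null). The helper `next_after` below is hypothetical, just a sketch of the logic, not part of the Spark answer:

```python
columns = ['col1', 'col2', 'col3', 'col4', 'col5']
rows = [
    (1100, 1200, 3300, 4400, 5500),
    (3300, 1200, 3200, 4400, 5500),
    (1100, 1200, 3200, 4400, 3300),
    (1100, 3300, 3300, 4400, 5500),
]

def next_after(row, target=3300):
    # 1-based position of the first match, 0 if absent (like array_position)
    pos = row.index(target) + 1 if target in row else 0
    if 0 < pos < len(row):
        # 1-based position of the match == 0-based index of the next column
        return columns[pos], row[pos]
    return None, None  # no match, or the match is in the last column

for row in rows:
    print(next_after(row))
```

This reproduces the table above: the third row returns (None, None) because 3300 appears in the last column, so there is no "next" column to read.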

