Access Pyspark dataframe's (n+1)th column when nth column value is 'x'

I have a dataframe with 1 million rows and 200 columns. I need only a few columns in my final dataframe. For every row, if one of the column values is 3300, I need to put the next column's value into my final dataframe.

For example:

(screenshot of a sample dataframe)

Here col3's value is 3300, so I need to put col4 in my final dataframe. Using column names won't be a good solution because I have 200 columns.

One way to do this is:

df = spark.createDataFrame(
    [(1100, 1200, 3300, 4400, 5500),
     (3300, 1200, 3200, 4400, 5500),
     (1100, 1200, 3200, 4400, 3300),
     (1100, 3300, 3300, 4400, 5500)],
    ['col1', 'col2', 'col3', 'col4', 'col5'])

+----+----+----+----+----+
|col1|col2|col3|col4|col5|
+----+----+----+----+----+
|1100|1200|3300|4400|5500|
|3300|1200|3200|4400|5500|
|1100|1200|3200|4400|3300|
|1100|3300|3300|4400|5500|
+----+----+----+----+----+

from itertools import chain
from pyspark.sql.functions import array, array_position, col, create_map, lit

# Map each 0-based column index to its name. Since array_position below is
# 1-based, looking up the matched position in this map directly yields the
# *next* column's name.
column_map = create_map([lit(i) for i in chain(*enumerate(df.columns))])

df.withColumn('data',array(df.columns)).\
   withColumn('index',array_position(array(df.columns),3300)).\
   withColumn('value',col('data').getItem(col('index'))).\
   withColumn('columnName',column_map[col('index')]).\
   select('columnName','value').show()

+----------+-----+
|columnName|value|
+----------+-----+
|      col4| 4400|
|      col2| 1200|
|      null| null|
|      col3| 3300|
+----------+-----+
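Why `column_map[col('index')]` gives the *next* column's name: `enumerate` numbers the columns from 0, while `array_position` is 1-based, so the two are off by one in exactly the direction we want. A plain-Python illustration of the mapping that `create_map` builds:

```python
from itertools import chain

columns = ['col1', 'col2', 'col3', 'col4', 'col5']

# chain(*enumerate(columns)) flattens (0,'col1'), (1,'col2'), ... into
# 0, 'col1', 1, 'col2', ... — the alternating key/value list create_map expects
flat = list(chain(*enumerate(columns)))
column_map = dict(zip(flat[::2], flat[1::2]))

print(column_map)
# {0: 'col1', 1: 'col2', 2: 'col3', 3: 'col4', 4: 'col5'}
```

So when 3300 sits in col3, `array_position` returns 3 (1-based), and `column_map[3]` is 'col4' — the column after the match.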

Update: to fetch fixed columns plus the value of the column right after 3300, use this:

df.withColumn('data',array(df.columns)).\
   withColumn('index',array_position(array(df.columns),3300)).\
   withColumn('value',col('data').getItem(col('index'))).\
   withColumn('columnName',column_map[col('index')]).\
   select('col1','col2','value')
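The same off-by-one trick can be mirrored in plain Python, which may make the edge cases easier to see (no match, or a match in the last column both yield null). The helper `next_after` below is hypothetical, just a sketch of the logic, not part of the Spark answer:

```python
columns = ['col1', 'col2', 'col3', 'col4', 'col5']
rows = [
    (1100, 1200, 3300, 4400, 5500),
    (3300, 1200, 3200, 4400, 5500),
    (1100, 1200, 3200, 4400, 3300),
    (1100, 3300, 3300, 4400, 5500),
]

def next_after(row, target=3300):
    # 1-based position of the first match, 0 if absent (like array_position)
    pos = row.index(target) + 1 if target in row else 0
    if 0 < pos < len(row):
        # 1-based position of the match == 0-based index of the next column
        return columns[pos], row[pos]
    return None, None  # no match, or the match is in the last column

for row in rows:
    print(next_after(row))
```

This reproduces the table above: the third row returns (None, None) because 3300 appears in the last column, so there is no "next" column to read.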

