How to extract a value that I don't know from every row of a Pyspark dataframe

I have a dataframe like this:

item_A  item_B item_C
  x       z      y
  z       x      y
  y       x      z
  z       y      x 

where all values are strings and I only know the values of x and y, but I need to get the value of z. The problem is that z is not always in the same column. I want to add a column containing only the value of z. I tried concatenating the columns and extracting the other strings that I know, but I don't know how to keep z (with a main_pattern = r'x|y'?)

Here is what I tried, but it isn't working:

from pyspark.sql.functions import regexp_extract
main_pattern = r'x|y'
pattern_full = r'((' + main_pattern + '),)'
df = df.withColumn("value_z", regexp_extract("columns_concatenated", pattern_full, 1))
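
As a side note on that attempt: the pattern expects a comma that the concatenated string doesn't contain, and in any case regexp_extract with main_pattern = r'x|y' captures the known values rather than the unknown one. One way to make the idea work is to invert it and strip the known values, keeping whatever remains (a minimal sketch, assuming the values never contain one another as substrings):

from pyspark.sql.functions import concat, regexp_replace

# remove every known value from the concatenation; whatever is left is z
df = df.withColumn('columns_concatenated', concat('item_A', 'item_B', 'item_C')) \
       .withColumn('value_z', regexp_replace('columns_concatenated', r'x|y', ''))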

We can create an array with all 3 columns and use array_except() to remove the known values, which leaves us with the z.

from pyspark.sql import functions as func

data_ls = [('x', 'z', 'y'), ('z', 'y', 'x'), ('x', 'y', 'y'), ('x', 'z', 'z')]
data_sdf = spark.sparkContext.parallelize(data_ls).toDF(['col1', 'col2', 'col3'])

# +----+----+----+
# |col1|col2|col3|
# +----+----+----+
# |   x|   z|   y|
# |   z|   y|   x|
# |   x|   y|   y|
# |   x|   z|   z|
# +----+----+----+

data_sdf. \
    withColumn('col_array', func.array(func.col('col1'), func.col('col2'), func.col('col3'))). \
    withColumn('z_val_arr', func.array_except('col_array', func.array(func.lit('x'), func.lit('y')))). \
    withColumn('z_val', func.col('z_val_arr')[0]). \
    show(truncate=False)

# +----+----+----+---------+---------+-----+
# |col1|col2|col3|col_array|z_val_arr|z_val|
# +----+----+----+---------+---------+-----+
# |x   |z   |y   |[x, z, y]|[z]      |z    |
# |z   |y   |x   |[z, y, x]|[z]      |z    |
# |x   |y   |y   |[x, y, y]|[]       |null |
# |x   |z   |z   |[x, z, z]|[z]      |z    |
# +----+----+----+---------+---------+-----+
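
Note that array_except() also deduplicates, which is why the [x, z, z] row still yields a single z, and a row containing only known values (the [x, y, y] row) produces an empty array, so z_val comes out null. If a placeholder is preferred over the null, func.coalesce() can supply one (a sketch building on the step above; the 'unknown' default is just an assumption):

data_sdf \
    .withColumn('z_val_arr', func.array_except(func.array('col1', 'col2', 'col3'),
                                               func.array(func.lit('x'), func.lit('y')))) \
    .withColumn('z_val', func.coalesce(func.col('z_val_arr')[0], func.lit('unknown'))) \
    .show()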

Alternatively, you can concatenate all the columns and use regexp_extract() to look for z in the concatenated string:

d1 = [['x', 'z', 'y'], ['z', 'x', 'y'], ['y', 'x', 'z'], ['z', 'y', 'x'], ['u', 'v', 'w']]
df1 = spark.createDataFrame(d1, ['item_A', 'item_B', 'item_C'])

from pyspark.sql.functions import col, concat, regexp_extract

df1.withColumn('columns_concatenated', concat(*df1.columns))\
    .withColumn('find_z', regexp_extract(col('columns_concatenated'), '(z)', 1))\
    .show(10, False)
+------+------+------+--------------------+------+
|item_A|item_B|item_C|columns_concatenated|find_z|
+------+------+------+--------------------+------+
|x     |z     |y     |xzy                 |z     |
|z     |x     |y     |zxy                 |z     |
|y     |x     |z     |yxz                 |z     |
|z     |y     |x     |zyx                 |z     |
|u     |v     |w     |uvw                 |      |
+------+------+------+--------------------+------+
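
The pattern '(z)' only works because it hard-codes the value being searched for; if z is genuinely unknown, one variation (a sketch, assuming every value is a single character, as in the sample data) is a negated character class that grabs the first character which is not one of the known values:

from pyspark.sql.functions import col, concat, regexp_extract

# match the first character that is neither x nor y; for the uvw row this picks u
df1.withColumn('columns_concatenated', concat(*df1.columns))\
    .withColumn('find_z', regexp_extract(col('columns_concatenated'), '([^xy])', 1))\
    .show(10, False)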
