PySpark: how to create a new DataFrame when values from a list appear in the rows of another DataFrame?

Question

I have a sample DataFrame in the "pyspark.sql.dataframe.DataFrame" format:

| ID | SampleColumn1| SampleColumn2 | SampleColumn3|
|--- |--------------| ------------  | ------------ |
| 1  |sample Apple  | sample Cherry | sample Lime  |
| 2  |sample Cherry | sample lemon  | sample Grape |

I would like to create a new DataFrame based on this initial DataFrame. If any of the values in a list [Apple, Lime, Cherry] appears in ANY of the columns of a row, the new DataFrame should show a 1 in that value's column for that row, and a 0 otherwise. In this case, the output should be:

listOfValues = ['Apple','Lime','Cherry']

| ID | Apple | Lime | Cherry |
|--- |-------|------|--------|
| 1  |  1    |  1   |    1   |
| 2  |  0    |  0   |    1   |

I currently have the following using plain pandas:

import pandas as pd

keywords = ['Apple', 'Lime', 'Cherry']
tmp = (df.melt(ignore_index=False)
        .value.str.extract(
            f'({"|".join(keywords)})',
            expand=False)
        .dropna())

res = (pd.crosstab(index=tmp.index, columns=tmp)
        .rename_axis(index=None, columns=None))

I would like to produce this output with PySpark, because the current platform does not allow pandas or plain Python commands.

Answer 1

Concatenate all columns, then iterate over the keywords and check whether each one occurs in the concatenated column. The check itself yields True/False; if you want 1 and 0 instead, wrap it in when() and otherwise().

from pyspark.sql import functions as F

df = spark.createDataFrame(
    data=[["1", "sample Apple", "sample Cherry", "sample Lime"],
          ["2", "sample Cherry", "sample lemon", "sample Grape"],
          ["3", "sample nothing", "sample nothing", "sample nothing"]],
    schema=["ID", "SampleColumn1", "SampleColumn2", "SampleColumn3"])
keywords = ['Apple', 'Lime', 'Cherry']
columns = [c for c in df.columns if c != "ID"]

# Join every non-ID column into a single string column.
df = df.select("ID", F.concat_ws(" ", *columns).alias("all"))

# Add one 0/1 column per keyword (case-insensitive substring match).
for k in keywords:
  df = df.withColumn(k, F.when(F.lower(F.col("all")).contains(k.lower()), F.lit(1)).otherwise(F.lit(0)))

df = df.drop("all")
df.show()

[Out]:
+---+-----+----+------+
| ID|Apple|Lime|Cherry|
+---+-----+----+------+
|  1|    1|   1|     1|
|  2|    0|   0|     1|
|  3|    0|   0|     0|
+---+-----+----+------+

Solution 1, posted 2022-12-05 10:28:57
