In PySpark: how to create a new DataFrame when values from a list appear in the rows of a different DataFrame?

I have a sample DataFrame of type "pyspark.sql.dataframe.DataFrame":

| ID | SampleColumn1 | SampleColumn2 | SampleColumn3 |
|----|---------------|---------------|---------------|
| 1  | sample Apple  | sample Cherry | sample Lime   |
| 2  | sample Cherry | sample lemon  | sample Grape  |

I would like to create a new DataFrame based on this initial one. If any value from a list [Apple, Lime, Cherry] appears in ANY column of a row, the new DataFrame should show a 1 in that value's column for that row, and a 0 otherwise. In this case, the output should be:

listOfValues = ['Apple','Lime','Cherry']

| ID | Apple | Lime | Cherry |
|----|-------|------|--------|
| 1  | 1     | 1    | 1      |
| 2  | 0     | 0    | 1      |

I currently have the following in plain pandas:

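import pandas as pd

# Assumes `df` is a pandas DataFrame; rows are identified by its index.
keywords = ['Apple', 'Lime', 'Cherry']

# Melt to long format, then pull the first matching keyword out of each cell.
tmp = (df.melt(ignore_index=False)
        .value.str.extract(
            f'({"|".join(keywords)})',
            expand=False)
        .dropna())

# Count keyword occurrences per original row.
res = (pd.crosstab(index=tmp.index, columns=tmp)
        .rename_axis(index=None, columns=None))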

I would like to produce this output with PySpark instead, because the current platform does not allow pandas or plain Python commands.

Concatenate all columns, then iterate over the keywords and check whether each one occurs in the concatenated column. The check itself yields True/False. If you want 1/0 instead, wrap it in when() and otherwise().

from pyspark.sql import functions as F

# `spark` is assumed to be an existing SparkSession.
df = spark.createDataFrame(
    data=[["1", "sample Apple", "sample Cherry", "sample Lime"],
          ["2", "sample Cherry", "sample lemon", "sample Grape"],
          ["3", "sample nothing", "sample nothing", "sample nothing"]],
    schema=["ID", "SampleColumn1", "SampleColumn2", "SampleColumn3"])
keywords = ['Apple', 'Lime', 'Cherry']
columns = [c for c in df.columns if c != "ID"]

# Collapse all non-ID columns into a single helper column.
df = df.select("ID", F.concat_ws(" ", *columns).alias("all"))

# Add one 1/0 column per keyword (case-insensitive substring match).
for k in keywords:
  df = df.withColumn(k, F.when(F.lower(F.col("all")).contains(k.lower()), F.lit(1)).otherwise(F.lit(0)))

df = df.drop("all")
df.show()

[Out]:
+---+-----+----+------+
| ID|Apple|Lime|Cherry|
+---+-----+----+------+
|  1|    1|   1|     1|
|  2|    0|   0|     1|
|  3|    0|   0|     0|
+---+-----+----+------+
