
PySpark - How to create a new DataFrame if values from a list are in a row of a different DataFrame?

I have a sample DataFrame of type "pyspark.sql.dataframe.DataFrame":

| ID | SampleColumn1| SampleColumn2 | SampleColumn3|
|--- |--------------| ------------  | ------------ |
| 1  |sample Apple  | sample Cherry | sample Lime  |
| 2  |sample Cherry | sample lemon  | sample Grape |

I would like to create a new DataFrame based on this initial DataFrame. If any of several values in a list [Apple, Lime, Cherry] appears in ANY of the columns for a row, the new DataFrame should hold a 1 in that value's column for that row, and a 0 otherwise. In this case, the output should be:

listOfValues = ['Apple','Lime','Cherry']

| ID | Apple | Lime | Cherry |
|--- |-------|------|--------|
| 1  |  1    |  1   |    1   |
| 2  |  0    |  0   |    1   |

I currently have the following using normal Pandas:

keywords = ['Apple', 'Lime', 'Cherry']
tmp = (df.melt(ignore_index=False)
        .value.str.extract(
            f'({"|".join(keywords)})',
            expand=False)
        .dropna())

res = (pd.crosstab(index=tmp.index, columns=tmp)
        .rename_axis(index=None, columns=None))

I would like to achieve the same output using PySpark, as the current platform does not allow the use of Pandas or normal Python commands.

Concatenate all columns, then iterate over the keywords and check whether each one occurs in the new concatenated column. The check itself gives you True or False. If you are interested in 1 and 0, wrap it in when() and otherwise().

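from pyspark.sql import functions as F

# `spark` is assumed to be an active SparkSession (predefined in most notebooks).
df = spark.createDataFrame(
    data=[["1", "sample Apple", "sample Cherry", "sample Lime"],
          ["2", "sample Cherry", "sample lemon", "sample Grape"],
          ["3", "sample nothing", "sample nothing", "sample nothing"]],
    schema=["ID", "SampleColumn1", "SampleColumn2", "SampleColumn3"])
keywords = ['Apple', 'Lime', 'Cherry']
columns = [c for c in df.columns if c != "ID"]

# Join every non-ID column into one string so each keyword needs a single check.
df = df.select("ID", F.concat_ws(" ", *columns).alias("all"))

# Add one 0/1 column per keyword; the match is a case-insensitive substring test.
for k in keywords:
  df = df.withColumn(k, F.when(F.lower(F.col("all")).contains(k.lower()), F.lit(1)).otherwise(F.lit(0)))

df = df.drop("all")
df.show()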

[Out]:
+---+-----+----+------+
| ID|Apple|Lime|Cherry|
+---+-----+----+------+
|  1|    1|   1|     1|
|  2|    0|   0|     1|
|  3|    0|   0|     0|
+---+-----+----+------+
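
Note that contains() is a substring test, so a keyword such as "Lime" would also match a cell like "sublime text". If whole-word matching is what you want, the rule can be expressed as a regular expression instead. The following is only a plain-Python sketch of the per-row rule, meant for checking the expected output on a few sample rows outside Spark. The helper name keyword_flags is made up for this illustration. The same pattern string should also be usable with Column.rlike in PySpark, since it only relies on (?i) and \b, which Java regular expressions support, but that part is not shown or tested here.

```python
import re


def keyword_flags(row_values, keywords):
    """Return {keyword: 1 or 0} for one row, matching whole words case-insensitively.

    Hypothetical helper for illustration; it mirrors the concat-then-check idea.
    """
    # Same idea as concat_ws(" ", ...): join the non-null cells into one string.
    text = " ".join(v for v in row_values if v is not None)
    flags = {}
    for k in keywords:
        # (?i) makes the match case-insensitive, \b anchors it to word boundaries.
        pattern = r"(?i)\b" + re.escape(k) + r"\b"
        flags[k] = 1 if re.search(pattern, text) else 0
    return flags


keywords = ["Apple", "Lime", "Cherry"]
rows = {
    "1": ["sample Apple", "sample Cherry", "sample Lime"],
    "2": ["sample Cherry", "sample lemon", "sample Grape"],
    "3": ["sample nothing", "sample nothing", "sample nothing"],
}
result = {row_id: keyword_flags(values, keywords) for row_id, values in rows.items()}
```

On the sample rows this agrees with the table above. The two approaches only differ for cells where a keyword appears inside a longer word.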
