
PySpark: create a new column based on whether a column's values are in another Spark DataFrame

I am trying to add a flag column to my Spark DataFrame that indicates whether each row's value in a given column appears in a separate DataFrame.

This is my main Spark DataFrame (df_main):

+--------+
|main    |
+--------+
|28asA017|
|03G12331|
|1567L044|
|02TGasd8|
|1asd3436|
|A1234567|
|B1234567|
+--------+

This is my reference (df_ref). There are hundreds of rows in this reference, so I obviously can't hard-code them like this solution or this one:

+--------+
|mask_vl |
+--------+
|A1234567|
|B1234567|
...
+--------+

Normally, what I'd do with a pandas DataFrame is this:

df_main['is_inref'] = np.where(df_main['main'].isin(df_ref.mask_vl.values), "YES", "NO")

So that I would get this:

+--------+--------+
|main    |is_inref|
+--------+--------+
|28asA017|NO      |
|03G12331|NO      |
|1567L044|NO      |
|02TGasd8|NO      |
|1asd3436|NO      |
|A1234567|YES     |
|B1234567|YES     |
+--------+--------+

I have tried the following code, but I don't understand what the error in the picture means.

df_main = df_main.withColumn('is_inref', "YES" if F.col('main').isin(df_ref) else "NO")
df_main.show(20, False)

[Image: the error raised by the code above]

You are close. I think the additional step you need is to explicitly create a list containing the values from df_ref.

Please see the illustration below:

# Create your DataFrames
df_main = spark.createDataFrame(["28asA017", "03G12331", "1567L044", "02TGasd8", "1asd3436", "A1234567", "B1234567"], "string").toDF("main")
df_ref = spark.createDataFrame(["A1234567", "B1234567"], "string").toDF("mask_vl")

Then you can create a list and use isin, almost as you already have it:

# Imports
from pyspark.sql.functions import col, when

# Create a list with the values of your reference DF
mask_vl_list = df_ref.select("mask_vl").rdd.flatMap(lambda x: x).collect()

# Use isin to check whether the values in your column exist in the list
df_main = df_main.withColumn('is_inref', when(col('main').isin(mask_vl_list), 'YES').otherwise('NO'))

This will give you:

>>> df_main.show()

+--------+--------+
|    main|is_inref|
+--------+--------+
|28asA017|      NO|
|03G12331|      NO|
|1567L044|      NO|
|02TGasd8|      NO|
|1asd3436|      NO|
|A1234567|     YES|
|B1234567|     YES|
+--------+--------+

If you want to avoid collect, I advise you to do the following:

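from pyspark.sql.functions import col, lit, when

# Rename the reference column to match df_main and add a marker column
df_ref = (df_ref
          .withColumnRenamed("mask_vl", "main")
          .withColumn("isPresent", lit("yes")))

# The left join keeps every row of df_main; rows without a match get a null marker
df_main = (df_main.join(df_ref, ["main"], "left_outer")
           .withColumn("is_inref", when(col("isPresent").isNull(), lit("NO")).otherwise(lit("YES")))
           .drop("isPresent"))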

I think this question has already been answered; you can take a look at "Spark: detect unchanged rows".
