
Update a column in a dataframe, based on the values in another dataframe

I have two dataframes, df1 and df2:

df1.show()
+---+--------+-----+----+--------+
|cA |   cB   |  cC | cD |   cE   |
+---+--------+-----+----+--------+
|  A|   abc  | 0.1 | 0.0|   0    |
|  B|   def  | 0.15| 0.5|   0    |
|  C|   ghi  | 0.2 | 0.2|   1    |
|  D|   jkl  | 1.1 | 0.1|   0    |
|  E|   mno  | 0.1 | 0.1|   0    |
+---+--------+-----+----+--------+


df2.show()
+---+--------+-----+----+--------+
|cA |   cB   |  cH | cI |   cJ   |
+---+--------+-----+----+--------+
|  A|   abc  | a   | b  |   ?    |
|  C|   ghi  | a   | c  |   ?    |
+---+--------+-----+----+--------+

I would like to update the cE column in df1 and set it to 1 if the row is referenced in df2. Each record is identified by the cA and cB columns.

Below is the desired output; note that the cE value of the first record was updated to 1:

+---+--------+-----+----+--------+
|cA |   cB   |  cC | cD |   cE   |
+---+--------+-----+----+--------+
|  A|   abc  | 0.1 | 0.0|   1    |
|  B|   def  | 0.15| 0.5|   0    |
|  C|   ghi  | 0.2 | 0.2|   1    |
|  D|   jkl  | 1.1 | 0.1|   0    |
|  E|   mno  | 0.1 | 0.1|   0    |
+---+--------+-----+----+--------+

Here is my answer.

It's Scala code - sorry for that - I don't have Python installed. Hopefully that helps.

import org.apache.spark.sql._
import org.apache.spark.sql.functions._

val ss = SparkSession.builder().master("local").getOrCreate()

import ss.implicits._

val seq1 = Seq(
  ("A", "abc", 0.1, 0.0, 0),
  ("B", "def", 0.15, 0.5, 0),
  ("C", "ghi", 0.2, 0.2, 1),
  ("D", "jkl", 1.1, 0.1, 0),
  ("E", "mno", 0.1, 0.1, 0)
)

val seq2 = Seq(
  ("A", "abc", "a", "b", "?"),
  ("C", "ghi", "a", "c", "?")
)


val df1 = ss.sparkContext.makeRDD(seq1).toDF("cA", "cB", "cC", "cD", "cE")
val df2 = ss.sparkContext.makeRDD(seq2).toDF("cA", "cB", "cH", "cI", "cJ")


val joined = df1.join(df2, (df1("cA") === df2("cA")).and(df1("cB") === df2("cB")), "left")

// Set newCe to 0 only for rows with no match in df2 that already had cE = 0;
// every other row (matched in df2, or already flagged 1) gets 1.
val res = joined.withColumn("newCe",
  when(df2("cA").isNull.and(joined("cE") === lit(0)), lit(0)).otherwise(lit(1)))


res.select(df1("cA"), df1("cB"), df1("cC"), df1("cD"), res("newCe"))
  .withColumnRenamed("newCe", "cE")
  .show

And the output for me is:

+---+---+----+---+---+
| cA| cB|  cC| cD| cE|
+---+---+----+---+---+
|  E|mno| 0.1|0.1|  0|
|  B|def|0.15|0.5|  0|
|  C|ghi| 0.2|0.2|  1|
|  A|abc| 0.1|0.0|  1|
|  D|jkl| 1.1|0.1|  0|
+---+---+----+---+---+

When you need to update a column's value based on another column, the when clause comes in handy. Please refer to the when and otherwise clauses.
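As a minimal, standalone illustration of the pattern (a sketch; the score column and the 0.5 threshold are made up for this example and are not part of the question):

import pyspark.sql.functions as F

# Hypothetical example: derive a 0/1 flag from a made-up 'score' column
df = df.withColumn('flag', F.when(F.col('score') > 0.5, 1).otherwise(0))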

import pyspark.sql.functions as F

df3 = (df1.join(df2, (df1.cA == df2.cA) & (df1.cB == df2.cB), "full")
          .withColumn('cE', F.when((df1.cA == df2.cA) & (df1.cB == df2.cB), 1).otherwise(0))
          .select(df1.cA, df1.cB, df1.cC, df1.cD, 'cE'))
df3.show()
+---+---+----+---+---+
| cA| cB|  cC| cD| cE|
+---+---+----+---+---+
|  E|mno| 0.1|0.1|  0|
|  B|def|0.15|0.5|  0|
|  C|ghi| 0.2|0.2|  1|
|  A|abc| 0.1|0.0|  1|
|  D|jkl| 1.1|0.1|  0|
+---+---+----+---+---+
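One caveat: a "full" join can also bring in rows that exist only in df2 (with nulls in the df1 columns). If the result should contain exactly the rows of df1, a "left" join against df2's key columns plus a marker column is a safer variant (a sketch; the matched helper column is made up for this example):

import pyspark.sql.functions as F

# Keep only df2's keys, plus a marker that survives the join as null/non-null
df2_keys = df2.select('cA', 'cB').withColumn('matched', F.lit(1))

df3 = (df1.join(df2_keys, ['cA', 'cB'], 'left')
          .withColumn('cE', F.when(F.col('matched').isNotNull(), 1).otherwise(F.col('cE')))
          .drop('matched'))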

Using join you can do what you want:

import pandas as pd

df1 = pd.DataFrame({'cA': ['A', 'B', 'C', 'D', 'E'], 'cB': ['abc', 'def', 'ghi', 'jkl', 'mno'], 'cE': [0, 0, 1, 0, 0]})
df2 = pd.DataFrame({'cA': ['A', 'C'], 'cB': ['abc', 'ghi'], 'cE': ['?', '?']})

# join
df = df1.join(df2.set_index(['cA', 'cB']),  lsuffix='_df1', rsuffix='_df2', on=['cA', 'cB'])

# NaN values indicate rows that are not present in both dataframes
df.loc[~df['cE_df2'].isna(), 'cE_df2'] = 1
df.loc[df['cE_df2'].isna(), 'cE_df2'] = 0

df1['cE'] = df['cE_df2']

Output:

    cA  cB  cE
0   A   abc 1
1   B   def 0
2   C   ghi 1
3   D   jkl 0
4   E   mno 0
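The two .loc assignments can also be collapsed into a single vectorized expression: notna() gives a boolean mask of matched rows, and astype(int) maps True/False to 1/0:

df1['cE'] = df['cE_df2'].notna().astype(int)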

Try this:

# Set cE to 1 on every df1 row whose (cA, cB) pair appears in df2
for i in df2.values:
    df1.loc[(df1.cA == i[0]) & (df1.cB == i[1]), ['cE']] = 1
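Looping over df2's rows works fine for small frames, but the same update can be done without a Python-level loop. Below is a sketch using pandas merge with indicator=True, which adds a _merge column marking whether each df1 row found a match in df2 (this assumes df1 has a default RangeIndex, as in the examples above):

import pandas as pd

# Left-merge on the key columns; indicator=True adds a '_merge' column
# whose value is 'both' for rows that also appear in df2
merged = df1.merge(df2[['cA', 'cB']], on=['cA', 'cB'], how='left', indicator=True)

# Flag matched rows with 1, keeping any pre-existing 1s (same semantics as the loop)
df1['cE'] = ((merged['_merge'] == 'both') | df1['cE'].eq(1)).astype(int)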
