
Pyspark - add columns to dataframe based on values from different dataframe

I have two dataframes.

AA = 

+---+----+---+-----+-----+
| id1|id2| nr|cell1|cell2|
+---+----+---+-----+-----+
|  1|   1|  0| ab2 | ac3 |
|  1|   1|  1| dg6 | jf2 |
|  2|   1|  1| 84d | kf6 |
|  2|   2|  1| 89m | k34 |
|  3|   1|  0| 5bd | nc4 |
+---+----+---+-----+-----+

and a second dataframe BB , which looks like:

BB =

+---+----+---+-----+
| a |   b|use|cell |
+---+----+---+-----+
|  1|   1|  x| ab2 |
|  1|   1|  a| dg6 |
|  2|   1|  b| 84d |
|  2|   2|  t| 89m |
|  3|   1|  d| 5bd |
+---+----+---+-----+

where the cell column of BB contains all the possible cells that can appear in the cell1 and cell2 columns of AA ( cell1 - cell2 is an interval).

I want to add two columns to BB , val1 and val2 . The conditions are the following.

val1 is 1 when:
             id1 == id2 (in AA),
         and cell (in BB) == cell1 or cell2 (in AA),
         and nr == 1 in AA,

and 0 otherwise.

The other column is constructed according to:

val2 is 1 when:
           id1 != id2 (in AA),
      and  cell (in BB) == cell1 or cell2 (in AA),
      and  nr == 1 in AA,

      and 0 otherwise.

My attempt: I tried to work with:

from pyspark.sql.functions import when, col

condition = col("id1") == col("id2")
result = df.withColumn("val1", when(condition, 1))
result.show()

But it soon became apparent that this task is way over my pyspark skill level.

EDIT:

I am trying to run:

condition1 = AA.id1 == AA.id2
condition2 = AA.nr == 1
condition3 = AA.cell1 == BB.cell  | AA.cell2 == BB.cell

result = BB.withColumn("val1", when(condition1 & condition2 & condition3, 1).otherwise(0)

Gives an error inside a Zeppelin notebook:

Traceback (most recent call last):
  File "/tmp/zeppelin_pyspark-4362.py", line 344, in <module>
    code = compile('\n'.join(final_code), '<stdin>', 'exec', ast.PyCF_ONLY_AST, 1)
  File "<stdin>", line 6
    __zeppelin__._displayhook()
               ^
SyntaxError: invalid syntax

EDIT2: Thanks for the correction, I was missing a closing bracket. However, now I get:

ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.

Which is awkward, since I am already using these operators.
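As a side note, this ValueError usually points at Python operator precedence rather than at the operators themselves: `|` binds more tightly than `==`, so `AA.cell1 == BB.cell | AA.cell2 == BB.cell` is parsed as the chained comparison `AA.cell1 == (BB.cell | AA.cell2) == BB.cell`, and Python then has to convert the intermediate Column to a plain bool. A minimal sketch with a hypothetical stand-in class (the `Col` class below is illustrative only, not pyspark's real Column):

```python
# Tiny stand-in for pyspark.sql.Column, only to show where the
# ValueError comes from; Col and err are illustrative names.
class Col:
    def __init__(self, name):
        self.name = name

    def __eq__(self, other):
        # comparisons build a new column expression instead of a bool
        return Col(f"({self.name} == {other.name})")

    def __or__(self, other):
        return Col(f"({self.name} | {other.name})")

    def __bool__(self):
        # pyspark's Column raises a similar error whenever Python
        # needs a plain bool from it
        raise ValueError("Cannot convert column into bool")


cell1, cell2, cell = Col("cell1"), Col("cell2"), Col("cell")

# Parenthesized: builds a column expression, no error
ok = (cell1 == cell) | (cell2 == cell)

# Unparenthesized: `|` binds tighter than `==`, so this parses as the
# chained comparison  cell1 == (cell | cell2) == cell , and Python asks
# the intermediate expression for a bool -> ValueError
err = None
try:
    bad = cell1 == cell | cell2 == cell
except ValueError as e:
    err = e
```

So `condition3` needs explicit parentheses around each comparison: `(AA.cell1 == BB.cell) | (AA.cell2 == BB.cell)`. Referencing AA columns inside `BB.withColumn` would then still fail, for the reason the answer explains: the two dataframes have to be joined first.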

In my opinion the best way is a join of the two dataframes; then you can model the conditions in the when clause. If you create a new column with withColumn , it iterates over the values of the current dataframe, but you cannot access values from another dataframe and expect it to iterate through that dataframe's rows as well. The following code should fulfill your request:

from pyspark.sql.functions import when, col

df_aa = spark.createDataFrame([
    (1, 1, 0, "ab2", "ac3"),
    (1, 1, 1, "dg6", "jf2"),
    (2, 1, 1, "84d", "kf6"),
    (2, 2, 1, "89m", "k34"),
    (3, 1, 0, "5bd", "nc4")
], ("id1", "id2", "nr", "cell1", "cell2"))

df_bb = spark.createDataFrame([
    (1, 1, "x", "ab2"),
    (1, 1, "a", "dg6"),
    (2, 1, "b", "84d"),
    (2, 2, "t", "89m"),
    (3, 1, "d", "5bd")
], ("a", "b", "use", "cell"))

cond = (df_bb.cell == df_aa.cell1) | (df_bb.cell == df_aa.cell2)

match = (col("cell") == col("cell1")) | (col("cell") == col("cell2"))

df_bb.join(df_aa, cond, how="full") \
    .withColumn("val1", when((col("id1") == col("id2")) & match & (col("nr") == 1), 1).otherwise(0)) \
    .withColumn("val2", when(~(col("id1") == col("id2")) & match & (col("nr") == 1), 1).otherwise(0)) \
    .show()

Result looks like:

+---+---+---+----+---+---+---+-----+-----+----+----+
|  a|  b|use|cell|id1|id2| nr|cell1|cell2|val1|val2|
+---+---+---+----+---+---+---+-----+-----+----+----+
|  1|  1|  x| ab2|  1|  1|  0|  ab2|  ac3|   0|   0|
|  1|  1|  a| dg6|  1|  1|  1|  dg6|  jf2|   1|   0|
|  2|  1|  b| 84d|  2|  1|  1|  84d|  kf6|   0|   1|
|  2|  2|  t| 89m|  2|  2|  1|  89m|  k34|   1|   0|
|  3|  1|  d| 5bd|  3|  1|  0|  5bd|  nc4|   0|   0|
+---+---+---+----+---+---+---+-----+-----+----+----+
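For comparison, the same join-plus-flags logic can be traced in plain Python over the sample rows. This is only a hand-rolled sketch to make the conditions concrete, not how you would do it at scale in Spark:

```python
# Plain-Python sketch of the join + val1/val2 logic over the sample data,
# without a Spark session. Tuples mirror the AA and BB rows above.
aa = [
    (1, 1, 0, "ab2", "ac3"),
    (1, 1, 1, "dg6", "jf2"),
    (2, 1, 1, "84d", "kf6"),
    (2, 2, 1, "89m", "k34"),
    (3, 1, 0, "5bd", "nc4"),
]
bb = [
    (1, 1, "x", "ab2"),
    (1, 1, "a", "dg6"),
    (2, 1, "b", "84d"),
    (2, 2, "t", "89m"),
    (3, 1, "d", "5bd"),
]

rows = []
for a, b, use, cell in bb:
    for id1, id2, nr, cell1, cell2 in aa:
        if cell in (cell1, cell2):              # the join condition
            val1 = int(id1 == id2 and nr == 1)  # same ids, nr == 1
            val2 = int(id1 != id2 and nr == 1)  # different ids, nr == 1
            rows.append((a, b, use, cell, val1, val2))
```

Each BB row pairs up with the matching AA row via the cell columns, and the two flags fall out of the id and nr checks, matching the val1/val2 columns in the table above.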

It could be that I do not even need to check the condition cell == cell1 | cell == cell2 , since that is pretty much the join condition, but to keep the when conditions similar to your requirements, I put it there.

