
PySpark. Multi-dataframe operations

I have the following two dataframes:

Catalog:
+--------+-------+
| Type   | Value |
+========+=======+
| Cat    | 3     |
+--------+-------+
| Dog    | 2     |
+--------+-------+
| Goose  | 1     |
+--------+-------+

And

+----+-------+----------+
| ID | ITEM  | QUANTITY |
+====+=======+==========+
| 1  | CAT   | 10.0     |
+----+-------+----------+
| 1  | DOG   | 1.0      |
+----+-------+----------+
| 1  | GOOSE | 0.1      |
+----+-------+----------+
| 2  | CAT   | 0.01     |
+----+-------+----------+
| 2  | DOG   | 0.001    |
+----+-------+----------+
| 3  | GOOSE | 0.0001   |
+----+-------+----------+

My goal is to create the following new column:

+----+-------+----------+--------+
| ID | ITEM  | QUANTITY | Value  |
+====+=======+==========+========+
| 1  | CAT   | 10.0     | 30     |
+----+-------+----------+--------+
| 1  | DOG   | 1.0      | 2      |
+----+-------+----------+--------+
| 1  | GOOSE | 0.1      | 0.1    |
+----+-------+----------+--------+
| 2  | CAT   | 0.01     | 0.03   |
+----+-------+----------+--------+
| 2  | DOG   | 0.001    | 0.002  |
+----+-------+----------+--------+
| 3  | GOOSE | 0.0001   | 0.0001 |
+----+-------+----------+--------+

Using PySpark, I need to multiply (or in some cases divide) the values in the QUANTITY column by the values in the Value column, matched by item/type. How can I do this?

df.show()
df1.show()

+-----+-----+
| Type|Value|
+-----+-----+
|  Cat|    3|
|  Dog|    2|
|Goose|    1| #df
+-----+-----+

+---+-----+--------+
| ID| ITEM|QUANTITY|
+---+-----+--------+
|  1|  CAT|    10.0|
|  1|  DOG|     1.0|
|  1|GOOSE|     0.1|
|  2|  CAT|    0.01|
|  2|  DOG|   0.001| #df1
|  3|GOOSE|  1.0E-4|
+---+-----+--------+

You can join on upper(Type), because ITEM is all upper case, and then create the new Value column by multiplying with *.

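from pyspark.sql import functions as F  # F is the conventional alias for pyspark.sql.functions
# Join on the upper-cased Type, then overwrite Value with Value * QUANTITY.
df1.join(df, F.expr("ITEM = upper(Type)")).drop("Type")\
   .withColumn("Value", F.col("Value") * F.col("QUANTITY")).orderBy("ID", "ITEM").show(truncate=False)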

+---+-----+--------+------+
|ID |ITEM |QUANTITY|Value |
+---+-----+--------+------+
|1  |CAT  |10.0    |30.0  |
|1  |DOG  |1.0     |2.0   |
|1  |GOOSE|0.1     |0.1   |
|2  |CAT  |0.01    |0.03  |
|2  |DOG  |0.001   |0.002 |
|3  |GOOSE|1.0E-4  |1.0E-4|
+---+-----+--------+------+
