PySpark: Passing a DataFrame to a pandas_udf and returning a Series
PySpark: Multi-dataframe operations
I have the following two dataframes.
Catalog:
+--------+-------+
| Type | Value |
+========+=======+
| Cat | 3 |
+--------+-------+
| Dog | 2 |
+--------+-------+
| Goose | 1 |
+--------+-------+
and
+----+-------+----------+
| ID | ITEM | QUANTITY |
+====+=======+==========+
| 1 | CAT | 10.0 |
+----+-------+----------+
| 1 | DOG | 1.0 |
+----+-------+----------+
| 1 | GOOSE | 0.1 |
+----+-------+----------+
| 2 | CAT | 0.01 |
+----+-------+----------+
| 2 | DOG | 0.001 |
+----+-------+----------+
| 3 | GOOSE | 0.0001 |
+----+-------+----------+
My goal is to create the following new column:
+----+-------+----------+--------+
| ID | ITEM | QUANTITY | Value |
+====+=======+==========+========+
| 1 | CAT | 10.0 | 30 |
+----+-------+----------+--------+
| 1 | DOG | 1.0 | 2 |
+----+-------+----------+--------+
| 1 | GOOSE | 0.1 | 0.1 |
+----+-------+----------+--------+
| 2 | CAT | 0.01 | 0.03 |
+----+-------+----------+--------+
| 2 | DOG | 0.001 | 0.002 |
+----+-------+----------+--------+
| 3 | GOOSE | 0.0001 | 0.0001 |
+----+-------+----------+--------+
Using PySpark, I need to multiply the values in the QUANTITY column (or, in some cases, divide them) by the Value whose Type matches the ITEM. How can I do this?
df.show()
+-----+-----+
| Type|Value|
+-----+-----+
|  Cat|    3|
|  Dog|    2|
|Goose|    1|
+-----+-----+

df1.show()
+---+-----+--------+
| ID| ITEM|QUANTITY|
+---+-----+--------+
|  1|  CAT|    10.0|
|  1|  DOG|     1.0|
|  1|GOOSE|     0.1|
|  2|  CAT|    0.01|
|  2|  DOG|   0.001|
|  3|GOOSE|  1.0E-4|
+---+-----+--------+
You can join on upper(Type), since ITEM is all uppercase, and then create the new Value column by multiplying it with QUANTITY.
from pyspark.sql import functions as F

df1.join(df, F.expr("ITEM = upper(Type)")).drop("Type") \
   .withColumn("Value", F.col("Value") * F.col("QUANTITY")) \
   .orderBy("ID", "ITEM").show(truncate=False)
+---+-----+--------+------+
|ID |ITEM |QUANTITY|Value |
+---+-----+--------+------+
|1 |CAT |10.0 |30.0 |
|1 |DOG |1.0 |2.0 |
|1 |GOOSE|0.1 |0.1 |
|2 |CAT |0.01 |0.03 |
|2 |DOG |0.001 |0.002 |
|3 |GOOSE|1.0E-4 |1.0E-4|
+---+-----+--------+------+
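For a quick sanity check of the join-and-multiply logic without a Spark session, the same computation can be sketched in plain Python; the dictionary and list below simply mirror the two dataframes from the question:

```python
# Catalog dataframe as a plain dict: Type -> Value
catalog = {"Cat": 3, "Dog": 2, "Goose": 1}

# Mirror the join condition ITEM = upper(Type) with an uppercase-keyed lookup
lookup = {k.upper(): v for k, v in catalog.items()}

# Rows of df1: (ID, ITEM, QUANTITY)
rows = [
    (1, "CAT", 10.0),
    (1, "DOG", 1.0),
    (1, "GOOSE", 0.1),
    (2, "CAT", 0.01),
    (2, "DOG", 0.001),
    (3, "GOOSE", 0.0001),
]

# Equivalent of the join followed by Value = Value * QUANTITY
result = [(id_, item, qty, lookup[item] * qty) for id_, item, qty in rows]

for row in result:
    print(row)
```

This reproduces the expected Value column (30.0, 2.0, 0.1, ...), which matches the joined output above.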