Conditional filter threshold on pandas dataframe column based on another column value
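Suppose I have a dataframe with two columns, and I want to filter the values in the second column against a threshold that depends on the value in the first column. The thresholds are defined in a dictionary whose keys are first-column values and whose values are the thresholds. There is also a default threshold for first-column values that have no entry of their own.
For example: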
thresholds_dict = {"A": 5, "B": 2, "C": 4, "default": 0}
sample_dataframe =
| Column1 | Column2 |
| A | 3 |
| A | 6 |
| B | 4 |
| B | 1 |
| C | 2 |
| D | 0 |
# Get threshold from dict based on value of Column1 on ...
result_dataframe = sample_dataframe[sample_dataframe["Column2"] >= ...]
result_dataframe =
| Column1 | Column2 |
| A | 6 |
| B | 4 |
| D | 0 |
What is the best way to achieve this? (I'm not sure what to write in the ... part.)
Here is a PySpark version.
Your dataframe:
from pyspark.sql import functions as F
sample_dataframe = spark.createDataFrame(
    [("A", 3),
     ("A", 6),
     ("B", 4),
     ("B", 1),
     ("C", 2),
     ("D", 0)],
    ["Column1", "Column2"]
)
thresholds_dict = {"A": 5, "B": 2, "C": 4, "default": 0}
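Script:
# Start from a condition that never matches so that .when() can be chained in the loop
comparison = F.when(F.lit(False), None)
# Add one branch per dictionary key: if Column1 equals the key, use its threshold
for k, v in thresholds_dict.items():
    comparison = comparison.when(F.col("Column1") == k, v)
# Any Column1 value without its own entry falls back to the default threshold
comparison = comparison.otherwise(thresholds_dict["default"])
result_dataframe = sample_dataframe.filter(F.col("Column2") >= comparison)
result_dataframe.show()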
# +-------+-------+
# |Column1|Column2|
# +-------+-------+
# |      A|      6|
# |      B|      4|
# |      D|      0|
# +-------+-------+
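Since the question asks about a pandas dataframe, here is a minimal pandas sketch of the same idea. It is not part of the answer above. It assumes the column names and dictionary from the question, and the name row_thresholds is only illustrative. Each Column1 value is looked up in the dictionary with Series.map, and values without an entry are filled with the "default" threshold.

```python
import pandas as pd

thresholds_dict = {"A": 5, "B": 2, "C": 4, "default": 0}

sample_dataframe = pd.DataFrame(
    {
        "Column1": ["A", "A", "B", "B", "C", "D"],
        "Column2": [3, 6, 4, 1, 2, 0],
    }
)

# Look up a threshold for every row from its Column1 value.
# Values missing from the dict (here "D") map to NaN, which is then
# replaced by the "default" threshold.
row_thresholds = (
    sample_dataframe["Column1"]
    .map(thresholds_dict)
    .fillna(thresholds_dict["default"])
)

# Keep only the rows whose Column2 value meets its row-specific threshold.
result_dataframe = sample_dataframe[sample_dataframe["Column2"] >= row_thresholds]
print(result_dataframe)
```

With the sample data this keeps the rows (A, 6), (B, 4) and (D, 0), matching the expected result in the question. Because the lookup is vectorized, there is no need to loop over the dictionary keys.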