Pyspark, writing a loop to create multiple new columns based on different conditions
Let's say I have a Pyspark DataFrame with the following columns:
user, score, country, risky/safe, payment_id
I made a list of thresholds: [10, 20, 30]
Now I want to make a new column for each threshold:
Both of them should be grouped by country.
The result should be something like this:
Country | % payments thresh 10 | % users thresh 10 | % payments thresh 20 ...
A
B
C
I was able to make it work with an external for loop, but I want it all in one dataframe.
from pyspark.sql import functions as F

thresholds = [10, 20, 30]
for thresh in thresholds:
    df = (df
          .select('country', 'risk/safe', 'user', 'score', 'payment')
          .where(F.col('risk/safe') == 'risk')
          .groupBy('country')
          .agg((F.sum(F.when(F.col('score') >= thresh, 1)) /
                F.count('country')).alias('% payments')))
Use a list comprehension within the agg().
from pyspark.sql import functions as func

pay_aggs = [
    (func.sum((func.col('score') >= thresh).cast('int')) / func.count('country'))
    .alias('% pay ' + str(thresh))
    for thresh in thresholds
]
user_aggs = [
    (func.countDistinct(func.when(func.col('score') >= thresh, func.col('user'))) /
     func.countDistinct('user'))
    .alias('% user ' + str(thresh))
    for thresh in thresholds
]
df. \
    select('country', 'risk/safe', 'user', 'score', 'payment'). \
    where(func.col('risk/safe') == 'risk'). \
    groupBy('country'). \
    agg(*pay_aggs, *user_aggs)
The pay_aggs list will generate the following aggregations (you can easily print the list):
# [Column<'(sum(CAST((score >= 10) AS INT)) / count(country)) AS `% pay 10`'>,
# Column<'(sum(CAST((score >= 20) AS INT)) / count(country)) AS `% pay 20`'>,
# Column<'(sum(CAST((score >= 30) AS INT)) / count(country)) AS `% pay 30`'>]
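If it helps to see what these aggregations compute, here is a minimal pure-Python sanity check of the same logic (no Spark session needed). The sample rows and helper function names are made up for illustration; "% pay" counts payments at or above the threshold over all payments per country, and "% user" counts distinct users with any qualifying score over all distinct users per country.

```python
# Hypothetical sample rows: (country, user, score), already filtered to
# 'risk' rows as the where() clause in the answer would do.
rows = [
    ("A", "u1", 15),
    ("A", "u1", 25),
    ("A", "u2", 5),
    ("B", "u3", 35),
]

def pct_payments(country, thresh):
    # Fraction of this country's payments whose score meets the threshold,
    # mirroring sum((score >= thresh).cast('int')) / count('country').
    sub = [r for r in rows if r[0] == country]
    return sum(r[2] >= thresh for r in sub) / len(sub)

def pct_users(country, thresh):
    # Fraction of this country's distinct users with at least one qualifying
    # score, mirroring countDistinct(when(...)) / countDistinct('user').
    sub = [r for r in rows if r[0] == country]
    users_above = {r[1] for r in sub if r[2] >= thresh}
    all_users = {r[1] for r in sub}
    return len(users_above) / len(all_users)

print(pct_payments("A", 10))  # 2 of 3 payments in A score >= 10
print(pct_users("A", 10))     # 1 of 2 users in A has a score >= 10 -> 0.5
```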