
pyspark count rows with two conditions (AND statement)

I have the following code:

from pyspark.sql import functions as sf
data = [{"Category": 'Category A', "ID": 1, "Value": 12.40},
        {"Category": 'Category B', "ID": 2, "Value": 30.10},
        {"Category": 'Category C', "ID": 3, "Value": 100.01}]
df = spark.createDataFrame(data)
print(df.schema)
df.show()
df.groupBy().agg(sf.count(sf.when(sf.col("Value")>13, True))).show()

This code gives:

+----------+---+------+
|  Category| ID| Value|
+----------+---+------+
|Category A|  1|  12.4|
|Category B|  2|  30.1|
|Category C|  3|100.01|
+----------+---+------+


+-------------------------------------------+
|count(CASE WHEN (Value > 13) THEN true END)|
+-------------------------------------------+
|                                          2|
+-------------------------------------------+

This gives the total count of values greater than 13. However, I want to find the total count of values greater than 13 and less than 100. The answer should be '1'. The code

df.groupBy().agg(sf.count(sf.when(sf.col("Value")>13, True)),sf.count(sf.when(sf.col("Value")<100,True))).show()

returns:

+-------------------------------------------+--------------------------------------------+
|count(CASE WHEN (Value > 13) THEN true END)|count(CASE WHEN (Value < 100) THEN true END)|
+-------------------------------------------+--------------------------------------------+
|                                          2|                                           2|
+-------------------------------------------+--------------------------------------------+

This isn't correct: it gives the count of values greater than 13, which is '2', and separately the count of values less than 100, which is also '2'. It does not combine the two 'when' conditions. I also tried:

 df.groupBy().agg(sf.count(sf.when(sf.col("Value")>13 & sf.col("Value")<100),True)).show()

which gives an error: py4j.protocol.Py4JError: An error occurred while calling o134.and. Trace:

So what is the right code to apply the 'and' condition and get the desired output of '1'?

There are many ways to achieve this.

I will show you the one that resembles your code (here `f` is assumed to be an alias for `pyspark.sql.functions`, and the DataFrame is one with an `id` column):

df.groupby(f.when(f.col('id')<5,f.lit('lessThan5')).when(f.col('id')>=5,f.lit('GreaterOrEquals to 5'))).count().show()

+------------------------------------------------------------------------------+-----+
|CASE WHEN (id < 5) THEN lessThan5 WHEN (id >= 5) THEN GreaterOrEquals to 5 END|count|
+------------------------------------------------------------------------------+-----+
|                                                                     lessThan5|    2|
|                                                          GreaterOrEquals to 5|    3|
+------------------------------------------------------------------------------+-----+

Hope it helps.

I believe you just need to wrap each condition in parentheses, so instead of:

df.groupBy().agg(sf.count(sf.when(sf.col("Value")>13 & sf.col("Value")<100),True)).show()

you can do:

df \
.groupBy() \
.agg(
   sf.count(
     # True is the value argument of sf.when, so it goes inside its parentheses
     sf.when((sf.col("Value") > 13) & (sf.col("Value") < 100), True)
   )
).show()
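
The reason the parentheses matter is Python's operator precedence: `&` binds more tightly than `>` and `<`, so the unparenthesized expression is read as a chained comparison around `13 & x`. In PySpark that means `13 & sf.col("Value")` is evaluated first, which is most likely why the error mentions `and`. Below is a minimal sketch using only the standard library; the name `x` is a stand-in for `sf.col("Value")` and the value 200 is an arbitrary illustration, neither comes from the question.

```python
import ast

# Parse both spellings; `x` is a hypothetical stand-in for sf.col("Value").
unparenthesized = ast.parse("x > 13 & x < 100", mode="eval").body
parenthesized = ast.parse("(x > 13) & (x < 100)", mode="eval").body

# Without parentheses the top-level node is a chained comparison whose
# middle operand is the bitwise AND `13 & x`.
print(type(unparenthesized).__name__)                # Compare
print(ast.unparse(unparenthesized.comparators[0]))   # 13 & x

# With parentheses the top-level node is the `&` operation itself.
print(type(parenthesized).__name__)                  # BinOp

# The same effect with a plain integer: the two spellings disagree.
x = 200
wrong = x > 13 & x < 100        # evaluated as x > (13 & x) < 100
right = (x > 13) & (x < 100)    # the intended "between 13 and 100" test
print(wrong, right)             # True False
```

The same rule applies to `|` and `~` when combining column conditions, so wrapping every comparison in its own parentheses is the safe habit.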
