Get the distinct elements of each group by other field on a Spark 1.6 Dataframe

I'm trying to group by date in a Spark dataframe and, for each group, count the unique values of one column:

test.json
{"name":"Yin", "address":1111111, "date":20151122045510}
{"name":"Yin", "address":1111111, "date":20151122045501}
{"name":"Yln", "address":1111111, "date":20151122045500}
{"name":"Yun", "address":1111112, "date":20151122065832}
{"name":"Yan", "address":1111113, "date":20160101003221}
{"name":"Yin", "address":1111111, "date":20160703045231}
{"name":"Yin", "address":1111114, "date":20150419134543}
{"name":"Yen", "address":1111115, "date":20151123174302}

And the code:

import pyspark.sql.functions as func
from pyspark.sql.types import TimestampType
from datetime import datetime

df_y = sqlContext.read.json("/user/test.json")
# "date" is read from the JSON as a number, so convert it to a string before parsing
udf_dt = func.udf(lambda x: datetime.strptime(str(x), '%Y%m%d%H%M%S'), TimestampType())
df = df_y.withColumn('datetime', udf_dt(df_y.date))
df_g = df.groupby(func.hour(df.datetime))
df_g.count().distinct().show()

The results with pyspark are:

df_y.groupby(df_y.name).count().distinct().show()
+----+-----+
|name|count|
+----+-----+
| Yan|    1|
| Yun|    1|
| Yin|    4|
| Yen|    1|
| Yln|    1|
+----+-----+

And what I'm expecting is something like this with pandas:

df = df_y.toPandas()
df.groupby('name').address.nunique()
Out[51]: 
name
Yan    1
Yen    1
Yin    2
Yln    1
Yun    1

How can I get the unique elements of each group by another field, like address?

You can count the distinct elements of each group with the function countDistinct:

import pyspark.sql.functions as func
from pyspark.sql.types import TimestampType
from datetime import datetime

df_y = sqlContext.read.json("/user/test.json")
# "date" is read from the JSON as a number, so convert it to a string before parsing
udf_dt = func.udf(lambda x: datetime.strptime(str(x), '%Y%m%d%H%M%S'), TimestampType())
df = df_y.withColumn('datetime', udf_dt(df_y.date))
df_g = df.groupby(func.hour(df.datetime))
df_y.groupby(df_y.name).agg(func.countDistinct('address')).show()

+----+--------------+
|name|count(address)|
+----+--------------+
| Yan|             1|
| Yun|             1|
| Yin|             2|
| Yen|             1|
| Yln|             1|
+----+--------------+

The docs are available [here](https://spark.apache.org/docs/1.6.0/api/java/org/apache/spark/sql/functions.html#countDistinct(org.apache.spark.sql.Column,%20org.apache.spark.sql.Column...)).
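
To make the difference between the two results concrete, here is a plain-Python sketch (no Spark needed) that reproduces both numbers from the sample rows in test.json. The helper names `count_rows_per_group` and `count_distinct_per_group` and the `rows` list are made up for this illustration; they are not part of any Spark API. The sketch assumes that groupby(...).count() reports rows per group and that countDistinct reports distinct values per group, which is what the outputs in the question and in this answer show.

```python
from collections import defaultdict

# (name, address) pairs taken from test.json; the date field is not needed here.
rows = [
    ("Yin", 1111111),
    ("Yin", 1111111),
    ("Yln", 1111111),
    ("Yun", 1111112),
    ("Yan", 1111113),
    ("Yin", 1111111),
    ("Yin", 1111114),
    ("Yen", 1111115),
]


def count_rows_per_group(pairs):
    """Hypothetical helper: number of rows per key, like groupby(...).count()."""
    counts = defaultdict(int)
    for key, _ in pairs:
        counts[key] += 1
    return dict(counts)


def count_distinct_per_group(pairs):
    """Hypothetical helper: number of distinct values per key, like countDistinct."""
    seen = defaultdict(set)
    for key, value in pairs:
        seen[key].add(value)  # a set keeps each address only once
    return {key: len(values) for key, values in seen.items()}


row_counts = count_rows_per_group(rows)
distinct_counts = count_distinct_per_group(rows)
print(row_counts["Yin"], distinct_counts["Yin"])  # 4 2
```

Yin appears in four rows but with only two different addresses, so the row count is 4 while the distinct count is 2. Calling distinct() after count() only removes duplicate (name, count) result rows; it does not change what was counted.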

A concise and direct answer: group by a field "_c1" and count the distinct values of field "_c2":

import pyspark.sql.functions as F

dg = df.groupBy("_c1").agg(F.countDistinct("_c2"))
