Populate distinct of column based on another column in PySpark
I have a data frame like the one below in PySpark. From this data I want to select serial_num, devicetype, device_model, and the distinct count of timestamp for each serial_num:
+-------------+-----------------+---------------+------------------------+
| serial_num | devicetype | device_model | timestamp |
+-------------+-----------------+---------------+------------------------+
| 58172A0396 | | | 2003-01-02 17:37:15.0 |
| 58172A0396 | | | 2003-01-02 17:37:15.0 |
| 46C5Y00693 | Mac Pro | Mac PC | 2018-01-03 17:17:23.0 |
| 1737K7008F | Windows PC | Windows PC | 2018-01-05 11:12:31.0 |
| 1737K7008F | Network Device | Unknown | 2018-01-05 11:12:31.0 |
| 1737K7008F | Network Device | Unknown | 2018-01-05 11:12:31.0 |
| 1737K7008F | Network Device | | 2018-01-06 03:12:52.0 |
| 1737K7008F | Windows PC | Windows PC | 2018-01-06 03:12:52.0 |
| 1737K7008F | Network Device | Unknown | 2018-01-06 03:12:52.0 |
| 1665NF01F3 | Network Device | Unknown | 2018-01-07 03:42:34.0 |
+-------------+-----------------+---------------+------------------------+
I have tried the following:
df1 = df.select('serial_num', 'devicetype', 'device_model', f.count('distinct timestamp').over(Window.partitionBy('serial_num')).alias('val'))
The result I want is:
+-------------+-----------------+---------------+-----+
| serial_num | devicetype | device_model |count|
+-------------+-----------------+---------------+-----+
| 58172A0396 | | | 1 |
| 58172A0396 | | | 1 |
| 46C5Y00693 | Mac Pro | Mac PC | 1 |
| 1737K7008F | Windows PC | Windows PC | 2 |
| 1737K7008F | Network Device | Unknown | 2 |
| 1737K7008F | Network Device | Unknown | 2 |
| 1737K7008F | Network Device | | 2 |
| 1737K7008F | Windows PC | Windows PC | 2 |
| 1737K7008F | Network Device | Unknown | 2 |
| 1665NF01F3 | Network Device | Unknown | 1 |
+-------------+-----------------+---------------+-----+
How can I achieve this?
Unfortunately, countDistinct is not supported over a window. However, a combination of collect_set and size can be used to achieve the same end result. This only works on Spark 2.0+; use it as follows:
import pyspark.sql.functions as F
from pyspark.sql.window import Window

w = Window.partitionBy('serial_num')
# collect the distinct timestamps per serial_num into a set, then take its size
df1 = df.select(..., F.size(F.collect_set('timestamp').over(w)).alias('count'))
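For reference, here is a minimal end-to-end sketch of this approach, assuming a local SparkSession and using a small illustrative sample of the data above (the session setup and sample rows are mine, not from the question):

from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.master('local[*]').getOrCreate()

# A few illustrative rows from the question's data
df = spark.createDataFrame(
    [('58172A0396', '', '', '2003-01-02 17:37:15.0'),
     ('1737K7008F', 'Windows PC', 'Windows PC', '2018-01-05 11:12:31.0'),
     ('1737K7008F', 'Network Device', 'Unknown', '2018-01-05 11:12:31.0'),
     ('1737K7008F', 'Network Device', 'Unknown', '2018-01-06 03:12:52.0')],
    ['serial_num', 'devicetype', 'device_model', 'timestamp'])

w = Window.partitionBy('serial_num')
df1 = df.select('serial_num', 'devicetype', 'device_model',
                F.size(F.collect_set('timestamp').over(w)).alias('count'))
df1.show()
# The 1737K7008F rows get count 2 (two distinct timestamps); 58172A0396 gets count 1.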
For older Spark versions, you can use groupby with countDistinct to create a new dataframe holding one count per serial_num, and then join this dataframe back to the original dataframe:
# one row per serial_num with its distinct-timestamp count
df2 = df.groupby('serial_num').agg(F.countDistinct('timestamp').alias('count'))
# attach that count to every original row
df1 = df.join(df2, 'serial_num')
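One note on the trade-off between the two: the window version collects a full set of timestamps per partition, while the aggregate-and-join version only moves one count row per serial_num; when there are few distinct serial_num values, Spark may also be able to broadcast df2 in the join.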
A simple groupBy and count will work.
val data=Array(("58172A0396","","","2003-01-02 17:37:15.0"),
("58172A0396","","","2003-01-02 17:37:15.0"),
("46C5Y00693"," Mac Pro","Mac PC","2018-01-03 17:17:23.0"),
("1737K7008F"," Windows PC","Windows PC","2018-01-05 11:12:31.0"),
("1737K7008F"," Network Device","Unknown","2018-01-05 11:12:31.0"),
("1737K7008F"," Network Device","Unknown","2018-01-05 11:12:31.0"),
("1737K7008F"," Network Device","","2018-01-06 03:12:52.0"),
("1737K7008F"," Windows PC","Windows PC","2018-01-06 03:12:52.0"),
("1737K7008F"," Network Device","Unknown","2018-01-06 03:12:52.0"),
("1665NF01F3"," Network Device","Unknown","2018-01-07 03:42:34.0"))
val rdd = sc.parallelize(data)
val df = rdd.toDF("serial_num","devicetype","device_model","timestamp")
// counts rows per unique (timestamp, serial_num, devicetype, device_model) combination
val df1 = df.groupBy("timestamp","serial_num","devicetype","device_model").count
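For readers following along in PySpark, a rough equivalent of this Scala snippet (assuming the df defined earlier) would be:

df1 = df.groupBy('timestamp', 'serial_num', 'devicetype', 'device_model').count()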