
Counting by distinct sub-ArrayType elements in PySpark

I have the following JSON structure:

{ 
   "stuff": 1, "some_str": "srt", list_of_stuff": [
                  {"element_x":1, "element_y":"22x"}, 
                  {"element_x":3, "element_y":"23x"}
                ]
}, 
{ 
   "stuff": 2, "some_str": "srt2", "list_of_stuff": [
                  {"element_x":1, "element_y":"22x"}, 
                  {"element_x":4, "element_y":"24x"}
                ]
}, 

When I read it into a PySpark DataFrame as json:

import pyspark.sql
import json
from pyspark.sql import functions as F
from pyspark.sql.types import *

schema = StructType([
    StructField("stuff", IntegerType()),
    StructField("some_str", StringType()),
    StructField("list_of_stuff", ArrayType(
        StructType([
            StructField("element_x", IntegerType()),
            StructField("element_y", StringType()),
        ])
    ))
])


df = spark.read.json("hdfs:///path/file.json/*", schema=schema)
df.show()

I get the following:

+--------+---------+-------------------+
| stuff  | some_str|    list_of_stuff  |
+--------+---------+-------------------+
|   1    |   srt   |  [1,22x], [3,23x] |
|   2    |   srt2  |  [1,22x], [4,24x] |
+--------+---------+-------------------+

It seems like PySpark flattens the key names of the ArrayType elements, although I can still see them when I do df.printSchema():

root
|-- stuff: integer (nullable = true)
|-- some_str: string (nullable = true)
|-- list_of_stuff: array (nullable = true)
|    |-- element: struct (containsNull = true)
|    |    |-- element_x: integer (nullable = true)
|    |    |-- element_y: string (nullable = true)
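
As a quick check (a minimal sketch, using only the df defined above), the nested field is still addressable by name, which suggests show() is just rendering the structs compactly rather than discarding the keys:

# Selecting the nested field by name returns one array of
# element_y values per row, so the key names are intact.
df.select('list_of_stuff.element_y').show()
# +----------+
# | element_y|
# +----------+
# |[22x, 23x]|
# |[22x, 24x]|
# +----------+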

Question: I need to count the distinct occurrences of element_y within my DataFrame. So given the example JSON, I would get this output:

22x: 2, 23x: 1, 24x: 1

I am not sure how to get into the ArrayType and count the distinct values of the sub-element element_y. Any help is appreciated.

One way to do this could be using rdd: flatten the array with flatMap, then count:

df.rdd.flatMap(lambda r: [x.element_y for x in r['list_of_stuff']]).countByValue()
# defaultdict(<class 'int'>, {'24x': 1, '22x': 2, '23x': 1})
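
Note that countByValue() returns the counts to the driver as a plain Python defaultdict, so this suits cases where the number of distinct element_y values is small; the DataFrame version below keeps the aggregation distributed.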

Or using the DataFrame API: explode the column first so you can access element_y in each struct, then groupBy element_y; count should give the result you need:

import pyspark.sql.functions as F
(df.select(F.explode(df.list_of_stuff).alias('stuff'))
   .groupBy(F.col('stuff').element_y.alias('key'))
   .count()
   .show())
+---+-----+
|key|count|
+---+-----+
|24x|    1|
|22x|    2|
|23x|    1|
+---+-----+
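
A minimal variation on the same idea (a sketch, relying only on the standard Spark behaviour that selecting a struct field on an array column yields an array of that field): extract element_y before exploding and skip the intermediate struct column:

import pyspark.sql.functions as F

# df.list_of_stuff.element_y extracts an array of just the element_y
# values; explode into one row per value, then group and count.
(df.select(F.explode(df.list_of_stuff.element_y).alias('key'))
   .groupBy('key')
   .count()
   .show())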
