
Pyspark how to count the number of occurrences of a string in each group and print multiple selected columns?

My data set looks like this. Each row represents a car. Each car is located at an Auto Center, and has a Model, Make, and a bunch of other attributes. This is a simplified version of the data frame; extraneous rows and columns have been omitted for clarity.

+===========+========+=======+====+=====+
|Auto Center|Model   |Make   |Year|Color|
+===========+========+=======+====+=====+
|Roseville  |Corolla |Toyota |    |     |
|Roseville  |Prius   |Toyota |    |     |
|Rocklin    |Camry   |Toyota |    |     |
|Rocklin    |Forester|Subaru |    |     |
+===========+========+=======+====+=====+
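
(If you want to run the snippets below without the CSV file, here is a minimal sketch that rebuilds the sample rows above; the explicit schema and the null Year/Color values are my assumptions, since those fields are blank in the table.)

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.getOrCreate()

# The four sample rows from the table above; Year and Color are blank in
# the question, so they are left null here.
schema = StructType([
    StructField("Auto Center", StringType()),
    StructField("Model", StringType()),
    StructField("Make", StringType()),
    StructField("Year", IntegerType()),
    StructField("Color", StringType()),
])
rows = [("Roseville", "Corolla", "Toyota", None, None),
        ("Roseville", "Prius", "Toyota", None, None),
        ("Rocklin", "Camry", "Toyota", None, None),
        ("Rocklin", "Forester", "Subaru", None, None)]
df = spark.createDataFrame(rows, schema)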

What do I want to do? I want to group the data by the Auto Center, display a "list" of the top 5 cars in each Auto Center by quantity, and print their attributes Make, Model, Year, and Color.

After grouping the data by the Auto Center, I want to count the number of occurrences of each Model, or even better of each combination of Make and Model, in each Auto Center, and get a list of the top 5 cars with the most occurrences. Then I want to print multiple columns of each of those cars.

Assume that the Year and the Color are the same for every car having the same Make and Model.

For example, the output should be something like this: a list of the top 5 cars in each Auto Center, ordered by the number of occurrences.

Roseville:
there are 12 red Toyota Prius 2009
there are 8 blue Toyota Camry 2010
 ...

This is what I have so far:

from pyspark.sql import SparkSession

scSpark = SparkSession \
    .builder \
    .appName("Auto Center Big Data") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

data = scSpark.read.csv("autocenters.csv", header=True, inferSchema=True)
data.printSchema()

data.groupby('Auto Center')

It seems that data.groupby() returns a GroupedData object. I have seen that the .agg() function can be applied to it, but that appears to work only for numerical data, such as finding the mean of some numbers, and here I have strings. I want to count the strings by the number of occurrences in each group.

What should I do? Is there a way to apply an aggregate function to multiple columns simultaneously, such as both Make and Model together? If not, that should be fine, considering that there are no cars with the same Model but different Makes.

IIUC, you can do it with the following two steps:

  1. First groupby all columns you want to count the occurrences on:

     df1 = df.groupby('Auto Center', 'Model', 'Make', 'Year', 'Color').count()
  2. Then set up a Window Spec and get the top 5 by row_number(). (Note: depending on how you want to handle ties, you might want to change row_number() to rank() or dense_rank(); see the sketch after these steps.)

     from pyspark.sql import Window
     from pyspark.sql.functions import row_number, desc

     w1 = Window.partitionBy('Auto Center').orderBy(desc('count'))
     df_new = df1.withColumn('rn', row_number().over(w1)).where('rn <= 5').drop('rn')
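
If ties at the cut-off should all survive, here is a minimal sketch with dense_rank() swapped in, using the same df1:

     from pyspark.sql import Window
     from pyspark.sql.functions import dense_rank, desc

     # dense_rank() gives tied counts the same rank, so rows tied on count
     # are all kept; the result can then have more than 5 rows per center.
     w2 = Window.partitionBy('Auto Center').orderBy(desc('count'))
     df_top = df1.withColumn('rk', dense_rank().over(w2)).where('rk <= 5').drop('rk')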

import pyspark

from pyspark import SparkContext
from pyspark.sql import SparkSession
sc = SparkContext('local')
spark = SparkSession(sc)

then create a random dataframe similar to yours

import string
import random

n = 1000
center = ['center_{}'.format(random.choice(string.ascii_letters[:2])) for y in range(n)]
make = ['make_{}'.format(random.choice(string.ascii_letters[:3])) for y in range(n)]
model = ['model_{}'.format(random.choice(string.ascii_letters[:3])) for y in range(n)]
year = [random.choice(range(2018, 2019)) for y in range(n)]  # range(2018, 2019) has one element, so every year is 2018
color = [random.choice(['red', 'blue', 'black']) for y in range(n)]

df = spark.createDataFrame(zip(center, make, model, year, color), schema=['center', 'make', 'model', 'year', 'color'])

df.head()

Row(center='center_b', make='make_a', model='model_c', year=2018, color='black')

group by center, make, model, year, color

df_groupby = (df
              .groupby('center', 'make', 'model', 'year', 'color')
              .count()
             )
df_groupby.sort(df_groupby['center'],
                df_groupby['count'].desc()).show()

+--------+------+-------+----+-----+-----+
|  center|  make|  model|year|color|count|
+--------+------+-------+----+-----+-----+
|center_a|make_c|model_b|2018| blue|   33|
|center_a|make_a|model_c|2018| blue|   24|
|center_a|make_a|model_a|2018|  red|   23|
|center_a|make_b|model_c|2018| blue|   21|
|center_a|make_b|model_b|2018|black|   21|
|center_a|make_c|model_a|2018|black|   21|
|center_a|make_c|model_c|2018| blue|   21|
|center_a|make_a|model_c|2018|black|   20|
|center_a|make_a|model_b|2018|  red|   20|
|center_a|make_a|model_b|2018| blue|   19|
|center_a|make_c|model_c|2018|black|   18|
|center_a|make_a|model_c|2018|  red|   18|
|center_a|make_c|model_b|2018|  red|   18|
|center_a|make_b|model_b|2018|  red|   18|
|center_a|make_c|model_a|2018|  red|   18|
|center_a|make_a|model_b|2018|black|   18|
|center_a|make_b|model_c|2018|black|   18|
|center_a|make_a|model_a|2018| blue|   17|
|center_a|make_c|model_a|2018| blue|   17|
|center_a|make_c|model_b|2018|black|   15|
+--------+------+-------+----+-----+-----+
only showing top 20 rows

using a Window, keep only the top 5 make/model/color/year per center

from pyspark.sql import Window
from pyspark.sql.functions import row_number, desc

w = Window.partitionBy('center').orderBy(desc('count'))
df_groupby2 = df_groupby.withColumn('rn', row_number().over(w)).where('rn <= 5').drop('rn')

df_groupby2.sort(df_groupby2['center'],
                 df_groupby2['count'].desc()
                ).show()

+--------+------+-------+----+-----+-----+
|  center|  make|  model|year|color|count|
+--------+------+-------+----+-----+-----+
|center_a|make_c|model_b|2018| blue|   33|
|center_a|make_a|model_c|2018| blue|   24|
|center_a|make_a|model_a|2018|  red|   23|
|center_a|make_b|model_b|2018|black|   21|
|center_a|make_c|model_a|2018|black|   21|
|center_b|make_a|model_a|2018|  red|   31|
|center_b|make_c|model_c|2018|black|   24|
|center_b|make_b|model_a|2018| blue|   24|
|center_b|make_b|model_b|2018|black|   23|
|center_b|make_c|model_c|2018| blue|   23|
+--------+------+-------+----+-----+-----+

now create and print your text

from pyspark.sql import functions as F

df_final = (df_groupby2
            .withColumn('text', F.concat(F.lit("there are "),
                                         df_groupby2['count'],
                                         F.lit(" "),
                                         df_groupby2['color'],
                                         F.lit(" "),
                                         df_groupby2['make'],
                                         F.lit(" "),
                                         df_groupby2['model'],
                                         F.lit(" "),
                                         df_groupby2['year'])
                       )
            .sort(df_groupby2['center'],
                 df_groupby2['count'].desc()
                )
           )
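
For what it's worth, format_string can build the same text column in one call instead of interleaving F.lit() separators; a sketch using the same column names, producing a df_final2 equivalent to df_final:

# format_string works like printf; %d accepts the integer count and year columns
df_final2 = (df_groupby2
             .withColumn('text', F.format_string(
                 "there are %d %s %s %s %d",
                 df_groupby2['count'], df_groupby2['color'],
                 df_groupby2['make'], df_groupby2['model'],
                 df_groupby2['year']))
             .sort(df_groupby2['center'], df_groupby2['count'].desc()))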

for row in df_final.select('center').distinct().sort('center').collect():
    current_center = row['center']
    print(current_center, ":")
    for car in df_final.filter(df_final['center'] == current_center).collect():
        print(car['text'])

center_a :
there are 33 blue make_c model_b 2018
there are 24 blue make_a model_c 2018
there are 23 red make_a model_a 2018
there are 21 black make_b model_b 2018
there are 21 black make_c model_a 2018
center_b :
there are 31 red make_a model_a 2018
there are 24 blue make_b model_a 2018
there are 24 black make_c model_c 2018
there are 23 black make_b model_b 2018
there are 23 blue make_c model_c 2018
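
As a side note, the loop above runs a separate filter().collect() job for each center. Since df_final is already sorted by center, a single collect() grouped on the driver gives the same printout; a sketch, assuming the same df_final:

from itertools import groupby

# One collect() instead of one Spark job per center; itertools.groupby
# relies on the rows arriving already sorted by center, which they are.
for center, rows in groupby(df_final.select('center', 'text').collect(),
                            key=lambda r: r['center']):
    print(center, ":")
    for r in rows:
        print(r['text'])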
