
Find modal value for each id in Pyspark

I have a pyspark dataframe of ~1.7 billion rows with the schema:

INPUT SCHEMA
id  
ip  
datetime

and I am trying to find the modal ip for each id

I currently have a function where I make a separate table of

INT TABLE
id
ip
number_of_records

and then filter that for the modal ip
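Something like this sketch (illustrative only; `int_table`, the window, and the column names are my paraphrase rather than the exact code):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Build the intermediate count table
int_table = df.groupBy('id', 'ip').agg(F.count('*').alias('number_of_records'))

# Rank IPs within each id by count and keep the top one
w = Window.partitionBy('id').orderBy(F.desc('number_of_records'))
modal = (
    int_table
    .withColumn('rank', F.row_number().over(w))
    .filter(F.col('rank') == 1)
    .select('id', F.col('ip').alias('modal_ip'))
)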

This seems incredibly slow and bulky; what is a more efficient way to get the modal ip for each id?

Proposed Output Schema
id
modal_ip

Thanks all!

Expanding on my comments, here's a solution which demonstrates how you can technically achieve this in two passes of the data: one to count, and one to reduce and find the (possibly multiple) modes. I've implemented the second part with the RDD API; translating it into the DataFrame API is left to the reader ;) (tbh I don't know if it's even possible to do custom aggregations with multiple output rows like this):

from pyspark.sql import types

# `spark` below is assumed to be an active SparkSession
# (e.g. the one provided by the pyspark shell)

# Example data
data = [
    (0, '12.2.25.68'),
    (0, '12.2.25.68'),
    (0, '12.2.25.43'),
    (1, '62.251.0.149'),  # This ID has two modes
    (1, '62.251.0.140'),
]

schema = types.StructType([
    types.StructField('id', types.IntegerType()),
    types.StructField('ip', types.StringType()),
])

df = spark.createDataFrame(data, schema)

# Count id/ip pairs
df = df.groupBy('id', 'ip').count()

def find_modes(a, b):
    """
    Reducing function to find modes (can return multiple). 

    a and b are lists of Row
    """
    if a[0]['count'] > b[0]['count']:
        return a
    if a[0]['count'] < b[0]['count']:
        return b
    return a + b

result = (
    df.rdd
    .map(lambda row: (row['id'], [row]))
    .reduceByKey(find_modes)
    .collectAsMap()
)

Result:

{0: [Row(id=0, ip='12.2.25.68', count=2)],
 1: [Row(id=1, ip='62.251.0.149', count=1),
     Row(id=1, ip='62.251.0.140', count=1)]}

Small caveat to this approach: because I aggregate repeated modes in memory, if you have many different IPs with the same count for a single ID, you do risk OOM issues. For this particular application, I'd say that's very unlikely (e.g. a single user probably won't have 1 million different IPs, each with exactly 1 event).
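If that ever did become a concern, one mitigation (a sketch; `MAX_TIES` is an arbitrary cap I've introduced, not anything from the original) would be to bound the tie list inside the reducer:

MAX_TIES = 1000  # hypothetical cap on how many tied modes to keep per id

def find_modes_capped(a, b):
    """Like find_modes, but truncates ties so a single id can't blow up memory."""
    if a[0]['count'] > b[0]['count']:
        return a
    if a[0]['count'] < b[0]['count']:
        return b
    return (a + b)[:MAX_TIES]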

But I tend to agree with @absolutelydevastated: the simplest solution is probably the one you already have, even if it takes an extra pass over the data. You should probably avoid doing a sort/rank, though, and instead just seek the max count in the window if possible, along the lines of the sketch below.
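A minimal sketch of that window-max variant (assuming `df` is the counted DataFrame from the groupBy above, with columns id, ip, count; on ties it keeps all modal IPs, like the RDD version):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Attach each id's maximum count to every row, then keep the rows
# that attain it -- no full sort within the window required
w = Window.partitionBy('id')
modal = (
    df
    .withColumn('max_count', F.max('count').over(w))
    .filter(F.col('count') == F.col('max_count'))
    .select('id', F.col('ip').alias('modal_ip'))
)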
