
Choosing random items from a Spark GroupedData Object

I'm new to using Spark in Python and have been unable to solve this problem: after running groupBy on a pyspark.sql.dataframe.DataFrame

df = sqlsc.read.json("data.json")
df.groupBy('teamId')

how can you choose N random samples from each resulting group (grouped by teamId) without replacement?

I'm basically trying to choose N random users from each team, maybe using groupBy is wrong to start with?

Well, it is kind of wrong. GroupedData is not really designed for data access. It just describes grouping criteria and provides aggregation methods. See my answer to Using groupBy in Spark and getting back to a DataFrame for more details.
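For instance, a minimal sketch (my own illustration, using the df from the question) of what GroupedData is actually for, namely aggregations rather than row access:

from pyspark.sql import functions as F

# GroupedData only exposes aggregation methods (agg, count, max, ...),
# not access to the underlying rows
team_sizes = df.groupBy("teamId").agg(F.count("*").alias("n_users"))
team_sizes.show()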

Another problem with this idea is selecting N random samples. It is a task which is really hard to achieve in parallel without physically grouping the data, and that is not something that happens when you call groupBy on a DataFrame.

There are at least two ways to handle this:

  • convert to RDD, groupBy and perform local sampling

    import random

    n = 3

    def sample(iter, n):
        rs = random.Random()  # We should probably use os.urandom as a seed
        return rs.sample(list(iter), n)

    df = sqlContext.createDataFrame(
        [(x, y, random.random()) for x in (1, 2, 3) for y in "abcdefghi"],
        ("teamId", "x1", "x2"))

    grouped = df.rdd.map(lambda row: (row.teamId, row)).groupByKey()

    sampled = sqlContext.createDataFrame(
        grouped.flatMap(lambda kv: sample(kv[1], n)))

    sampled.show()

    ## +------+---+-------------------+
    ## |teamId| x1|                 x2|
    ## +------+---+-------------------+
    ## |     1|  g|   0.81921738561455|
    ## |     1|  f| 0.8563875814036598|
    ## |     1|  a| 0.9010425238735935|
    ## |     2|  c| 0.3864428179837973|
    ## |     2|  g|0.06233470405822805|
    ## |     2|  d|0.37620872770129155|
    ## |     3|  f| 0.7518901502732027|
    ## |     3|  e| 0.5142305439671874|
    ## |     3|  d| 0.6250620479303716|
    ## +------+---+-------------------+
  • use window functions

    from pyspark.sql import Window
    from pyspark.sql.functions import col, rand, rowNumber

    w = Window.partitionBy(col("teamId")).orderBy(col("rnd_"))

    sampled = (df
        .withColumn("rnd_", rand())              # Add random numbers column
        .withColumn("rn_", rowNumber().over(w))  # Add rowNumber over window
        .where(col("rn_") <= n)                  # Take n observations
        .drop("rn_")                             # drop helper columns
        .drop("rnd_"))

    sampled.show()

    ## +------+---+--------------------+
    ## |teamId| x1|                  x2|
    ## +------+---+--------------------+
    ## |     1|  f|  0.8563875814036598|
    ## |     1|  g|    0.81921738561455|
    ## |     1|  i|  0.8173912535268248|
    ## |     2|  h| 0.10862995810038856|
    ## |     2|  c|  0.3864428179837973|
    ## |     2|  a|  0.6695356657072442|
    ## |     3|  b|0.012329360826023095|
    ## |     3|  a|  0.6450777858109182|
    ## |     3|  e|  0.5142305439671874|
    ## +------+---+--------------------+

but I am afraid both will be rather expensive. If the size of the individual groups is balanced and relatively large, I would simply use DataFrame.randomSplit.
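As a rough illustration of that suggestion (a sketch of my own, not from the original answer): randomSplit samples a fraction of the whole DataFrame, which for large, balanced groups translates into roughly the same fraction of each team, though not an exact per-group count.

# Keep roughly 10% of the rows; with large, balanced groups this is
# approximately 10% of every team as well
sample_df, rest_df = df.randomSplit([0.1, 0.9], seed=42)

sample_df.groupBy("teamId").count().show()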

If the number of groups is relatively small, it is possible to try something else:

from pyspark.sql.functions import count, udf
from pyspark.sql.types import BooleanType
from operator import truediv

counts = (df
    .groupBy(col("teamId"))
    .agg(count("*").alias("n"))
    .rdd.map(lambda r: (r.teamId, r.n))
    .collectAsMap()) 

# This defines fraction of observations from a group which should
# be taken to get n values 
counts_bd = sc.broadcast({k: truediv(n, v) for (k, v) in counts.items()})

to_take = udf(lambda k, rnd: rnd <= counts_bd.value.get(k), BooleanType())

sampled = (df
    .withColumn("rnd_", rand())
    .where(to_take(col("teamId"), col("rnd_")))
    .drop("rnd_"))

sampled.show()

## +------+---+--------------------+
## |teamId| x1|                  x2|
## +------+---+--------------------+
## |     1|  d| 0.14815204548854788|
## |     1|  f|  0.8563875814036598|
## |     1|  g|    0.81921738561455|
## |     2|  a|  0.6695356657072442|
## |     2|  d| 0.37620872770129155|
## |     2|  g| 0.06233470405822805|
## |     3|  b|0.012329360826023095|
## |     3|  h|  0.9022527556458557|
## +------+---+--------------------+

In Spark 1.5+ you can replace the udf with a call to the sampleBy method:

df.sampleBy("teamId", counts_bd.value)

It won't give you an exact number of observations, but it should be good enough most of the time, as long as the number of observations per group is large enough to get proper samples. You can also use sampleByKey on an RDD in a similar way.
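For example, a sketch of the sampleByKey variant (assuming the counts_bd fractions computed above; the variable names are mine):

fractions = counts_bd.value  # per-group sampling fractions computed above

sampled_rdd = (df.rdd
    .map(lambda row: (row.teamId, row))  # key each row by teamId
    .sampleByKey(withReplacement=False, fractions=fractions, seed=42)
    .values())

sampled_df = sqlContext.createDataFrame(sampled_rdd)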

I found this approach more DataFrame-oriented, rather than going the RDD way.

You can use a window function to create a ranking within each group, where the ranking can be random to suit your case. Then you can filter based on the number of samples (N) you want for each group.

from pyspark.sql import Window
from pyspark.sql import functions as F

window_1 = Window.partitionBy(data['teamId']).orderBy(F.rand())
data_1 = data.select('*', F.rank().over(window_1).alias('rank')).filter(F.col('rank') <= N).drop('rank')
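If you want exactly N rows per group even in the presence of ties (not really an issue when ordering by a continuous F.rand(), but worth noting), a small variant of the same idea, added here for illustration, swaps rank for row_number:

data_1 = (data
    .select('*', F.row_number().over(window_1).alias('rn'))
    .filter(F.col('rn') <= N)
    .drop('rn'))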

Here's an alternative using the pandas DataFrame.sample method. This uses the Spark applyInPandas method to distribute the groups, available from Spark 3.0.0. This allows you to select an exact number of rows per group.

I've added args and kwargs to the function so you can access the other arguments of DataFrame.sample.

def sample_n_per_group(n, *args, **kwargs):
    def sample_per_group(pdf):
        return pdf.sample(n, *args, **kwargs)
    return sample_per_group

df = spark.createDataFrame(
    [
        (1, 1.0), 
        (1, 2.0), 
        (2, 3.0), 
        (2, 5.0), 
        (2, 10.0)
    ],
    ("id", "v")
)

(df.groupBy("id")
   .applyInPandas(
        sample_n_per_group(2, random_state=2), 
        schema=df.schema
   )
)
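One caveat, added here as a note based on pandas' DataFrame.sample semantics rather than on the original answer: if any group has fewer than n rows, pdf.sample(n) raises a ValueError unless replace=True is passed. A defensive variant could cap n at the group size:

def sample_n_per_group_capped(n, *args, **kwargs):
    # hypothetical variant: never request more rows than a group contains
    def sample_per_group(pdf):
        return pdf.sample(min(n, len(pdf)), *args, **kwargs)
    return sample_per_group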

Be aware of the limitations for very large groups, from the documentation:

This function requires a full shuffle. All the data of a group will be loaded into memory, so the user should be aware of the potential OOM risk if data is skewed and certain groups are too large to fit in memory.

See also here: How take a random row from a PySpark DataFrame?
