How to do stratified sampling on two columns of a PySpark DataFrame?
I want to sample the data set below based on IDs and the comm_type they fall into. The same ID can have multiple comm_types, and the data set is huge, so I want to do further analysis on a smaller sample of 1 million unique IDs.

I see there is a sampleBy(col, fractions, seed=None) method to perform this, but I need to group the data by comm_type and then sample by IDs, and I am struggling to figure out the best way to do it. There are other fields in the data set as well, but the sampling needs to happen on these two columns.
The fractions for comm_type should match the original data in the DataFrame: E = 0.5, M = 0.4, P = 0.1. There are around 19 M unique IDs in the original DataFrame, and I only need to sample 1 M of them while keeping the comm_type fractions consistent with the original data set.
Any help or direction will be appreciated.
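To keep the sample's comm_type mix equal to the original mix, every stratum can be sampled with the same fraction (1 M out of 19 M unique IDs is about 0.0526); in Spark that maps to taking the distinct (ID, comm_type) pairs and calling sampleBy with one shared fraction, then joining back on ID. The following is a minimal plain-Python sketch of that logic on toy data; the column names and the 5% fraction are illustrative, not from the original post:

```python
import random

random.seed(42)

# Toy stand-in for the distinct (ID, comm_type) pairs; in Spark this would be
# something like df.select("ID", "comm_type").distinct() (column names assumed).
pairs = [(i, random.choices("EMP", weights=[5, 4, 1])[0]) for i in range(10_000)]

# Sampling every stratum with the SAME fraction preserves the comm_type
# proportions in expectation. With sampleBy that would be
# fractions={"E": f, "M": f, "P": f} for a single f (here f = 0.05).
fraction = 0.05
sample = [(i, t) for (i, t) in pairs if random.random() < fraction]

def proportions(rows):
    counts = {}
    for _, t in rows:
        counts[t] = counts.get(t, 0) + 1
    return {t: c / len(rows) for t, c in counts.items()}

print(proportions(pairs))   # roughly E: 0.5, M: 0.4, P: 0.1
print(proportions(sample))  # close to the same proportions
```

The sampled IDs would then be joined back to the full DataFrame to pull along the other fields.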
You can use scikit-learn's train_test_split function. The function accepts multiple columns for stratification.
sklearn.model_selection.train_test_split(*arrays, test_size=None,
    train_size=None, random_state=None, shuffle=True,
    stratify=df[columns_to_stratify])
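Assuming the distinct (ID, comm_type) pairs are small enough to collect out of Spark into pandas (1 M rows is fine), a minimal sketch of this approach could look as follows; the column names and the 10% split size are illustrative:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the distinct (ID, comm_type) pairs collected from Spark;
# the column names are assumptions, not from the original data set.
df = pd.DataFrame({
    "ID": range(1000),
    "comm_type": ["E"] * 500 + ["M"] * 400 + ["P"] * 100,
})

# Ask for a 10% sample; stratify keeps the comm_type proportions intact.
sample, _rest = train_test_split(
    df, train_size=0.1, random_state=0, stratify=df["comm_type"]
)

# The sample's comm_type mix stays close to E=0.5, M=0.4, P=0.1.
print(sample["comm_type"].value_counts(normalize=True))
```

Passing a DataFrame of several columns to stratify (as in the signature above) treats each distinct combination of values as a stratum. The sampled IDs can then be joined back to the full Spark DataFrame to recover the remaining fields.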