How to do stratified sampling on two columns of a PySpark DataFrame?
I want to sample the data set below based on IDs and the comm_type they fall into. The same ID can have multiple comm_types, and the data set is huge, so I want to do further analysis on a smaller sample of 1 million unique IDs.

I see there is a sampleBy(col, fractions, seed=None) method to perform this, but I need to group the data by comm_type and then sample by IDs, and I am struggling to figure out the best way to do it. There are other fields in the data set as well, but the sampling needs to happen on these two columns.
The fractions for comm_type should match the original data in the DataFrame: E = 0.5, M = 0.4, P = 0.1. There are around 19 M unique IDs in the original DataFrame, and I only need to sample 1 M of them while keeping the comm_type fractions consistent with the original data set.
Any help or direction will be appreciated.
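To keep the sample's comm_type mix equal to the original mix, every stratum can be sampled with the same fraction (1 M out of 19 M unique IDs is about 0.0526); in Spark that maps to taking the distinct (ID, comm_type) pairs and calling sampleBy with one shared fraction, then joining back on ID. The following is a minimal plain-Python sketch of that logic on toy data; the column names and the 5% fraction are illustrative, not from the original post:

```python
import random

random.seed(42)

# Toy stand-in for the distinct (ID, comm_type) pairs; in Spark this would be
# something like df.select("ID", "comm_type").distinct() (column names assumed).
pairs = [(i, random.choices("EMP", weights=[5, 4, 1])[0]) for i in range(10_000)]

# Sampling every stratum with the SAME fraction preserves the comm_type
# proportions in expectation. With sampleBy that would be
# fractions={"E": f, "M": f, "P": f} for a single f (here f = 0.05).
fraction = 0.05
sample = [(i, t) for (i, t) in pairs if random.random() < fraction]

def proportions(rows):
    counts = {}
    for _, t in rows:
        counts[t] = counts.get(t, 0) + 1
    return {t: c / len(rows) for t, c in counts.items()}

print(proportions(pairs))   # roughly E: 0.5, M: 0.4, P: 0.1
print(proportions(sample))  # close to the same proportions
```

The sampled IDs would then be joined back to the full DataFrame to pull along the other fields.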
You can use scikit-learn's train_test_split function. The function accepts multiple columns for stratification.
sklearn.model_selection.train_test_split(*arrays, test_size=None,
    train_size=None, random_state=None, shuffle=True,
    stratify=df[columns_to_stratify])
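Assuming the distinct (ID, comm_type) pairs are small enough to collect out of Spark into pandas (1 M rows is fine), a minimal sketch of this approach could look as follows; the column names and the 10% split size are illustrative:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the distinct (ID, comm_type) pairs collected from Spark;
# the column names are assumptions, not from the original data set.
df = pd.DataFrame({
    "ID": range(1000),
    "comm_type": ["E"] * 500 + ["M"] * 400 + ["P"] * 100,
})

# Ask for a 10% sample; stratify keeps the comm_type proportions intact.
sample, _rest = train_test_split(
    df, train_size=0.1, random_state=0, stratify=df["comm_type"]
)

# The sample's comm_type mix stays close to E=0.5, M=0.4, P=0.1.
print(sample["comm_type"].value_counts(normalize=True))
```

Passing a DataFrame of several columns to stratify (as in the signature above) treats each distinct combination of values as a stratum. The sampled IDs can then be joined back to the full Spark DataFrame to recover the remaining fields.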