
How to do stratified sampling on two columns in PySpark Dataframe?

I want to sample the data set below based on IDs and the comm_type they fall into. The same ID can have multiple comm_types. The data set is huge, so I want to do further analysis on a smaller sample of 1 million unique IDs. I see there is a sampleBy(col, fractions, seed=None) method for this, but I need to group the data by comm_type and then sample by ID, and I am struggling to figure out the best way to do it. There are other fields in the data set as well, but the sampling needs to happen on these two columns.

The fractions for comm_type should match the original data in the DF (E = 0.5, M = 0.4, P = 0.1). The original DF has around 19 M unique IDs; I only need to sample 1 M of them, keeping the comm_type fractions consistent with the original data set.

[Image: the data set referred to above]

I would appreciate any help or direction.

You can use scikit-learn's train_test_split function. It accepts multiple columns for stratification.

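from sklearn.model_selection import train_test_split

# pdf: a pandas DataFrame (train_test_split works on in-memory data, not a Spark DataFrame)
# columns_to_stratify: list of the column names to stratify on
train, test = train_test_split(pdf, test_size=None, train_size=None, random_state=None,
                               shuffle=True, stratify=pdf[columns_to_stratify])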
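Since train_test_split works on in-memory arrays, for a data set of this size it may be more practical to stay in Spark. Below is a minimal sketch of one way to do that, not a definitive implementation. It assumes the columns are named ID and comm_type and that comm_type holds strings, as in the question. The helper names uniform_fractions and sample_ids_stratified are made up for illustration. The idea is to reduce the data to one row per ID, label each ID with the set of comm_types it has (for example "E|M"), pass the same fraction for every label to sampleBy, and then keep all original rows of the sampled IDs.

```python
def uniform_fractions(strata, target_ids, total_ids):
    """Give every stratum the same sampling fraction.

    With an identical fraction per stratum, the strata keep their original
    proportions in expectation.
    """
    if total_ids <= 0:
        raise ValueError("total_ids must be positive")
    fraction = min(1.0, target_ids / total_ids)
    return {stratum: fraction for stratum in strata}


def sample_ids_stratified(df, target_ids, id_col="ID", type_col="comm_type", seed=42):
    """Return all rows of df that belong to a stratified sample of IDs."""
    # Imported here so the helper above stays usable without Spark installed.
    from pyspark.sql import functions as F

    # One row per ID; the stratum is the sorted set of its comm_types, e.g. "E|M".
    per_id = df.groupBy(id_col).agg(
        F.concat_ws("|", F.array_sort(F.collect_set(type_col))).alias("stratum")
    ).cache()

    total_ids = per_id.count()
    strata = [row["stratum"] for row in per_id.select("stratum").distinct().collect()]
    fractions = uniform_fractions(strata, target_ids, total_ids)

    # sampleBy decides row by row, so the number of sampled IDs is approximate.
    sampled_ids = per_id.sampleBy("stratum", fractions=fractions, seed=seed).select(id_col)

    # Keep every original row (with all its columns) whose ID was sampled.
    return df.join(sampled_ids, on=id_col, how="left_semi")
```

Usage would look like sample = sample_ids_stratified(df, target_ids=1_000_000). With about 19 M unique IDs this gives a fraction of roughly 1/19 per stratum, so the result holds about 1 M IDs rather than exactly 1 M. To over- or under-sample particular comm_type combinations, change the corresponding entries in the fractions dict before calling sampleBy.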
