
Select rows from a dataframe based on the unique values of another column?

One of the columns of my dataframe has the values shown below:

air_voice_no_null.loc[:,"host_has_profile_pic"].value_counts(normalize = True)*100

1.0    99.694276
0.0     0.305724
Name: host_has_profile_pic, dtype: float64

That is a split of roughly 99.7% to 0.3% between the two unique values in that column.

I now want to build a new dataframe from it in which 60% of the rows have 1.0 and 40% have 0.0, keeping each selected row complete (the result will of course have fewer rows).

I tried splitting it using the stratify parameter of the train_test_split function from sklearn.model_selection, as shown below, but had no luck getting dataframes with the desired proportions of each unique value.

from sklearn.model_selection import train_test_split

feature_columns = ['log_price', 'accommodates', 'bathrooms', 'host_response_rate',
                   'number_of_reviews', 'review_scores_rating', 'bedrooms', 'beds',
                   'cleaning_fee', 'instant_bookable']
target = air_voice_no_null.loc[:, "host_has_profile_pic"]

profile_train_x, profile_test_x, profile_train_y, profile_test_y = train_test_split(
    air_voice_no_null.loc[:, feature_columns],
    target,
    random_state=42,
    stratify=target)

This is what the code above produced; the total number of rows is unchanged.

print(profile_train_x.shape)
print(profile_test_x.shape)
print(profile_train_y.shape)
print(profile_test_y.shape)

(55442, 10)
(18481, 10)
(55442,)
(18481,)

How do I select a subset of my dataset with fewer rows while keeping the proportions I want for each class of the host_has_profile_pic variable?

Link to the complete dataset: https://www.kaggle.com/stevezhenghp/airbnb-price-prediction

Consider the following approach, which samples each unique value separately and concatenates the results:

import pandas as pd

# create some data
df = pd.DataFrame({'a': [0] * 10 + [1] * 90})

print('original proportion:')
print(df['a'].value_counts(normalize=True))

# take samples for every unique value separately
df_new = pd.concat([
    df[df['a'] == 0].sample(frac=.4),
    df[df['a'] == 1].sample(frac=.07)])

print('\nsample proportion:')
print(df_new['a'].value_counts(normalize=True))

Output:

original proportion:
1    0.9
0    0.1
Name: a, dtype: float64

sample proportion:
1    0.6
0    0.4
Name: a, dtype: float64
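The fractions .4 and .07 above were picked by hand for the toy data. For a real column they could be derived from the class counts instead. The following is only a sketch under a few assumptions: the helper name sample_to_proportions and its targets argument are made up for illustration, the target shares are assumed to sum to 1, and every class named in targets is assumed to be present in the column. It returns the largest subset that can reach the requested shares, so the rarest class limits how many rows come back.

```python
import pandas as pd


def sample_to_proportions(df, column, targets, random_state=None):
    """Hypothetical helper: largest subset of df whose `column` has the shares in `targets`.

    `targets` maps each class value to its desired share (assumed to sum to 1).
    """
    counts = df[column].value_counts()
    # The class that runs out first caps the total size of the subset.
    total = min(counts[value] / share for value, share in targets.items())
    parts = [
        df[df[column] == value].sample(
            n=int(round(total * share)), random_state=random_state)
        for value, share in targets.items()
    ]
    return pd.concat(parts)


# Toy data standing in for the real dataframe (not the Kaggle dataset).
df = pd.DataFrame({'host_has_profile_pic': [0.0] * 10 + [1.0] * 90,
                   'x': range(100)})

subset = sample_to_proportions(
    df, 'host_has_profile_pic', {1.0: 0.6, 0.0: 0.4}, random_state=42)

print(subset.shape)
print(subset['host_has_profile_pic'].value_counts(normalize=True))
```

On the toy data this keeps all 10 rows with 0.0 and draws 15 rows with 1.0, giving 25 rows in total. Because only a few hundred rows in the question's data have 0.0, a 60/40 subset of it will be far smaller than the original dataframe.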
