One of the columns of my dataframe has the values shown below:
air_voice_no_null.loc[:,"host_has_profile_pic"].value_counts(normalize = True)*100
1.0 99.694276
0.0 0.305724
Name: host_has_profile_pic, dtype: float64
That is a ratio of roughly 99.7:0.3 between the two unique values in that column.
I now want to make a new dataframe from it in which 60% of the rows have 1.0 and 40% have 0.0 in that column, keeping all the other columns (and, of course, ending up with fewer rows).
I've tried splitting it using the stratify parameter of the train_test_split function from sklearn.model_selection, as shown below, but with no luck getting dataframes with the desired proportions of each unique value.
from sklearn.model_selection import train_test_split
profile_train_x, profile_test_x, profile_train_y, profile_test_y = train_test_split(air_voice_no_null.loc[:,['log_price', 'accommodates', 'bathrooms','host_response_rate', 'number_of_reviews', 'review_scores_rating','bedrooms', 'beds', 'cleaning_fee', 'instant_bookable']],
air_voice_no_null.loc[:,"host_has_profile_pic"],
random_state=42, stratify=air_voice_no_null.loc[:,"host_has_profile_pic"])
This is what the above code produced; the total number of rows across the train and test sets is unchanged:
print(profile_train_x.shape)
print(profile_test_x.shape)
print(profile_train_y.shape)
print(profile_test_y.shape)
(55442, 10)
(18481, 10)
(55442,)
(18481,)
How do I select a subset of my dataset with a decreased number of rows, while maintaining the desired proportions of each class of the host_has_profile_pic variable?
Link to the complete dataset: https://www.kaggle.com/stevezhenghp/airbnb-price-prediction
Consider the following approach: sample each unique value separately, with its own fraction, and concatenate the results:
import pandas as pd
# create some data
df = pd.DataFrame({'a': [0] * 10 + [1] * 90})
print('original proportion:')
print(df['a'].value_counts(normalize=True))
# take samples for every unique value separately
df_new = pd.concat([
df[df['a'] == 0].sample(frac=.4),
df[df['a'] == 1].sample(frac=.07)])
print('\nsample proportion:')
print(df_new['a'].value_counts(normalize=True))
Output:
original proportion:
1 0.9
0 0.1
Name: a, dtype: float64
sample proportion:
1 0.6
0 0.4
Name: a, dtype: float64
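The fractions .4 and .07 above were picked by hand for this toy data. If it helps, here is a minimal sketch of how the per-class sample sizes could be derived from the target proportions instead. The helper name sample_with_proportions and its arguments are made up for illustration and are not part of pandas; only value_counts, sample and concat are real pandas calls.

```python
import pandas as pd

def sample_with_proportions(df, column, proportions, random_state=None):
    """Downsample df so that `column` follows the target `proportions`.

    Hypothetical helper: `proportions` maps each class value to its
    desired share, e.g. {1: 0.6, 0: 0.4}.
    """
    counts = df[column].value_counts()
    # Largest total size for which every class still has enough rows
    total = min(int(counts[value] / share) for value, share in proportions.items())
    # Sample each class separately (without replacement), then stack the parts
    parts = [
        df[df[column] == value].sample(n=round(total * share), random_state=random_state)
        for value, share in proportions.items()
    ]
    return pd.concat(parts)

# same toy data as above
df = pd.DataFrame({'a': [0] * 10 + [1] * 90})
df_new = sample_with_proportions(df, 'a', {1: 0.6, 0: 0.4}, random_state=42)
print(df_new['a'].value_counts(normalize=True))
```

Because the sampling is done without replacement, the rarest class limits how many rows the result can have. On the question's data that would be the 0.0 class of host_has_profile_pic, so the resulting dataframe will be much smaller than the original.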