
Selecting rows based on certain column values returns empty dataframe

I want to select rows from a dataframe based on different values of a certain column variable and make histograms.

import numpy as np
import pandas as pd
import csv
import matplotlib.pyplot as plt

df_train=pd.read_csv(r'C:\users\visha\downloads\1994_census\adult.data')
df_train.columns = ["age", "workclass", "fnlwgt", "education",
"educationnum", "maritalstatus", "occupation",
"relationship", "race", "sex", "capitalgain",
"capitalloss", "hoursperweek", "nativecountry",
"incomelevel"]

df_train.dropna(how='any')
df_train.loc[(df_train!=0).any(axis=1)]
#df_train.incomelevel = pd.to_numeric(df_train.incomelevel, errors='coerce').fillna(0).astype('Int64')
df_train.drop(columns='fnlwgt', inplace = True)

#df_test=pd.read_csv(r'C:\users\visha\downloads\1994_census\adult.test')

#df_train.boxplot(column = 'age', by = 'incomelevel', grid = False)

df_train.loc[df_train['incomelevel'] == '<=50K']
#df_train.loc[df_train['incomelevel'] == '>50K']

Output:

Empty DataFrame
Columns: [age, workclass, fnlwgt, education, educationnum, maritalstatus, occupation, relationship, race, sex, capitalgain, capitalloss, hoursperweek, nativecountry, incomelevel]
Index: []

As the code above shows, I'm trying to select the rows whose income level is '<=50K'. The 'incomelevel' column has the object dtype. When I print the result, I only get the column names and the message that the dataframe is empty. When I run the line as is in a Jupyter notebook, without print, it displays the column headers with no rows under them.

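Read the CSV with skipinitialspace=True. Every value in the file is preceded by a space, so the column holds ' <=50K' rather than '<=50K' and the comparison never matches. The file also has no header row, hence header=None. With both options set, the filter works: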

df = pd.read_csv('adult.data', header=None, skipinitialspace=True)
df.columns = ["age", "workclass", "fnlwgt", "education",
"educationnum", "maritalstatus", "occupation",
"relationship", "race", "sex", "capitalgain",
"capitalloss", "hoursperweek", "nativecountry",
"incomelevel"]
df = df[df['incomelevel']=='<=50K']
print(df.head())

   age         workclass  fnlwgt  education  educationnum       maritalstatus  ...     sex capitalgain capitalloss hoursperweek  nativecountry  incomelevel
0   39         State-gov   77516  Bachelors            13       Never-married  ...    Male        2174           0           40  United-States        <=50K
1   50  Self-emp-not-inc   83311  Bachelors            13  Married-civ-spouse  ...    Male           0           0           13  United-States        <=50K
2   38           Private  215646    HS-grad             9            Divorced  ...    Male           0           0           40  United-States        <=50K
3   53           Private  234721       11th             7  Married-civ-spouse  ...    Male           0           0           40  United-States        <=50K
4   28           Private  338409  Bachelors            13  Married-civ-spouse  ...  Female           0           0           40           Cuba        <=50K
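
To see why the original comparison matched nothing, it can help to look at the raw string with repr. The sketch below is only an illustration: it uses two made-up rows in the same comma-plus-space layout instead of the real adult.data file, and a shortened, hypothetical column list. It also shows an alternative fix for a frame that is already loaded, which is to strip the whitespace with .str.strip().

```python
import io

import pandas as pd

# Two made-up rows in the same ", "-separated layout as adult.data
# (illustrative data and a shortened column list, not the real file).
sample = "39, State-gov, <=50K\n52, Self-emp-inc, >50K\n"
cols = ["age", "workclass", "incomelevel"]

# Default parsing keeps the space after each comma as part of the string.
raw = pd.read_csv(io.StringIO(sample), header=None, names=cols)
print(repr(raw["incomelevel"].iloc[0]))           # ' <=50K' (note the leading space)
print(len(raw[raw["incomelevel"] == "<=50K"]))    # 0, so the selection is empty

# Fix 1: drop the leading spaces while parsing.
clean = pd.read_csv(
    io.StringIO(sample), header=None, names=cols, skipinitialspace=True
)
print(len(clean[clean["incomelevel"] == "<=50K"]))  # 1

# Fix 2: strip whitespace from the text columns of an already-loaded frame.
stripped = raw.copy()
for col in ["workclass", "incomelevel"]:
    stripped[col] = stripped[col].str.strip()
print(len(stripped[stripped["incomelevel"] == "<=50K"]))  # 1
```

Either fix gives the same strings. skipinitialspace=True is the shorter option when the file can simply be read again.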

