Selecting rows based on certain column values returns empty dataframe

Question

I want to select rows from a dataframe based on different values of a certain column variable and make histograms.

import numpy as np
import pandas as pd
import csv
import matplotlib.pyplot as plt

df_train=pd.read_csv(r'C:\users\visha\downloads\1994_census\adult.data')
df_train.columns = ["age", "workclass", "fnlwgt", "education",
"educationnum", "maritalstatus", "occupation",
"relationship", "race", "sex", "capitalgain",
"capitalloss", "hoursperweek", "nativecountry",
"incomelevel"]

df_train.dropna(how='any')
df_train.loc[(df_train!=0).any(axis=1)]
#df_train.incomelevel = pd.to_numeric(df_train.incomelevel, errors = 
'coerce').fillna(0).astype('Int64')
df_train.drop(columns='fnlwgt', inplace = True)

#df_test=pd.read_csv(r'C:\users\visha\downloads\1994_census\adult.test')

#df_train.boxplot(column = 'age', by = 'incomelevel', grid = False)

df_train.loc[df_train['incomelevel'] == '<=50K']
#df_train.loc[df_train['incomelevel'] == '>50K']

Output: Empty DataFrame Columns: [age, workclass, fnlwgt, education, educationnum, maritalstatus, occupation, relationship, race, sex, capitalgain, capitalloss, hoursperweek, nativecountry, incomelevel] Index: []

From the above lines you can derive that I'm trying to select rows that have income level of '<=50K'. The 'incomelevel' column is of object datatype. But when I try to print it, it just returns all the column names and mentions the dataframe as 'empty'. Or when I run it as is in jupyter notebook without the print function, it just displays the dataframe with all the column names, except nothing under those columns.

Answer 1

You should call the csv with skipinitialspace=True because there are spaces in the front of each value, then it works:

df = pd.read_csv('adult.data', header=None, skipinitialspace=True)
df.columns = ["age", "workclass", "fnlwgt", "education",
"educationnum", "maritalstatus", "occupation",
"relationship", "race", "sex", "capitalgain",
"capitalloss", "hoursperweek", "nativecountry",
"incomelevel"]
df = df[df['incomelevel']=='<=50K']
print(df.head())

  age         workclass  fnlwgt  education  educationnum       maritalstatus  ...     sex capitalgain capitalloss hoursperweek  nativecountry  incomelevel
0   39         State-gov   77516  Bachelors            13       Never-married  ...    Male        2174           0           40  United-States        <=50K
1   50  Self-emp-not-inc   83311  Bachelors            13  Married-civ-spouse  ...    Male           0           0           13  United-States        <=50K
2   38           Private  215646    HS-grad             9            Divorced  ...    Male           0           0           40  United-States        <=50K
3   53           Private  234721       11th             7  Married-civ-spouse  ...    Male           0           0           40  United-States        <=50K
4   28           Private  338409  Bachelors            13  Married-civ-spouse  ...  Female           0           0           40           Cuba        <=50K

Selecting rows based on certain column values returns empty dataframe

Question

1 answers

solution1
2 ACCPTED 2020-06-05 16:23:22

Selecting rows based on certain column values returns empty dataframe

Question

1 answers

solution1 2 ACCPTED 2020-06-05 16:23:22

solution1
2 ACCPTED 2020-06-05 16:23:22