Selecting rows based on certain column values returns empty dataframe
I want to select rows from a dataframe based on the different values of a particular column and plot a histogram for each.
import numpy as np
import pandas as pd
import csv
import matplotlib.pyplot as plt
df_train=pd.read_csv(r'C:\users\visha\downloads\1994_census\adult.data')
df_train.columns = ["age", "workclass", "fnlwgt", "education",
"educationnum", "maritalstatus", "occupation",
"relationship", "race", "sex", "capitalgain",
"capitalloss", "hoursperweek", "nativecountry",
"incomelevel"]
df_train.dropna(how='any')
df_train.loc[(df_train!=0).any(axis=1)]
#df_train.incomelevel = pd.to_numeric(df_train.incomelevel, errors='coerce').fillna(0).astype('Int64')
df_train.drop(columns='fnlwgt', inplace = True)
#df_test=pd.read_csv(r'C:\users\visha\downloads\1994_census\adult.test')
#df_train.boxplot(column = 'age', by = 'incomelevel', grid = False)
df_train.loc[df_train['incomelevel'] == '<=50K']
#df_train.loc[df_train['incomelevel'] == '>50K']
Output:
Empty DataFrame
Columns: [age, workclass, fnlwgt, education, educationnum, maritalstatus, occupation, relationship, race, sex, capitalgain, capitalloss, hoursperweek, nativecountry, incomelevel]
Index: []
As the code above shows, I'm trying to select the rows whose income level is '<=50K'. The 'incomelevel' column has object dtype. But when I print the result, I get only the column names and the dataframe is reported as empty. When I run it as is in a Jupyter notebook, without print, it displays the column headers with no rows under them.
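One way to see why a comparison like this matches nothing is to print the distinct values of the column as a plain Python list, so that the quotes show any stray whitespace. The sketch below is only an illustration: the inline rows and the names `sample`, `df_check`, `values` and `matches` are assumptions made for the example, not the real adult.data file.

```python
import io
import pandas as pd

# Illustrative sample using the ", " separated layout of adult.data
sample = io.StringIO(
    "39, State-gov, <=50K\n"
    "52, Self-emp-not-inc, >50K\n"
)
df_check = pd.read_csv(sample, header=None,
                       names=["age", "workclass", "incomelevel"])

# A list of Python strings prints with quotes, so a leading space is visible
values = df_check["incomelevel"].unique().tolist()
print(values)

# The comparison from the question finds no rows in this sample
matches = df_check.loc[df_check["incomelevel"] == "<=50K"]
print(len(matches))
```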
You should read the CSV with skipinitialspace=True, because the values in the file are preceded by a space, so the column actually contains ' <=50K' rather than '<=50K'. With that option the selection works:
df = pd.read_csv('adult.data', header=None, skipinitialspace=True)
df.columns = ["age", "workclass", "fnlwgt", "education",
"educationnum", "maritalstatus", "occupation",
"relationship", "race", "sex", "capitalgain",
"capitalloss", "hoursperweek", "nativecountry",
"incomelevel"]
df = df[df['incomelevel']=='<=50K']
print(df.head())
age workclass fnlwgt education educationnum maritalstatus ... sex capitalgain capitalloss hoursperweek nativecountry incomelevel
0 39 State-gov 77516 Bachelors 13 Never-married ... Male 2174 0 40 United-States <=50K
1 50 Self-emp-not-inc 83311 Bachelors 13 Married-civ-spouse ... Male 0 0 13 United-States <=50K
2 38 Private 215646 HS-grad 9 Divorced ... Male 0 0 40 United-States <=50K
3 53 Private 234721 11th 7 Married-civ-spouse ... Male 0 0 40 United-States <=50K
4 28 Private 338409 Bachelors 13 Married-civ-spouse ... Female 0 0 40 Cuba <=50K
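If the file has already been loaded without skipinitialspace=True, a possible alternative is to strip the whitespace from the string columns afterwards with the `.str.strip()` accessor. This is a minimal sketch under the same assumption of space-padded values. The inline rows and the names `sample`, `df_raw`, `df_clean` and `low_income` are illustrative and not part of the original answer.

```python
import io
import pandas as pd

# Illustrative sample using the ", " separated layout of adult.data
sample = io.StringIO(
    "39, State-gov, <=50K\n"
    "52, Self-emp-not-inc, >50K\n"
    "38, Private, <=50K\n"
)
df_raw = pd.read_csv(sample, header=None,
                     names=["age", "workclass", "incomelevel"])

# Remove leading and trailing whitespace from each text column
df_clean = df_raw.copy()
for col in ["workclass", "incomelevel"]:
    df_clean[col] = df_clean[col].str.strip()

# The exact-match filter now finds the rows
low_income = df_clean[df_clean["incomelevel"] == "<=50K"]
print(len(low_income))
```

From there, a histogram for one group could be drawn with something like `low_income["age"].plot.hist()`, which needs matplotlib installed.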