Selecting rows based on certain column values returns empty dataframe

I want to select rows from a dataframe based on the different values of a certain column and make histograms.

import numpy as np
import pandas as pd
import csv
import matplotlib.pyplot as plt

df_train=pd.read_csv(r'C:\users\visha\downloads\1994_census\adult.data')
df_train.columns = ["age", "workclass", "fnlwgt", "education",
"educationnum", "maritalstatus", "occupation",
"relationship", "race", "sex", "capitalgain",
"capitalloss", "hoursperweek", "nativecountry",
"incomelevel"]

df_train.dropna(how='any')
df_train.loc[(df_train!=0).any(axis=1)]
#df_train.incomelevel = pd.to_numeric(df_train.incomelevel, errors='coerce').fillna(0).astype('Int64')
df_train.drop(columns='fnlwgt', inplace = True)

#df_test=pd.read_csv(r'C:\users\visha\downloads\1994_census\adult.test')

#df_train.boxplot(column = 'age', by = 'incomelevel', grid = False)

df_train.loc[df_train['incomelevel'] == '<=50K']
#df_train.loc[df_train['incomelevel'] == '>50K']

Output:

Empty DataFrame
Columns: [age, workclass, fnlwgt, education, educationnum, maritalstatus, occupation, relationship, race, sex, capitalgain, capitalloss, hoursperweek, nativecountry, incomelevel]
Index: []

As the code above shows, I'm trying to select the rows that have an income level of '<=50K'. The 'incomelevel' column has the object dtype. But when I print the result, it just lists all the column names and reports the dataframe as empty. When I run it as is in a Jupyter notebook without the print function, it displays the dataframe with all the column names and nothing under them.

You should read the CSV with skipinitialspace=True because there is a space in front of each value, so the column actually holds ' <=50K' rather than '<=50K'. With that option it works:

df = pd.read_csv('adult.data', header=None, skipinitialspace=True)
df.columns = ["age", "workclass", "fnlwgt", "education",
"educationnum", "maritalstatus", "occupation",
"relationship", "race", "sex", "capitalgain",
"capitalloss", "hoursperweek", "nativecountry",
"incomelevel"]
df = df[df['incomelevel']=='<=50K']
print(df.head())

  age         workclass  fnlwgt  education  educationnum       maritalstatus  ...     sex capitalgain capitalloss hoursperweek  nativecountry  incomelevel
0   39         State-gov   77516  Bachelors            13       Never-married  ...    Male        2174           0           40  United-States        <=50K
1   50  Self-emp-not-inc   83311  Bachelors            13  Married-civ-spouse  ...    Male           0           0           13  United-States        <=50K
2   38           Private  215646    HS-grad             9            Divorced  ...    Male           0           0           40  United-States        <=50K
3   53           Private  234721       11th             7  Married-civ-spouse  ...    Male           0           0           40  United-States        <=50K
4   28           Private  338409  Bachelors            13  Married-civ-spouse  ...  Female           0           0           40           Cuba        <=50K
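
To see why the original comparison matched nothing, the problem can be reproduced without the file. The following is a minimal sketch, assuming the file uses the same "comma followed by a space" layout as adult.data; the three sample rows and the reduced column list are made up for illustration. It also shows str.strip() as an alternative for a frame that is already loaded.

```python
import io
import pandas as pd

# Three made-up rows in the "comma plus space" layout (illustrative data only).
raw = (
    "39, State-gov, <=50K\n"
    "52, Self-emp-inc, >50K\n"
    "38, Private, <=50K\n"
)
cols = ["age", "workclass", "incomelevel"]

# Default parsing keeps the space that follows each comma in string fields.
padded = pd.read_csv(io.StringIO(raw), header=None, names=cols)
print(padded["incomelevel"].unique().tolist())         # values start with a space
print(len(padded[padded["incomelevel"] == "<=50K"]))   # so nothing matches

# Fix 1: drop the padding while parsing.
clean = pd.read_csv(io.StringIO(raw), header=None, names=cols,
                    skipinitialspace=True)
print(len(clean[clean["incomelevel"] == "<=50K"]))

# Fix 2: strip the column of a frame that has already been loaded.
stripped = padded.copy()
stripped["incomelevel"] = stripped["incomelevel"].str.strip()
print(len(stripped[stripped["incomelevel"] == "<=50K"]))
```

Printing the unique values of a column (or their repr) is a quick way to spot this kind of invisible padding before filtering on it.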
