
Leading or trailing whitespace and pandas value_counts vs boolean selection

I am working with a dataframe created from a csv file downloaded from my county's Sheriff's Department. The data is located here and can be read in using read_csv(). The dataframe contains information about incidents reported to and acted upon by the Sheriff. One of the columns is the city in which the incident occurred, and I'm trying to create a table and graph showing the change in the number of incidents for my area (Larkfield) over time.

When I call pandas' value_counts function on the "city" column, I get

In [86]: compcounts = soco['city'].value_counts()
In [96]: compcounts[0:10]
Out[96]:
SANTA ROSA              55291
WINDSOR                 31711
SONOMA                  28840
GUERNEVILLE              9309
BOYES HOT SPRINGS        8006
PETALUMA                 6103
EL VERANO                5969
GEYSERVILLE              5822
LARKFIELD                5398
FORESTVILLE              5312
dtype: int64

There are 5398 reports for my area ('Larkfield'). But when I try to get a subset of the dataframe for my area, using

larkfieldcomps = soco[soco['city'] == "LARKFIELD"]

it returns only 115 values, not 5398:

In [94]: larkcounts = larkfieldcomps['year'].value_counts()
In [95]: larkcounts
Out[95]:
2015    114
2013      1
dtype: int64

I thought maybe the problem was that some entries had one or more spaces before or after "LARKFIELD" in the incident description, so I did a search/replace to try to strip out any spaces. I still get only 115 values when searching for "LARKFIELD", even though I know there are many more incidents in that area.

This is my first question on Stack Overflow. I've researched this to death but haven't come up with an answer yet. Any suggestions would be appreciated.

I can somewhat explain this after downloading the data (and reading it into a dataframe with read_csv using default settings). It appears that there are leading or trailing spaces in the city values. At first it looked as if value_counts ignores them when adding things up (I test that assumption further down), whereas boolean selection compares the strings exactly.

>>> soco[soco['city'] == "LARKFIELD"].city.count()
122

>>> soco['city2'] = soco.city.str.strip()

>>> soco[soco['city2'] == "LARKFIELD"].city.count()
5520

And when I look a little closer it seems that 5398 have 11 trailing spaces and 122 have no spaces. So that's the difference. (I'm not sure why you find 115 values for year instead of 122, but that's most likely due to some missing values for year, however you created it.)
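One way to "look a little closer" at padding like this is to tabulate each stripped name against the raw string length. The sketch below uses a small synthetic Series as a stand-in for soco['city'] (the values and counts are made up for illustration, not taken from the Sheriff's data), and the names city, summary and counts are hypothetical.

```python
import pandas as pd

# Synthetic stand-in for soco['city']; not the real data.
city = pd.Series(
    ["LARKFIELD" + " " * 11] * 3   # padded with 11 trailing spaces
    + ["LARKFIELD"] * 2            # no padding
    + ["WINDSOR"]
)

# A plain Python list prints with quotes, so trailing spaces become visible.
print(city.unique().tolist())

# Count rows per (stripped name, raw length) pair to see which variants exist.
summary = pd.DataFrame({"name": city.str.strip(), "length": city.str.len()})
counts = summary.groupby(["name", "length"]).size()
print(counts)
```

A name that shows up with more than one length has at least two spellings in the raw column, which is the situation described above for LARKFIELD.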

But then I did this to double check the behavior of value_counts because I had been assuming that leading and trailing spaces would matter.

>>> pd.Series( [' foo','foo','foo '] ).value_counts() 
foo     1
foo     1
 foo    1

And, yeah, in this simple example leading and trailing blanks do indeed matter. But they don't in your 'soco' dataframe???
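One reading that is consistent with the numbers above: value_counts does not merge the padded and unpadded spellings in soco either. It may simply be that the 5398 shown in the top ten belongs to the padded key, whose trailing spaces are invisible in the printed index, while the 122 exact matches sit further down the list as a separate key. This is a guess from the counts, not something verified against the original file. A minimal sketch with synthetic data (the Series and variable names are hypothetical):

```python
import pandas as pd

padded_key = "LARKFIELD" + " " * 11

# Synthetic stand-in: 5 padded rows and 2 unpadded rows.
city = pd.Series([padded_key] * 5 + ["LARKFIELD"] * 2)

counts = city.value_counts()
print(counts)                  # both keys look like "LARKFIELD" here
print(counts.index.tolist())   # the list repr reveals the padding

# Boolean selection compares strings exactly, so only unpadded rows match.
exact_matches = (city == "LARKFIELD").sum()

# After stripping, both spellings match.
stripped_matches = (city.str.strip() == "LARKFIELD").sum()
```

In this sketch value_counts reports the two spellings as separate entries, and the exact comparison finds only the unpadded rows, which mirrors the 5398 versus 122 split.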

So there are still some loose ends, but hopefully this is a good start for figuring out what is happening.
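If the padding is the culprit, one option is to strip the column while loading rather than afterwards. The sketch below passes str.strip through the converters argument of read_csv and then repeats the per-year count from the question. The inline CSV text, its two-column layout, and the names soco_demo, larkfield and per_year are assumptions made for illustration; the real file has more columns and the question does not show how 'year' was created.

```python
import io
import pandas as pd

# Synthetic CSV standing in for the downloaded file.
csv_text = (
    "city,year\n"
    "LARKFIELD           ,2014\n"
    "LARKFIELD,2015\n"
    " WINDSOR ,2015\n"
)

# converters applies str.strip to every raw value in the 'city' column.
soco_demo = pd.read_csv(io.StringIO(csv_text), converters={"city": str.strip})

# Exact comparison now matches both spellings of LARKFIELD.
larkfield = soco_demo[soco_demo["city"] == "LARKFIELD"]
per_year = larkfield["year"].value_counts().sort_index()
print(per_year)
```

Stripping after loading with soco['city'].str.strip(), as in the answer above, gives the same cleaned column; doing it at read time just avoids carrying a second column around.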
