I am working with a dataframe created from a csv file downloaded from my county's Sheriff's Department. The data is located here and can be read in using read_csv(). The dataframe contains information about incidents reported to and acted upon by the Sheriff. One of the columns is the city in which the incident occurred, and I'm trying to create a table and graph showing the change in number of incidents for my area (Larkfield) over time.
When I use pandas' value_counts function on the "city" column, I get
In [86]: compcounts = soco['city'].value_counts()
In [96]: compcounts[0:10]
Out[96]:
SANTA ROSA 55291
WINDSOR 31711
SONOMA 28840
GUERNEVILLE 9309
BOYES HOT SPRINGS 8006
PETALUMA 6103
EL VERANO 5969
GEYSERVILLE 5822
LARKFIELD 5398
FORESTVILLE 5312
dtype: int64
There are 5398 reports for my area ('Larkfield'). But when I try to get a subset of the dataframe for my area, using
larkfieldcomps = soco[soco['city'] == "LARKFIELD"]
it returns only 115 values, not 5398:
In [94]: larkcounts = larkfieldcomps['year'].value_counts()
In [95]: larkcounts
Out[95]:
2015 114
2013 1
dtype: int64
I thought maybe the problem was that in some entries there was one or more spaces before or after "LARKFIELD" in the incident description, so I did a search/replace to try to strip out any spaces, but I still get only 115 values when searching by "LARKFIELD," even though I know there are many more incidents in that area.
This is my first question on Stack Overflow. I've researched this to death but haven't come up with an answer yet. Any suggestions would be appreciated.
I can somewhat explain this after downloading the data (and reading it into a dataframe with read_csv using default settings). It appears that there are leading or trailing spaces in there. Apparently value_counts is smart enough to ignore this when adding things up but the boolean selection is much more literal.
>>> soco[soco['city'] == "LARKFIELD"].city.count()
122
>>> soco['city2'] = soco.city.str.strip()
>>> soco[soco['city2'] == "LARKFIELD"].city.count()
5520
And when I look a little closer it seems that 5398 have 11 trailing spaces and 122 have no spaces. So that's the difference. (I'm not sure why you find 115 values for year instead of 122, but that's most likely due to some missing values for year, however you created it.)
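For reference, here is a minimal sketch of how the padding can be measured and then removed before filtering. It uses a small made-up dataframe rather than the real file, so the values and counts are purely illustrative; only the column names city and year are taken from the question.

```python
import pandas as pd

# Synthetic stand-in for the real data (an assumption, not the actual file):
# some city values carry trailing padding, others do not.
soco = pd.DataFrame({
    "city": ["LARKFIELD" + " " * 11] * 3 + ["LARKFIELD"] * 2 + ["WINDSOR  "],
    "year": [2013, 2014, 2014, 2015, 2015, 2015],
})

# Count how many trailing whitespace characters each value has.
padding = soco["city"].str.len() - soco["city"].str.rstrip().str.len()
padding_counts = padding.value_counts().to_dict()
print(padding_counts)

# Normalize the column once, then filter on the cleaned values.
soco["city"] = soco["city"].str.strip()
larkfieldcomps = soco[soco["city"] == "LARKFIELD"]

# Incidents per year for the selected area, in chronological order.
larkcounts = larkfieldcomps["year"].value_counts().sort_index()
print(larkcounts)
```

Stripping the column in place means every later comparison, groupby, or value_counts sees the same key for the padded and unpadded rows.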
But then I did this to double check the behavior of value_counts because I had been assuming that leading and trailing spaces would matter.
>>> pd.Series( [' foo','foo','foo '] ).value_counts()
foo 1
foo 1
foo 1
And, yeah, in this simple example leading and trailing blanks do indeed matter. But they don't in your 'soco' dataframe???
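One possible explanation for that loose end, which I have not verified against the actual file, is that value_counts does keep the padded and unpadded values as separate entries in 'soco' too, and that the trailing spaces are simply invisible when the result is printed. The sketch below uses made-up values to show how repr() on the index labels makes the padding visible.

```python
import pandas as pd

# Made-up values: two padded entries and one unpadded entry.
cities = pd.Series(["LARKFIELD   ", "LARKFIELD   ", "LARKFIELD"])
counts = cities.value_counts()

# The printed labels have no quotes, so trailing spaces cannot be seen.
print(counts)

# repr() on each label shows the quotes and therefore the padding.
labels = [repr(label) for label in counts.index]
print(labels)

# After stripping, the two keys collapse into one.
stripped_counts = cities.str.strip().value_counts()
print(stripped_counts)
```

If the same thing is happening in the real data, the 5398 shown in the top-ten list would belong to the padded key, and the unpadded "LARKFIELD" would appear further down the full value_counts output.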
So there are still some loose ends here, but hopefully this is a good start for figuring out what is happening here.