Leading or trailing whitespace and pandas value_counts vs boolean selection

I am working with a dataframe created from a csv file downloaded from my county's Sheriff's Department. The data is located here and can be read in using read_csv(). The dataframe contains information about incidents reported to and acted upon by the Sheriff. One of the columns is the city in which the incident occurred, and I'm trying to create a table and graph showing the change in the number of incidents for my area (Larkfield) over time.

When I call pandas' value_counts function on the "city" column, I get

In [86]: compcounts = soco['city'].value_counts()
In [96]: compcounts[0:10]
Out[96]:
SANTA ROSA              55291
WINDSOR                 31711
SONOMA                  28840
GUERNEVILLE              9309
BOYES HOT SPRINGS        8006
PETALUMA                 6103
EL VERANO                5969
GEYSERVILLE              5822
LARKFIELD                5398
FORESTVILLE              5312
dtype: int64

There are 5398 reports for my area ("LARKFIELD"). But when I try to get a subset of the dataframe for my area, using

larkfieldcomps = soco[soco['city'] == "LARKFIELD"]

it returns only 115 values, not 5398:

In [94]: larkcounts = larkfieldcomps['year'].value_counts()
In [95]: larkcounts
Out[95]:
2015    114
2013      1
dtype: int64

I thought maybe the problem was that some entries had one or more spaces before or after "LARKFIELD" in the incident description, so I did a search/replace to try to strip out any spaces. I still get only 115 values when selecting on "LARKFIELD", even though I know there are many more incidents in that area.
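One way to check the whitespace suspicion directly, rather than by search/replace, is to look at the repr and the length of each distinct value. The sketch below uses a small made-up frame (`demo` is hypothetical, not the Sheriff's data) in which the same city appears with and without trailing padding.

```python
import pandas as pd

# Hypothetical stand-in for the real data: one city name, with and without padding.
demo = pd.DataFrame({"city": ["LARKFIELD" + " " * 11] * 3 + ["LARKFIELD"] * 2})

# repr() wraps each value in quotes, so leading/trailing spaces become visible.
distinct = [repr(c) for c in demo["city"].unique()]
print(distinct)

# String lengths expose the same difference numerically.
lengths = demo["city"].str.len().value_counts().to_dict()
print(lengths)
```

If every value of a city shows the same length, padding is not the cause and the mismatch has to come from somewhere else.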

This is my first question on Stack Overflow. I've researched this to death but haven't come up with an answer yet. Any suggestions would be appreciated.

I can partly explain this after downloading the data (and reading it into a dataframe with read_csv using default settings). It appears that there are leading or trailing spaces in there. Apparently value_counts is smart enough to ignore this when adding things up, but the boolean selection is much more literal.

>>> soco[soco['city'] == "LARKFIELD"].city.count()
122

>>> soco['city2'] = soco.city.str.strip()

>>> soco[soco['city2'] == "LARKFIELD"].city.count()
5520

And when I look a little closer, it seems that 5398 of those rows have 11 trailing spaces and 122 have no spaces, which together make up the 5520 matches after stripping. So that's the difference. (I'm not sure why you find 115 values for year instead of 122, but that's most likely due to some missing values for year, however you created it.)
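A minimal sketch of the fix, assuming you are happy to overwrite the column rather than keep a separate `city2`: strip the whitespace once, then do the boolean selection. The frame `soco_demo` and its values are invented for illustration; only `str.strip()` and the comparison come from the steps above.

```python
import pandas as pd

# Hypothetical miniature of the real frame: padded and unpadded city names.
soco_demo = pd.DataFrame({
    "city": ["LARKFIELD" + " " * 11, "LARKFIELD", " WINDSOR", "LARKFIELD" + " " * 11],
    "year": [2013, 2015, 2014, 2014],
})

# Before cleaning, only the exactly matching row is selected.
raw_matches = int((soco_demo["city"] == "LARKFIELD").sum())

# Strip leading and trailing whitespace in place.
soco_demo["city"] = soco_demo["city"].str.strip()

# After cleaning, the boolean selection and value_counts agree.
clean_matches = int((soco_demo["city"] == "LARKFIELD").sum())
counts = soco_demo["city"].value_counts()
print(raw_matches, clean_matches, counts["LARKFIELD"])
```

Cleaning the column right after read_csv means every later selection, groupby, or value_counts sees the same normalized keys.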

But then I did this to double-check the behavior of value_counts, because I had been assuming that leading and trailing spaces would matter.

>>> pd.Series( [' foo','foo','foo '] ).value_counts() 
foo     1
foo     1
 foo    1

And, yeah, in this simple example leading and trailing blanks do indeed matter. But they don't in your 'soco' dataframe???
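A possible explanation, offered as an inference from the numbers in this thread and not verified against the file: value_counts never merges padded and unpadded strings, but the printed Series left-aligns its index, so trailing spaces cannot be seen. Under that reading, the 5398 shown next to LARKFIELD belongs to the padded value alone, and the 122 unpadded rows sit in a separate entry further down the list. The sketch below shows the keys staying distinct once they are printed with repr.

```python
import pandas as pd

s = pd.Series([" foo", "foo", "foo "])
counts = s.value_counts()

# The default printout pads the index, hiding trailing spaces.
print(counts)

# repr() of each key shows that all three are kept as separate entries.
keys = sorted(repr(k) for k in counts.index)
print(keys)
```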

So there are still some loose ends here, but hopefully this is a good start for figuring out what is happening.
