Pandas - disappearing values in value_counts()
I started this question yesterday and have done more work on it.
Thanks @AMC, @ALollz
I have a dataframe of surgical activity data that has 58 columns and 200,000 records. Each row corresponds to a patient encounter. I want to see the relative contribution of medical specialties. One of the columns is 'TRETSPEF' = treatment_specialty. I have used `pd.read_csv('csv', usecols=['TRETSPEF'])` to import the series.
df
TRETSPEF
0 150
1 150
2 150
3 150
4 150
... ...
218462 150
218463 &
218464 150
218465 150
218466 218
The most common treatment specialty is neurosurgery (code 150). So here's the problem. When I apply .value_counts() I get two groups for the 150 code (and the 218 code):
df['TRETSPEF'].value_counts()
150 140411
150 40839
218 13692
108 10552
218 4143
...
501 1
120 1
302 1
219 1
106 1
Name: TRETSPEF, Length: 69, dtype: int64
There are some '&' values in there (454), so I wondered whether the fact that they aren't integers was messing things up. I replaced them with empty strings and ran value_counts() again.
df['TRETSPEF'].str.replace("&", "").value_counts()
150 140411
218 13692
108 10552
800 858
110 835
811 692
191 580
323 555
454
100 271
400 116
420 47
301 45
812 38
214 24
215 23
180 22
300 17
370 15
421 11
258 11
314 5
422 4
260 4
192 4
242 4
171 4
350 2
307 2
302 2
328 2
160 1
219 1
120 1
107 1
101 1
143 1
501 1
144 1
320 1
104 1
106 1
430 1
264 1
Name: TRETSPEF, dtype: int64
So now I seem to have lost the second group of 150 (about 40,000 records) by replacing '&'. The blank values are still showing up in .value_counts() though. The length of the series has gone down from 69 to 45. I tried stripping whitespace, which made no difference. I'm not sure what tests to run to see why this is happening. I feel it must somehow be due to the data.
This is 100% a data cleansing issue. Try forcing the column to be numeric:
pd.to_numeric(df['TRETSPEF'], errors='coerce').value_counts()
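To see why this is likely to help, here is a minimal sketch on made-up data. It assumes the column ended up with object dtype holding a mix of Python ints and strings, which would match the symptoms (two separate "150" groups, and one group vanishing after `.str.replace`), but that has not been confirmed against the real file. The toy series `s` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical toy column mixing ints and strings (an assumption about
# what the real 'TRETSPEF' column looks like after read_csv).
s = pd.Series([150, 150, "150", "&", 218, "218"], name="TRETSPEF", dtype=object)

# Diagnose: count which Python types are actually stored in the column.
type_counts = s.map(lambda v: type(v).__name__).value_counts()
print(type_counts)

# .str methods return NaN for elements that are not strings, and
# value_counts() drops NaN by default, so the int rows seem to vanish.
replaced = s.str.replace("&", "", regex=False)
lost = int(replaced.isna().sum())
print("rows turned into NaN by .str.replace:", lost)

# Fix: coerce everything to numbers; anything unparseable ("&") becomes NaN.
numeric = pd.to_numeric(s, errors="coerce")
counts = numeric.value_counts()
print(counts)
```

If the type check on the real column shows both `int` and `str`, the two 150 groups are simply the integer 150 and the string "150" being counted separately. Another option, if it suits the data, is to pass `dtype={'TRETSPEF': str}` to `read_csv` so that every value arrives as a string in the first place.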