Pandas - disappearing values in value_counts()

I started this question yesterday and have done more work on it.

Thanks @AMC , @ALollz

I have a dataframe of surgical activity data that has 58 columns and 200,000 records. Each row corresponds to a patient encounter. I want to see the relative contribution of medical specialties. One column is 'TRETSPEF' (treatment specialty). I used `pd.read_csv('csv', usecols=['TRETSPEF'])` to import the series.

df
    TRETSPEF
0   150
1   150
2   150
3   150
4   150
... ...
218462  150
218463  &
218464  150
218465  150
218466  218


The most common treatment specialty is neurosurgery (code 150). So here's the problem: when I apply `.value_counts()` I get two groups for the 150 code (and the 218 code).

df['TRETSPEF'].value_counts()
150    140411
150     40839
218     13692
108     10552
218      4143
        ...  
501         1
120         1
302         1
219         1
106         1
Name: TRETSPEF, Length: 69, dtype: int64
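(A likely cause of duplicate groups like this is a mixed-dtype object column: if some rows hold the integer 150 and others the string '150', `value_counts()` treats them as distinct keys. A minimal sketch with made-up values:)

```python
import pandas as pd

# Hypothetical mixed-type column: the int 150 and the str '150' are
# different hash keys, so value_counts reports them as separate groups.
s = pd.Series([150, 150, '150', 218, '218', '&'])
print(s.value_counts())
```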

There are some '&' values in there (454 of them), so I wondered whether the non-integer entries were messing things up. I replaced them with empty strings and ran value_counts again.


df['TRETSPEF'].str.replace("&", "").value_counts()
150    140411
218     13692
108     10552
800       858
110       835
811       692
191       580
323       555
          454
100       271
400       116
420        47
301        45
812        38
214        24
215        23
180        22
300        17
370        15
421        11
258        11
314         5
422         4
260         4
192         4
242         4
171         4
350         2
307         2
302         2
328         2
160         1
219         1
120         1
107         1
101         1
143         1
501         1
144         1
320         1
104         1
106         1
430         1
264         1
Name: TRETSPEF, dtype: int64

So now I seem to have lost the second group of 150 (about 40,000 records) by replacing the '&'. The blank values are still showing up in `.value_counts()`, though. The length of the series has gone down to 45 from 69. I tried stripping whitespace, with no difference. I'm not sure what tests to run to see why this is happening; I feel it must somehow be due to the data.
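(One pandas behaviour worth knowing here: `.str` accessor methods only operate on string elements and return NaN for everything else. So if the missing ~40,000 rows held integer 150s rather than the string '150', `.str.replace` would have silently turned them into NaN. A small sketch of that behaviour:)

```python
import pandas as pd

# .str methods return NaN for non-string elements of an object Series,
# which would wipe out any integer-typed codes in the column.
s = pd.Series([150, '150', '&'])
out = s.str.replace('&', '')
print(out)  # the int 150 becomes NaN; '&' becomes '' (an empty string, not a null)
```

Note that '&' becomes the empty string `''`, which is still a real value and still gets counted by `value_counts()`, which would explain the unlabeled row of 454.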

This is 100% a data cleansing issue. Try forcing the column to be numeric:

pd.to_numeric(df['TRETSPEF'], errors='coerce').value_counts()
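(Expanded into a runnable sketch with made-up sample values: `errors='coerce'` parses the string codes, so '150' and 150 collapse into one numeric group, and unparseable entries like '&' become NaN.)

```python
import pandas as pd

# Made-up sample mixing int, str, whitespace-padded, and junk values.
df = pd.DataFrame({'TRETSPEF': [150, '150', ' 150', 218, '218', '&']})

# Coerce to numeric: every form of 150 becomes 150.0, '&' becomes NaN.
cleaned = pd.to_numeric(df['TRETSPEF'], errors='coerce')
print(cleaned.value_counts(dropna=False))
```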
