I'm trying to subset data in a pandas dataframe based on values that exist in a separate array. Below is a toy example that works and illustrates what I'm trying to do:
import pandas as pd
import numpy as np
mysubset = np.array([1,2,3,4])
d = {'col1': [1, 2, 3, 4, 5, 6], 'col2': [3, 4, 1, 3, 5, 5]}
df = pd.DataFrame(data=d)
df[df['col1'].isin(mysubset)]
Using that working code as a prototype, I'm implementing (what I think is) the same process on my real data, but it doesn't work. My real data look like
>>> tmp.head()
   ItemID                  P0
44  26785         0.276844507
61  26534  1.4108438640000001
71  14107  1.0652574239999999
86  26530  1.1059459039999999
93  18142         0.903011679
and the array I want to use for subsetting is
>>> op_items
array([18692, 18694, 18696, 18706, 18711, 18714, 18716, 18722, 19332,
19333, 26526, 26527, 26530, 26532, 26533, 26534, 26535, 26536,
26538, 26541, 14107, 14110, 14120, 14149, 14165, 17984, 18004,
18005, 18006, 18007, 18008, 18134, 18136, 18139, 18141, 18142,
19081, 19084, 19086, 20789, 20794, 20796, 20800, 20802, 26784,
26785, 26786, 26787], dtype=int64)
Using this as in the toy example above gives
>>> tmp[tmp['ItemID'].isin(op_items)]
Empty DataFrame
Columns: [ItemID, P0]
Index: []
But, manually grabbing some elements from within a list does work:
>>> tmp[tmp['ItemID'].isin(['18692', '18696'])]
    ItemID           P0
236  18696  0.566035305
624  18692   0.60981902
Using the following confirms they are of the same form as in the toy example
>>> type(op_items)
<class 'numpy.ndarray'>
>>> type(tmp['ItemID'])
<class 'pandas.core.series.Series'>
So, I am uncertain what other mistake I am making and could use a pointer. I realize that in the example where I hardcoded the values, I put them in a list. But the toy example above uses isin where mysubset is an array similar to op_items.
Thank you. My question differs from this one in that I'm not worried about duplicates: subset pandas dataframe with corresponding numpy array.
Your op_items is an array of integers, whereas your tmp['ItemID'] column holds strings, so isin finds no matches between them. Convert the column to integers first:
tmp['ItemID'] = tmp['ItemID'].astype('Int64')
tmp[tmp['ItemID'].isin(op_items)]
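As a minimal sketch of how the mismatch can be diagnosed and worked around without overwriting the column, assuming the IDs are stored as numeric strings: the small tmp and op_items below are hypothetical stand-ins for the real data, and pd.to_numeric is swapped in here as an alternative to astype, not as the method the answer above uses. Note that type() only reports the container (Series, ndarray), while dtype reports what the elements are.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real data: ItemID stored as strings.
tmp = pd.DataFrame(
    {
        "ItemID": ["26785", "26534", "14107", "99999"],
        "P0": [0.276844507, 1.410843864, 1.065257424, 0.5],
    },
    index=[44, 61, 71, 86],
)
op_items = np.array([26785, 26534, 14107, 18142], dtype=np.int64)

# Compare element types, not container types.
print(tmp["ItemID"].dtype)  # a string/object dtype, not an integer one
print(op_items.dtype)       # int64

# The string "26785" is not equal to the integer 26785, so nothing matches.
no_match = tmp[tmp["ItemID"].isin(op_items)]

# Option A: compare as integers, leaving tmp itself unchanged.
by_int = tmp[pd.to_numeric(tmp["ItemID"]).isin(op_items)]

# Option B: compare as strings by converting the lookup values instead.
by_str = tmp[tmp["ItemID"].isin([str(i) for i in op_items])]

print(by_int)
```

Both options select the same rows here. Option B may be preferable if the ID column can contain values that are not valid integers, since pd.to_numeric raises on those by default.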