I have a pandas DataFrame with 3 columns, where the values column has dtype float:
df = pd.DataFrame()
df['category'] = ['a', 'b', 'b', 'b', 'c', 'a', 'b', 'c', 'c', 'a']
df['date'] = ['2018-01-01', '2018-01-01', '2018-01-03', '2018-01-05', '2018-01-01', '2018-01-02', '2018-01-06', '2018-01-03', '2018-01-04', '2018-01-01']
df['values'] = [1, 2, -1.5, 2.3, 5, -0.7, -5.2, -5.2, 1, -1.1]
df
Per category, I want to filter for a pair of rows - one with a positive value and one with a negative value - whose dates are closest together (least date difference).
So, essentially an output that looks like:
df = pd.DataFrame()
df['category'] = ['a', 'a','b', 'b', 'c', 'c']
df['date'] = ['2018-01-01', '2018-01-01', '2018-01-01', '2018-01-03', '2018-01-01', '2018-01-03']
df['values'] = [1, -1.1, 2, -1.5, 5, -5.2]
df
I have looked at similar questions on SO (Identifying closest value in a column for each filter using Pandas, How do I find the closest values in a Pandas series to an input number?). The first one uses idxmin, which returns the first occurrence, not the closest in value. The second one is about matching a specific input value - I don't think a pure np.argsort works in my case.
I can imagine using a complex web of if statements to do this, but I'm not sure what the most efficient way to do it with pandas is.
Any guidance would be greatly appreciated.
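For reference, the brute-force loop I have in mind looks something like this (a sketch, assuming the selection I want is "earliest positive and earliest negative value per category, in date order"):

```python
import pandas as pd

df = pd.DataFrame({
    'category': ['a', 'b', 'b', 'b', 'c', 'a', 'b', 'c', 'c', 'a'],
    'date': ['2018-01-01', '2018-01-01', '2018-01-03', '2018-01-05',
             '2018-01-01', '2018-01-02', '2018-01-06', '2018-01-03',
             '2018-01-04', '2018-01-01'],
    'values': [1, 2, -1.5, 2.3, 5, -0.7, -5.2, -5.2, 1, -1.1],
})

rows = []
# Sort by date so iloc[0] below is the earliest row of each sign
for cat, grp in df.sort_values('date').groupby('category'):
    pos = grp[grp['values'] >= 0]
    neg = grp[grp['values'] < 0]
    if not pos.empty and not neg.empty:
        rows.append(pos.iloc[0])  # earliest positive value in this category
        rows.append(neg.iloc[0])  # earliest negative value in this category
result = pd.DataFrame(rows)
print(result)
```

This produces the desired output, but it feels clumsy compared to a vectorised pandas solution.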
IIUC, sort your dataframe first, then use idxmin:
df1 = df.sort_values(['category', 'date'])
df1[df1.groupby('category')['values']
       .transform(lambda x: x.index.isin([x.ge(0).idxmin(), x.lt(0).idxmin()]))]
Output:
category date values
0 a 2018-01-01 1.0
9 a 2018-01-01 -1.1
1 b 2018-01-01 2.0
2 b 2018-01-03 -1.5
4 c 2018-01-01 5.0
7 c 2018-01-03 -5.2
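Why this works: idxmin on a boolean Series returns the index label of the first False (since False < True), so after sorting by date, x.ge(0).idxmin() is the first negative row per category and x.lt(0).idxmin() is the first positive one. A minimal sketch using the sorted values of category 'a' from the frame above:

```python
import pandas as pd

# Category 'a' after sorting by date: indices 0, 9, 5 of the original frame
s = pd.Series([1.0, -1.1, -0.7], index=[0, 9, 5])

# idxmin on a boolean Series gives the label of the first False
first_negative = s.ge(0).idxmin()  # first row where value < 0
first_positive = s.lt(0).idxmin()  # first row where value >= 0
print(first_negative, first_positive)  # → 9 0
```

Note this picks the first positive/negative row in date order per category, which matches your expected output.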