Deleting rows in pandas dataframe based on pair value

Question

I have a dataframe as below:

df = pd.DataFrame({'User':['a','a','a','b','b','b'],
                 'Type':['101','102','101','101','101','102'],
                 'Qty':[10, -10, 10, 30, 5, -5]})

I want to remove each pair of rows with df['Type'] = 101 and 102 whose df['Qty'] values net each other off. The end result would be:

df = pd.DataFrame({'User':['a','b'],
                     'Type':['101', '101'],
                     'Qty':[10, 30]})

I tried converting the negative values to absolute values and removing duplicates:

df['Qty'] = df['Qty'].abs()
df.drop_duplicates(subset=['Qty'], keep='first')

But that wrongly gives me this dataframe:

df = pd.DataFrame({'User':['a','b', 'b'],
                     'Type':['101', '101', '101'],
                     'Qty':[10, 30, 5]})

Answer 1

The idea is to create combinations of index values within each group, test whether each pair contains both Types and sums to 0, and collect the matched pairs in a set:

# the solution needs unique index values
df = df.reset_index(drop=True)

from itertools import combinations

out = set()
def f(x):
    # test every pair of rows within one User group
    for i in combinations(x.index, 2):
        a = x.loc[list(i)]
        # keep the pair if it has one row of each Type and the quantities cancel
        if (set(a['Type']) == {'101', '102'}) and (a['Qty'].sum() == 0):
            out.add(i)

df.groupby('User').apply(f)

print (out)
{(0, 1), (4, 5), (1, 2)}

Then drop every pair that reuses a row already claimed by another pair, like (1, 2) here:

s = pd.Series(list(out)).explode()
idx = s.index[s.duplicated()]
final = s.drop(idx)
print (final)
0    0
0    1
1    4
1    5
dtype: object
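
Because out is a set, the order in which its pairs are processed is not guaranteed, so which of two overlapping pairs gets dropped can vary between runs. A minimal sketch of the same step, assuming it is acceptable to process the pairs in sorted order (the names pairs, dup_pairs and rows_to_drop are illustrative, not part of the answer):

```python
import pandas as pd

# Pairs as printed in the answer; sorting fixes the processing order
out = {(0, 1), (4, 5), (1, 2)}
pairs = sorted(out)  # [(0, 1), (1, 2), (4, 5)]

# After explode, the index is the pair number and the values are row labels
s = pd.Series(pairs).explode()

# A row label seen for the second time marks a pair that reuses a row
dup_pairs = s.index[s.duplicated()]

# Drop every entry of the offending pair(s) and keep the remaining row labels
final = s.drop(dup_pairs)
rows_to_drop = final.tolist()
print(rows_to_drop)  # [0, 1, 4, 5]
```

Here pair number 1, which is (1, 2), is discarded because row 1 was already used by (0, 1).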

Finally, remove those rows from the original:

df = df.drop(final)
print (df)
  User Type  Qty
2    a  101   10
3    b  101   30

Answer 2

If there are only two 'Type's (in this case 101 and 102), then you could write a custom function as follows:

Build a dictionary whose keys are the absolute values of 'Qty'.
Each value of the dictionary is a list of the 'Type' values seen for that 'Qty'.

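from collections import defaultdict
def f(x):
    # maps abs(Qty) -> list of 'Type' values not yet netted off
    new = defaultdict(list)
    for k, v in x[['Type', 'Qty']].itertuples(index=False, name=None):
        if not new[abs(v)]:
            new[abs(v)].append(k)
        elif new[abs(v)][-1] != k:
            # opposite Type with the same absolute Qty: the pair nets off
            new[abs(v)].pop()
        else:
            new[abs(v)].append(k)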
    # the index holds the absolute 'Qty' and the values hold the 'Type's
    return pd.Series(new, name='Type').rename_axis(index='Qty')

The logic is simple:

Whenever a new key is encountered, add its corresponding 'Type' to the list.
If the key already exists, check whether the last 'Type' added is equal to the current 'Type'. If they don't match, for example if new = {10: ['101']} and the current 'Type' is '102', remove '101'. So new = {10: []}.
If the key already exists and the last 'Type' matches the current 'Type', simply append the current 'Type' to the list. For example, if new = {10: ['101']} and the current 'Type' is '101', append to it. So new = {10: ['101', '101']}.

df.groupby('User').apply(f).explode().dropna().reset_index()

  User  Qty Type
0    a   10  101
1    b   30  101
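
The dictionary logic described in this answer can be traced without pandas. The following is only a sketch of the same three rules on plain (Type, Qty) tuples; the helper name net_off and the two sample lists are made up for illustration and mirror the question's data for users a and b:

```python
from collections import defaultdict

def net_off(rows):
    """rows: iterable of (type, qty) tuples belonging to one user."""
    open_types = defaultdict(list)
    for typ, qty in rows:
        stack = open_types[abs(qty)]
        if stack and stack[-1] != typ:
            # same absolute quantity, other Type: the two rows cancel
            stack.pop()
        else:
            # first occurrence, or same Type again: keep it for now
            stack.append(typ)
    # discard quantities whose rows were all netted off
    return {qty: types for qty, types in open_types.items() if types}

user_a = [('101', 10), ('102', -10), ('101', 10)]
user_b = [('101', 30), ('101', 5), ('102', -5)]
print(net_off(user_a))  # {10: ['101']}
print(net_off(user_b))  # {30: ['101']}
```

Note that, like the answer's function, this matches on the absolute quantity and the Type only, so it does not check that the two quantities actually have opposite signs.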

Answer 3

Iterating over all records and saving matches in a list that ensures no index is paired more than once seems to work here.


import pandas as pd

df = pd.DataFrame({'User':['a','a','a','b','b','b'],
                 'Type':['101','102','101','101','101','102'],
                 'Qty':[10, -10, 10, 30, 5, -5]})



# create a list to collect all indices that we are going to remove
records_to_remove = []
# a dictionary to map which group mirrors the other
pair = {'101': '102', '102':'101'}

# let's go over each row one by one,
for i in df.index:
    # i is an index label, so look the row up by label rather than by position
    current_record = df.loc[i]
    # if we haven't stored this index already for removal
    if i not in records_to_remove:
        pair_type = pair[current_record['Type']]
        pair_quantity = -1*current_record['Qty']
        # search for all possible matches to this row
        match_records = df[(df['Type']==pair_type) & (df['Qty']==pair_quantity)]
        if match_records.empty:
            # if no matches are found, move on to the next row
            continue
        else:
            # if a match is found, take the first of such records
            first_match_index = match_records.index[0]
            if first_match_index not in records_to_remove:
                # store the indices in the list to remove only if they're not already present
                records_to_remove.append(i)
                records_to_remove.append(first_match_index)
                
df = df.drop(records_to_remove)

Output:

   User Type  Qty
2     a  101   10
3     b  101   30

See if this works for you!
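
The loop in this answer searches the whole dataframe for a mirror row, so it can pair rows that belong to different users, and it gives up if the first match was already used. A hedged variant, assuming pairs should only net off within the same 'User' and that the index is unique (the function name drop_netted_pairs is hypothetical):

```python
import pandas as pd

def drop_netted_pairs(df, pair=None):
    """Drop 101/102 rows of the same User whose Qty values cancel."""
    pair = pair or {'101': '102', '102': '101'}
    removed = set()
    for i, row in df.iterrows():
        if i in removed:
            continue
        # candidates: same user, mirrored Type, opposite Qty, not yet used
        candidates = df[
            (df['User'] == row['User'])
            & (df['Type'] == pair[row['Type']])
            & (df['Qty'] == -row['Qty'])
            & ~df.index.isin(list(removed))
        ]
        if not candidates.empty:
            removed.update([i, candidates.index[0]])
    return df.drop(index=sorted(removed))

df = pd.DataFrame({'User': ['a', 'a', 'a', 'b', 'b', 'b'],
                   'Type': ['101', '102', '101', '101', '101', '102'],
                   'Qty': [10, -10, 10, 30, 5, -5]})
result = drop_netted_pairs(df)
print(result)
```

On the question's data this keeps the rows with index 2 and 3, the same as the output shown in the answer.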

3 answers

solution1
3 ACCEPTED 2020-07-02 06:25:49

solution2
2 2020-07-02 09:18:53

solution3
2 2020-07-02 09:22:31
