
Delete rows of a data frame based on quantity of data in another data frame

I have two pandas data frames, A and B. B is a subset of A.

I want to delete every number from A that also appears in B, but only as many times as it appears in B: if a number occurs twice in A and once in B, then only one occurrence should be deleted from A.

Here are my sample data sets:

df_A                df_B
[Test]              [Test]
1                   1
2                   2         
3                   5 
2                   5 
4
5
5

After the operation, I want a new data frame C:

df_C
[Test]
3
2
4

Can you please suggest how to do that?

The suggested duplicate removes all occurrences from A if present in B, not just the first N occurrences.

If you create intermediate values that pair each value with its occurrence number, you can use pandas.Series.isin() to build a logical index of which rows of the dataframe to exclude:

Code:

from collections import Counter

def occurrences_number(column):

    def occurrence_number(value, accumulator):
        """ tuple of value and occurrence number of value """
        accumulator[value] += 1
        return value, accumulator[value]

    occurrences = Counter()
    return column.apply(lambda x: occurrence_number(x, occurrences))

def find_not_in_by_occurrence_number(data, not_in):
    # exclude rows whose (value, occurrence-number) tuple also appears in not_in
    not_in_indices = ~occurrences_number(data).isin(occurrences_number(not_in))
    return data[not_in_indices].reset_index()
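To see what the occurrence-number transform actually produces, here is a quick check (not part of the original answer) on the sample values from A below; each element becomes a (value, occurrence-number) tuple, which is what makes isin() match only the first N copies:

```python
import pandas as pd
from collections import Counter

def occurrences_number(column):
    def occurrence_number(value, accumulator):
        # pair each value with how many times it has been seen so far
        accumulator[value] += 1
        return value, accumulator[value]

    occurrences = Counter()
    return column.apply(lambda x: occurrence_number(x, occurrences))

s = pd.Series([1, 2, 3, 2, 4, 5, 5])
print(occurrences_number(s).tolist())
# [(1, 1), (2, 1), (3, 1), (2, 2), (4, 1), (5, 1), (5, 2)]
```

Because the second 2 becomes (2, 2) while B only contains (2, 1), that row survives the isin() exclusion.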

Test Code:

import pandas as pd

dfA = pd.DataFrame({'A': [1, 2, 3, 2, 4, 5, 5]})
dfB = pd.DataFrame({'B': [1, 2, 5, 5]})
print(dfA)
print(dfB)

dfC = find_not_in_by_occurrence_number(dfA.A, dfB.B).A

print(dfC)

Results:

   A
0  1
1  2
2  3
3  2
4  4
5  5
6  5

   B
0  1
1  2
2  5
3  5

0    3
1    2
2    4
Name: A, dtype: int64

I might do something like this (stealing SR's setup):

dfA = pd.DataFrame({'A': [1, 2, 3, 2, 4, 5, 5]})
dfB = pd.DataFrame({'B': [1, 2, 5, 5]})

counts = dfA.groupby('A').cumcount()  # 0-based occurrence number of each value within A
limits = dfB['B'].value_counts().reindex(dfA.A).fillna(0).values  # copies of each value that B consumes
dfC = dfA.loc[counts >= limits]

which gives me

In [121]: dfC
Out[121]: 
   A
2  3
3  2
4  4

This works by using groupby to get the number of times a given value in A has been seen before:

In [124]: dfA.groupby('A').cumcount()
Out[124]: 
0    0
1    0
2    0
3    1
4    0
5    0
6    1
dtype: int64

and using value_counts to get the limits, which we then reindex to match the counts:

In [139]: dfB['B'].value_counts()
Out[139]: 
5    2
2    1
1    1
Name: B, dtype: int64

In [140]: dfB['B'].value_counts().reindex(dfA.A)
Out[140]: 
A
1    1.0
2    1.0
3    NaN
2    1.0
4    NaN
5    2.0
5    2.0
Name: B, dtype: float64
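Putting the two pieces together (this combined check is my addition, reusing the setup above): a row is kept only once its occurrence number meets or exceeds the number of copies that B consumes, which the NaN-to-0 fill handles for values absent from B:

```python
import pandas as pd

dfA = pd.DataFrame({'A': [1, 2, 3, 2, 4, 5, 5]})
dfB = pd.DataFrame({'B': [1, 2, 5, 5]})

counts = dfA.groupby('A').cumcount()
limits = dfB['B'].value_counts().reindex(dfA.A).fillna(0).values

# keep a row only once its occurrence number reaches B's count for that value
mask = counts >= limits
print(mask.tolist())
# [False, False, True, True, True, False, False]
print(dfA.loc[mask]['A'].tolist())
# [3, 2, 4]
```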

In this problem, you can make use of Counter and drop. For drop, you will need to know the indexes of the rows you want to discard.

import itertools
from collections import Counter

import pandas as pd

df = pd.DataFrame({'Test': {0: 1, 1: 2, 2: 3, 3: 2, 4: 4, 5: 5, 6: 5}})
df2 = pd.DataFrame({'Test': {0: 1, 1: 2, 2: 5, 3: 5}})
c_df2 = Counter(df2.Test)

indexes_to_remove_2d = [df.index[df['Test'] == k].tolist()[:v] 
                        for k, v in c_df2.items()]
# [[0], [1], [5, 6]]
merged = list(itertools.chain(*indexes_to_remove_2d))
# [0, 1, 5, 6]
df.drop(merged)

    Test
2   3
3   2
4   4

The list comprehension builds indexes_to_remove_2d: for each value k in the counter, df.index[df['Test'] == k] finds the indexes of the rows holding that value, and the [:v] slice limits how many of them we take.

Then we flatten these index lists with itertools.chain, and finally drop the rows at those indexes.
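The flattening step can be checked in isolation; this small demo (my addition) uses the intermediate list shown in the comments above:

```python
import itertools

# per-value index lists, limited to B's counts: value 1 -> [0], 2 -> [1], 5 -> [5, 6]
indexes_to_remove_2d = [[0], [1], [5, 6]]

# chain(*lists) concatenates the sublists into one flat sequence
merged = list(itertools.chain(*indexes_to_remove_2d))
print(merged)
# [0, 1, 5, 6]
```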

Thanks to Stephan Ranch for pointing out the order problem.
