I have two pandas DataFrames, A and B, where B is a subset of A. I want to delete numbers from A when they occur in B, but only as many times as they occur in B: if a number occurs twice in A and once in B, only one occurrence should be deleted from A.
Here are my sample data sets:

```
df_A    df_B
[Test]  [Test]
1       1
2       2
3       5
2       5
4
5
5
```
After the operation I want a new DataFrame C:

```
df_C
[Test]
3
2
4
```

Can you please suggest how to do that?
The suggested duplicate removes all occurrences from A if they are present in B, not just the first N occurrences.
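To illustrate that point, a plain `isin`-based filter (the approach in the suggested duplicate) removes every occurrence, which is why it does not solve this problem:

```python
import pandas as pd

dfA = pd.DataFrame({'Test': [1, 2, 3, 2, 4, 5, 5]})
dfB = pd.DataFrame({'Test': [1, 2, 5, 5]})

# Plain isin drops *both* 2s, although B contains 2 only once.
naive = dfA[~dfA['Test'].isin(dfB['Test'])]
print(naive['Test'].tolist())  # [3, 4] -- the second 2 is gone too
```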
If you create intermediate values that pair each element with its occurrence number, then you can use pandas.Series.isin() to build a boolean index of which rows of the DataFrame to exclude:
```python
import pandas as pd
from collections import Counter

def occurrences_number(column):
    def occurrence_number(value, accumulator):
        """Return a tuple of the value and its occurrence number."""
        accumulator[value] += 1
        return value, accumulator[value]
    occurrences = Counter()
    return column.apply(lambda x: occurrence_number(x, occurrences))

def find_not_in_by_occurrence_number(data, not_in):
    not_in_indices = ~occurrences_number(data).isin(occurrences_number(not_in))
    return data[not_in_indices].reset_index()

dfA = pd.DataFrame({'A': [1, 2, 3, 2, 4, 5, 5]})
dfB = pd.DataFrame({'B': [1, 2, 5, 5]})
print(dfA)
print(dfB)

dfC = find_not_in_by_occurrence_number(dfA.A, dfB.B).A
print(dfC)
```
```
   A
0  1
1  2
2  3
3  2
4  4
5  5
6  5
   B
0  1
1  2
2  5
3  5
0    3
1    2
2    4
Name: A, dtype: int64
```
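To see why tagging each value with its occurrence number works, here is a minimal self-contained sketch of the same idea in plain Python (names here are illustrative, not from the answer above):

```python
from collections import Counter

def tag_occurrences(values):
    """Pair each value with its running occurrence number: 5, 5 -> (5, 1), (5, 2)."""
    seen = Counter()
    tagged = []
    for v in values:
        seen[v] += 1
        tagged.append((v, seen[v]))
    return tagged

a = tag_occurrences([1, 2, 3, 2, 4, 5, 5])  # [(1,1), (2,1), (3,1), (2,2), (4,1), (5,1), (5,2)]
b = set(tag_occurrences([1, 2, 5, 5]))      # {(1,1), (2,1), (5,1), (5,2)}

# The second 2 is tagged (2, 2), which is not in b, so it survives.
kept = [v for v, n in a if (v, n) not in b]
print(kept)  # [3, 2, 4]
```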
I might do something like this (stealing SR's setup):
```python
import pandas as pd

dfA = pd.DataFrame({'A': [1, 2, 3, 2, 4, 5, 5]})
dfB = pd.DataFrame({'B': [1, 2, 5, 5]})

counts = dfA.groupby('A').cumcount()
limits = dfB['B'].value_counts().reindex(dfA.A).fillna(0).values
dfC = dfA.loc[counts >= limits]
```
which gives me

```
In [121]: dfC
Out[121]:
   A
2  3
3  2
4  4
```
This works by using groupby to get the number of times a given value in A has been seen before:
```
In [124]: dfA.groupby('A').cumcount()
Out[124]:
0    0
1    0
2    0
3    1
4    0
5    0
6    1
dtype: int64
```
and using value_counts to get the limits, which we then reindex to match the counts:
```
In [139]: dfB['B'].value_counts()
Out[139]:
5    2
2    1
1    1
Name: B, dtype: int64

In [140]: dfB['B'].value_counts().reindex(dfA.A)
Out[140]:
A
1    1.0
2    1.0
3    NaN
2    1.0
4    NaN
5    2.0
5    2.0
Name: B, dtype: float64
```
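Putting the two pieces together: after filling the NaNs with 0, the boolean mask is just the elementwise comparison of the cumulative counts against the limits, keeping each row whose occurrence number is past its quota in B:

```python
import pandas as pd

dfA = pd.DataFrame({'A': [1, 2, 3, 2, 4, 5, 5]})
dfB = pd.DataFrame({'B': [1, 2, 5, 5]})

counts = dfA.groupby('A').cumcount()                              # 0 0 0 1 0 0 1
limits = dfB['B'].value_counts().reindex(dfA.A).fillna(0).values  # 1 1 0 1 0 2 2
mask = counts >= limits                                           # F F T T T F F
print(dfA.loc[mask, 'A'].tolist())  # [3, 2, 4]
```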
For this problem, you can make use of Counter and drop. For drop, you need to know the indexes of the rows you want to discard.
```python
import itertools
from collections import Counter
import pandas as pd

df = pd.DataFrame({'Test': {0: 1, 1: 2, 2: 3, 3: 2, 4: 4, 5: 5, 6: 5}})
df2 = pd.DataFrame({'Test': {0: 1, 1: 2, 2: 5, 3: 5}})

c_df2 = Counter(df2.Test)
indexes_to_remove_2d = [df.index[df['Test'] == k].tolist()[:v]
                        for k, v in c_df2.items()]
# [[0], [1], [5, 6]]
merged = list(itertools.chain(*indexes_to_remove_2d))
# [0, 1, 5, 6]
df.drop(merged)
```
```
   Test
2     3
3     2
4     4
```
indexes_to_remove_2d collects, for each value k in the counter, the indexes of rows matching that value with df.index[df['Test'] == k], using [:v] to limit how many indexes we take. Then we merge these index lists with itertools.chain, and finally drop the rows with those indexes.
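As a side note (not part of the original answer), Counter also supports multiset subtraction directly, which yields the same per-value counts in one step, though it loses the original row order, which is why the index-based drop above is needed when order matters:

```python
from collections import Counter

# Multiset difference: {1:1, 2:2, 3:1, 4:1, 5:2} - {1:1, 2:1, 5:2} = {2:1, 3:1, 4:1}
remaining = Counter([1, 2, 3, 2, 4, 5, 5]) - Counter([1, 2, 5, 5])
print(sorted(remaining.elements()))  # [2, 3, 4]
```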
Thanks to Stephan Ranch for pointing out the order problem.