简体   繁体   中英

Pandas: Drop rows conditionally when 1 column matches key and another column contains value

I have the below DataFrame and a Dict. It doesn't necessary to be a dictionary but the key value pairs belong together.

Now the thing is, that I'd like to drop the rows where 'company' matches the key of 'removal_dict'. And as a second condition for that same row the value in 'astring' must contain the string which is the value of that particular key. The value does not have to be a 1:1 match, it only has to contain that string.

df = pd.DataFrame({'ID': ['A', 'B', 'C', 'D', 'E', 'F'], 
        'company': ['BRAMSUNG', 'BRAMSUNG', 'VRENOVO', 'WRAPPLE', 'PIRCOSOFT', 'PIRCOSOFT'],
        'astring': ['BRAMSUNG MAINSTREET SEOUL', 'BRAMSUNG SUBSTREET SEOUL', 'LOOKING FOR VRENOVO IN BRAMSUNG MAINSTREET', 'I GO FOR WRAPPLE IN BRAMSUNG MAINSTREET', 'PIRCOSOFT ACCOUNT NR. 1222', 'DEPOSIT TO PIRCOSOFT ACCOUNT NOW']
       })

removal_dict = {'BRAMSUNG': 'BRAMSUNG MAINSTREET',
                'PIRCOSOFT': 'PIRCOSOFT ACCOUNT NR.',
                'VRENOVO': 'LOOKING FOR VRENOVO'
                }
>>> df
                 
    ID  company    astring             
0   A   BRAMSUNG   BRAMSUNG MAINSTREET SEOUL
1   B   BRAMSUNG   BRAMSUNG SUBSTREET SEOUL
2   C   VRENOVO    LOOKING FOR VRENOVO IN BRAMSUNG MAINSTREET
3   D   WRAPPLE    I GO FOR WRAPPLE IN BRAMSUNG MAINSTREET
4   E   PIRCOSOFT  PIRCOSOFT ACCOUNT NR. 1222
5   F   PIRCOSOFT  DEPOSIT TO PIRCOSOFT ACCOUNT NOW

Thus, ID's A, C and E should be dropped.

Example: ID A must be dropped because there is a key BRAMSUNG and a value BRAMSUNG MAINSTREET in removal_dict. On the other hand ID B mustn't be dropped because there is only the key matching, but no value.

Expected result should be:

>>> df
                 
    ID  company    astring             
1   B   BRAMSUNG   BRAMSUNG SUBSTREET SEOUL
3   D   WRAPPLE    I GO FOR WRAPPLE IN BRAMSUNG MAINSTREET
5   F   PIRCOSOFT  DEPOSIT TO PIRCOSOFT ACCOUNT NOW

Create a data frame containing all of the string start possibilities:

temp_df = df['astring'].to_frame() \
    .merge(pd.Series(removal_dict.values(), name='contains'), how='cross')

Then create a data frame containing the entries that match your removal rule.

removals = temp_df[
    temp_df.apply(lambda r: r['astring'].startswith(r['contains']), axis=1)]

I can't think of an obvious way to avoid the apply loop for str.startswith . Using pd.Series.str.startswith does not help because it is meant to apply against a single string, rather apply element-wise against two columns of strings.

Regardless, just subset:

>>> df[~df['astring'].isin(removals['astring'])]
  ID    company                                  astring
1  B   BRAMSUNG                 BRAMSUNG SUBSTREET SEOUL
3  D    WRAPPLE  I GO FOR WRAPPLE IN BRAMSUNG MAINSTREET
5  F  PIRCOSOFT         DEPOSIT TO PIRCOSOFT ACCOUNT NOW

Seems to me that you'll need explicit iteration at some point in this solution either via .apply(..., axis=1) or some explicit iteration.

row-wise apply solution

def check_removal(row, removal_dict):
    """returns True where the aligned removal string is in df["astring"]"""
    removal_string = removal_dict.get(row["company"])
    if removal_string is None:
        return False
    return removal_string in row["astring"]

mask = df.apply(check_removal, removal_dict=removal_dict, axis=1)
new_df = df.loc[~mask]

print(new_df)
  ID    company                                  astring
1  B   BRAMSUNG                 BRAMSUNG SUBSTREET SEOUL
3  D    WRAPPLE  I GO FOR WRAPPLE IN BRAMSUNG MAINSTREET
5  F  PIRCOSOFT         DEPOSIT TO PIRCOSOFT ACCOUNT NOW

split-apply-combine (groupby) solution if you have a small number of large groups (companies), this solution should be faster than a row-wise apply.

If you have a small number of unique groups, then this performance should be similar to row-wise apply.

pieces = []
for group_name, group_data in df.groupby("company"):
    # if this group is not in removal_dict, keep everything
    if group_name not in removal_dict.keys():
        mask = np.ones(group_data.shape[0], dtype=bool)

    # if this group_name is in removal_dict, comapre w/ str.contains
    else:
        mask = ~group_data["astring"].str.contains(removal_dict[group_name])

    matches = group_data.loc[mask]
    if not matches.empty:
        pieces.append(matches)

new_df = pd.concat(pieces).sort_index()
print(new_df)
  ID    company                                  astring
1  B   BRAMSUNG                 BRAMSUNG SUBSTREET SEOUL
3  D    WRAPPLE  I GO FOR WRAPPLE IN BRAMSUNG MAINSTREET
5  F  PIRCOSOFT         DEPOSIT TO PIRCOSOFT ACCOUNT NOW

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM