I have the DataFrame and dict below. It doesn't necessarily have to be a dictionary, but the key/value pairs belong together.
I'd like to drop the rows where 'company' matches a key of removal_dict and, as a second condition on that same row, the value in 'astring' contains the string that is the value of that particular key. It does not have to be a 1:1 match; 'astring' only has to contain that string.
import pandas as pd

df = pd.DataFrame({'ID': ['A', 'B', 'C', 'D', 'E', 'F'],
                   'company': ['BRAMSUNG', 'BRAMSUNG', 'VRENOVO',
                               'WRAPPLE', 'PIRCOSOFT', 'PIRCOSOFT'],
                   'astring': ['BRAMSUNG MAINSTREET SEOUL',
                               'BRAMSUNG SUBSTREET SEOUL',
                               'LOOKING FOR VRENOVO IN BRAMSUNG MAINSTREET',
                               'I GO FOR WRAPPLE IN BRAMSUNG MAINSTREET',
                               'PIRCOSOFT ACCOUNT NR. 1222',
                               'DEPOSIT TO PIRCOSOFT ACCOUNT NOW']})
removal_dict = {'BRAMSUNG': 'BRAMSUNG MAINSTREET',
                'PIRCOSOFT': 'PIRCOSOFT ACCOUNT NR.',
                'VRENOVO': 'LOOKING FOR VRENOVO'}
>>> df
ID company astring
0 A BRAMSUNG BRAMSUNG MAINSTREET SEOUL
1 B BRAMSUNG BRAMSUNG SUBSTREET SEOUL
2 C VRENOVO LOOKING FOR VRENOVO IN BRAMSUNG MAINSTREET
3 D WRAPPLE I GO FOR WRAPPLE IN BRAMSUNG MAINSTREET
4 E PIRCOSOFT PIRCOSOFT ACCOUNT NR. 1222
5 F PIRCOSOFT DEPOSIT TO PIRCOSOFT ACCOUNT NOW
Thus, IDs A, C and E should be dropped.
Example: ID A must be dropped because removal_dict has the key BRAMSUNG and A's 'astring' contains that key's value, BRAMSUNG MAINSTREET. ID B, on the other hand, must not be dropped: only the key matches; its 'astring' does not contain the value BRAMSUNG MAINSTREET.
Expected result should be:
>>> df
ID company astring
1 B BRAMSUNG BRAMSUNG SUBSTREET SEOUL
3 D WRAPPLE I GO FOR WRAPPLE IN BRAMSUNG MAINSTREET
5 F PIRCOSOFT DEPOSIT TO PIRCOSOFT ACCOUNT NOW
Create a data frame containing all of the string-start possibilities. (Note that this pairs every astring with every removal string, ignoring the company column; it works on this data because each removal string is specific enough not to match another company's rows.)
temp_df = df['astring'].to_frame().merge(
    pd.Series(removal_dict.values(), name='contains'), how='cross')
Then create a data frame containing the entries that match your removal rule.
removals = temp_df[temp_df.apply(
    lambda r: r['astring'].startswith(r['contains']), axis=1)]
I can't think of an obvious way to avoid the apply loop for str.startswith. Using pd.Series.str.startswith does not help, because it tests a Series against a single fixed string rather than working element-wise across two columns of strings.
Regardless, just subset:
>>> df[~df['astring'].isin(removals['astring'])]
ID company astring
1 B BRAMSUNG BRAMSUNG SUBSTREET SEOUL
3 D WRAPPLE I GO FOR WRAPPLE IN BRAMSUNG MAINSTREET
5 F PIRCOSOFT DEPOSIT TO PIRCOSOFT ACCOUNT NOW
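For what it's worth, the apply call in this answer can be swapped for a plain list comprehension over the two columns. A self-contained sketch of the same pipeline (how='cross' needs pandas >= 1.2):

```python
import pandas as pd

df = pd.DataFrame({'ID': ['A', 'B', 'C', 'D', 'E', 'F'],
                   'company': ['BRAMSUNG', 'BRAMSUNG', 'VRENOVO',
                               'WRAPPLE', 'PIRCOSOFT', 'PIRCOSOFT'],
                   'astring': ['BRAMSUNG MAINSTREET SEOUL',
                               'BRAMSUNG SUBSTREET SEOUL',
                               'LOOKING FOR VRENOVO IN BRAMSUNG MAINSTREET',
                               'I GO FOR WRAPPLE IN BRAMSUNG MAINSTREET',
                               'PIRCOSOFT ACCOUNT NR. 1222',
                               'DEPOSIT TO PIRCOSOFT ACCOUNT NOW']})
removal_dict = {'BRAMSUNG': 'BRAMSUNG MAINSTREET',
                'PIRCOSOFT': 'PIRCOSOFT ACCOUNT NR.',
                'VRENOVO': 'LOOKING FOR VRENOVO'}

# Cross-join every astring with every removal string, as above.
temp_df = df['astring'].to_frame().merge(
    pd.Series(removal_dict.values(), name='contains'), how='cross')

# List comprehension instead of DataFrame.apply for the startswith test.
hits = [a.startswith(c)
        for a, c in zip(temp_df['astring'], temp_df['contains'])]
removals = temp_df[hits]

result = df[~df['astring'].isin(removals['astring'])]
print(result)
```

The comprehension still iterates, of course; it just skips the per-row Series construction that DataFrame.apply performs, which usually makes it noticeably faster.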
Seems to me that you'll need explicit iteration at some point in this solution, either via .apply(..., axis=1) or a plain Python loop.
row-wise apply solution
def check_removal(row, removal_dict):
    """Return True when the aligned removal string is in row["astring"]."""
    removal_string = removal_dict.get(row["company"])
    if removal_string is None:
        return False
    return removal_string in row["astring"]

mask = df.apply(check_removal, removal_dict=removal_dict, axis=1)
new_df = df.loc[~mask]
print(new_df)
ID company astring
1 B BRAMSUNG BRAMSUNG SUBSTREET SEOUL
3 D WRAPPLE I GO FOR WRAPPLE IN BRAMSUNG MAINSTREET
5 F PIRCOSOFT DEPOSIT TO PIRCOSOFT ACCOUNT NOW
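The same rule can also be written without a custom function by first mapping removal_dict onto the company column. A sketch on the same data (the zip comprehension still iterates row by row, just without the apply machinery):

```python
import pandas as pd

df = pd.DataFrame({'ID': ['A', 'B', 'C', 'D', 'E', 'F'],
                   'company': ['BRAMSUNG', 'BRAMSUNG', 'VRENOVO',
                               'WRAPPLE', 'PIRCOSOFT', 'PIRCOSOFT'],
                   'astring': ['BRAMSUNG MAINSTREET SEOUL',
                               'BRAMSUNG SUBSTREET SEOUL',
                               'LOOKING FOR VRENOVO IN BRAMSUNG MAINSTREET',
                               'I GO FOR WRAPPLE IN BRAMSUNG MAINSTREET',
                               'PIRCOSOFT ACCOUNT NR. 1222',
                               'DEPOSIT TO PIRCOSOFT ACCOUNT NOW']})
removal_dict = {'BRAMSUNG': 'BRAMSUNG MAINSTREET',
                'PIRCOSOFT': 'PIRCOSOFT ACCOUNT NR.',
                'VRENOVO': 'LOOKING FOR VRENOVO'}

# Align each row's removal string with its company; companies with
# no entry in removal_dict map to NaN.
targets = df['company'].map(removal_dict)

# A row is dropped when its aligned removal string is a substring of
# astring; NaN targets never match.
mask = pd.Series([isinstance(t, str) and t in a
                  for t, a in zip(targets, df['astring'])],
                 index=df.index)
new_df = df.loc[~mask]
print(new_df)
```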
split-apply-combine (groupby) solution
If you have a small number of large groups (companies), this solution should be faster than a row-wise apply. If you have a large number of small groups, expect performance similar to (or worse than) the row-wise apply.
import numpy as np

pieces = []
for group_name, group_data in df.groupby("company"):
    # if this group is not in removal_dict, keep everything
    if group_name not in removal_dict:
        mask = np.ones(group_data.shape[0], dtype=bool)
    # if this group_name is in removal_dict, compare with str.contains
    # (regex=False: the removal strings contain ".", a regex metacharacter)
    else:
        mask = ~group_data["astring"].str.contains(removal_dict[group_name],
                                                   regex=False)
    matches = group_data.loc[mask]
    if not matches.empty:
        pieces.append(matches)
new_df = pd.concat(pieces).sort_index()
print(new_df)
ID company astring
1 B BRAMSUNG BRAMSUNG SUBSTREET SEOUL
3 D WRAPPLE I GO FOR WRAPPLE IN BRAMSUNG MAINSTREET
5 F PIRCOSOFT DEPOSIT TO PIRCOSOFT ACCOUNT NOW
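A related variant loops over removal_dict instead of over the groups, which keeps the per-row work vectorized and is cheap when the dict is small. A sketch (again with regex=False, since 'NR.' contains a regex metacharacter):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': ['A', 'B', 'C', 'D', 'E', 'F'],
                   'company': ['BRAMSUNG', 'BRAMSUNG', 'VRENOVO',
                               'WRAPPLE', 'PIRCOSOFT', 'PIRCOSOFT'],
                   'astring': ['BRAMSUNG MAINSTREET SEOUL',
                               'BRAMSUNG SUBSTREET SEOUL',
                               'LOOKING FOR VRENOVO IN BRAMSUNG MAINSTREET',
                               'I GO FOR WRAPPLE IN BRAMSUNG MAINSTREET',
                               'PIRCOSOFT ACCOUNT NR. 1222',
                               'DEPOSIT TO PIRCOSOFT ACCOUNT NOW']})
removal_dict = {'BRAMSUNG': 'BRAMSUNG MAINSTREET',
                'PIRCOSOFT': 'PIRCOSOFT ACCOUNT NR.',
                'VRENOVO': 'LOOKING FOR VRENOVO'}

# OR together one vectorized test per (company, removal string) pair.
mask = np.zeros(len(df), dtype=bool)
for comp, needle in removal_dict.items():
    mask |= (df['company'].eq(comp)
             & df['astring'].str.contains(needle, regex=False)).to_numpy()

new_df = df[~mask]
print(new_df)
```

This does one pass over each column per dict entry, so it favors the common case of a handful of removal rules over millions of rows.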