I have a dataframe:
Name url
A 'https://foo.com, https://www.bar.org, https://goo.com'
B 'https://foo.com, https://www.bar.org, https://www.goo.com'
C 'https://foo.com, https://www.bar.org, https://goo.com'
and then a keyword list:
keyword_list = ['foo','bar']
I'm trying to remove the URLs that contain any of the keywords while keeping the ones that don't. So far this is the only thing that has worked for me, but it only removes the keyword itself rather than the whole URL:
df['url'] = df['url'].str.replace('|'.join(keyword_list), ' ', regex=True)
I've also tried converting each string to a list, but I get an indexing error when combining the result back into the larger dataframe it's part of. Has anyone run into this before?
Desired output:
Name url
A 'https://goo.com'
B 'https://www.goo.com'
C 'https://goo.com'
You can probably do this with a regex, but you can also split each string into separate URLs, filter them, and rebuild the frame:
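# Split each string into one URL per row, indexed by (Name, position)
new_df = df.set_index('Name').url.str.split(r',\s+', expand=True).stack()
# Keep only the URLs that contain none of the keywords
(new_df[~new_df.str.contains('|'.join(keyword_list))]
    .reset_index(level=1, drop=True)
    .to_frame(name='url')
    .reset_index()
)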
Output:
Name url
0 A https://goo.com
1 B https://www.goo.com
2 C https://goo.com
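Note that the approach above yields one row per surviving URL, so a Name with several non-matching URLs would appear more than once. If you would rather keep one row per Name, here is a minimal sketch of a per-cell filter. It assumes the URLs are separated by commas and that plain substring matching on the keywords is good enough; the helper name drop_matching_urls is made up for illustration.

```python
import pandas as pd

# Sample data rebuilt from the question
df = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "url": [
        "https://foo.com, https://www.bar.org, https://goo.com",
        "https://foo.com, https://www.bar.org, https://www.goo.com",
        "https://foo.com, https://www.bar.org, https://goo.com",
    ],
})
keyword_list = ["foo", "bar"]


def drop_matching_urls(cell, keywords):
    """Hypothetical helper: remove every URL in `cell` that contains a keyword."""
    urls = [u.strip() for u in cell.split(",")]
    kept = [u for u in urls if not any(k in u for k in keywords)]
    # Join the survivors back into a single comma-separated string
    return ", ".join(kept)


# Filter each cell in place; the index and the other columns are untouched
df["url"] = df["url"].apply(lambda cell: drop_matching_urls(cell, keyword_list))
print(df)
```

Because the column is rewritten in place, nothing has to be merged back into the larger dataframe, which should sidestep the indexing error mentioned in the question. A row whose URLs all match ends up with an empty string.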