
Remove a SPECIFIC url from a string in a pandas dataframe

I have a dataframe:

Name  url

 A    'https://foo.com, https://www.bar.org, https://goo.com'
 B    'https://foo.com, https://www.bar.org, https://www.goo.com'
 C    'https://foo.com, https://www.bar.org, https://goo.com'

and then a keyword list:

keyword_list = ['foo','bar']

I'm trying to remove the URLs that contain the keywords while keeping the ones that don't. So far this is the only thing that has worked for me, but it removes only the matched keyword rather than the whole URL:

df['url'] = df['url'].str.replace('|'.join(keyword_list), ' ', regex=True)

I've tried converting each string to a list of URLs, but I get an indexing error when combining it back with the larger dataframe it's part of. Has anyone run into this before?

Desired output:

Name  url

 A    'https://goo.com'
 B    'https://www.goo.com'
 C    'https://goo.com'

I'm pretty sure you can do this with a regex, but you can also split the URLs into one row each and filter them:

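# Split each cell into one URL per row, indexed by (Name, position).
new_df = df.set_index('Name').url.str.split(r',\s+', expand=True).stack()

# Keep only the URLs that contain none of the keywords, then flatten back to a frame.
(new_df[~new_df.str.contains('|'.join(keyword_list))]
      .reset_index(level=1, drop=True)
      .to_frame(name='url')
      .reset_index()
)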

Output:

  Name                  url
0    A      https://goo.com
1    B  https://www.goo.com
2    C      https://goo.com
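
If the result has to keep exactly one row per Name (for example when more than one URL survives the filter, or when the column is assigned back into a larger dataframe), a row-wise variant may be easier to align. The following is only a minimal sketch, assuming the sample data from the question: drop_matching_urls is a hypothetical helper name, not part of pandas, and it uses plain substring matching instead of a regex.

```python
import pandas as pd

# Sample data reconstructed from the question.
df = pd.DataFrame({
    'Name': ['A', 'B', 'C'],
    'url': [
        'https://foo.com, https://www.bar.org, https://goo.com',
        'https://foo.com, https://www.bar.org, https://www.goo.com',
        'https://foo.com, https://www.bar.org, https://goo.com',
    ],
})
keyword_list = ['foo', 'bar']


def drop_matching_urls(cell, keywords):
    """Return the comma-separated URLs in `cell` that contain none of `keywords`."""
    # Split on commas and trim the whitespace around each URL.
    urls = [u.strip() for u in cell.split(',')]
    # Keep a URL only if no keyword occurs in it as a substring.
    kept = [u for u in urls if not any(k in u for k in keywords)]
    # Join the survivors back into a single string for the cell.
    return ', '.join(kept)


# apply() returns a Series with the same index as df, so the assignment aligns row by row.
df['url'] = df['url'].apply(drop_matching_urls, args=(keyword_list,))
print(df)
```

Because the filtered column keeps the original index, this should avoid the indexing error mentioned in the question. A row whose URLs all match a keyword ends up with an empty string rather than being dropped.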
