Remove a SPECIFIC url from a string in a pandas dataframe

Question

I have a dataframe:

Name  url

 A    'https://foo.com, https://www.bar.org, https://goo.com'
 B    'https://foo.com, https://www.bar.org, https://www.goo.com'
 C    'https://foo.com, https://www.bar.org, https://goo.com'

and then a keyword list:

keyword_list = ['foo','bar']

I'm trying to remove the urls that contain any of the keywords while keeping the ones that don't. So far this is the only thing that has worked for me, but it only removes the keyword itself rather than the whole url:

df['url'] = df['url'].str.replace('|'.join(keyword_list), ' ', regex=True)

I've also tried converting each string to a list of urls, but I get an indexing error when combining the result back into the larger dataframe it's part of. Has anyone run into this before?

Desired output:

Name  url

 A    'https://goo.com'
 B    'https://www.goo.com'
 C    'https://goo.com'

Answer 1

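This can probably be done with a single regex, but you can also split each string into one url per row and filter out the rows that match a keyword:

# One url per row, indexed by (Name, position in the original string)
new_df = df.set_index('Name').url.str.split(r',\s+', expand=True).stack()

# Keep urls that match none of the keywords, then restore a flat Name/url frame
(new_df[~new_df.str.contains('|'.join(keyword_list))]
      .reset_index(level=1, drop=True)
      .to_frame(name='url')
      .reset_index()
)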

Output:

  Name                  url
0    A      https://goo.com
1    B  https://www.goo.com
2    C      https://goo.com
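
If you would rather keep one row per Name (which may also sidestep the indexing error mentioned in the question, since the result keeps the original index), here is a minimal sketch that splits each cell, drops the matching urls, and joins the rest back together. The helper name drop_matching_urls is made up for illustration, and the sample frame is rebuilt from the question; re.escape is used on the assumption that the keywords are meant as literal text rather than regex patterns.

```python
import re

import pandas as pd

# Sample data rebuilt from the question
df = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "url": [
        "https://foo.com, https://www.bar.org, https://goo.com",
        "https://foo.com, https://www.bar.org, https://www.goo.com",
        "https://foo.com, https://www.bar.org, https://goo.com",
    ],
})
keyword_list = ["foo", "bar"]

# Escape the keywords so they are matched literally (an assumption)
pattern = re.compile("|".join(re.escape(k) for k in keyword_list))


def drop_matching_urls(cell):
    """Hypothetical helper: return the cell without urls containing a keyword."""
    urls = [u.strip() for u in cell.split(",")]
    kept = [u for u in urls if not pattern.search(u)]
    return ", ".join(kept)


# apply() returns a Series with the same index, so assigning it back is safe
df["url"] = df["url"].apply(drop_matching_urls)
print(df)
```

If several urls survive for a given Name, they stay in the same cell separated by ", ", whereas the stack-based version above puts each surviving url on its own row.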

solution1
0 2019-06-11 19:44:50
