remove part of URL string with regex in column of pandas dataframe

I need to clean up some URLs to remove the unique tracking codes so that in reporting they can be counted as a group rather than as thousands of individual pages.

The code to remove is in the middle of the URL and varies in length.

An example URL is

https://www.website.co.uk/product/?commcodeABBB/home-page/

I am trying to get this

https://www.website.co.uk/product/home-page/

I have similar code working for removing the end of a url string:

df["URL"] = df["URL"].str.replace('\/id.*','/',regex=True)

I have tried to modify it for my new scenario.

df["URL"] = df["URL"].str.replace('\/\?commcode.{0,5}','/',regex=True)

In this scenario the regex \/\?commcode.{0,5} does select /?commcodeABBB/; however, the length of the code string in my URLs varies, so it won't work on everything.

I cannot work out how to write it so that it takes everything from ?commcode up to and including the next /. I looked at \w and \W for the 'in-between' part; however, \w doesn't match /, only alphanumeric characters.

I have read many, many other posts about similar issues, but nothing I can find quite addresses this. I cannot use code that counts from the start or end of the string, as the length changes, as does the number of / in the URL, so I cannot use a 'between the 2nd and 3rd /' method.

Any ideas please?

Use

df["URL"] = df["URL"].str.replace(r'/\?commcode[^/]*', '', regex=True)

Explanation

--------------------------------------------------------------------------------
  /                        '/'
--------------------------------------------------------------------------------
  \?                       '?'
--------------------------------------------------------------------------------
  commcode                 'commcode'
--------------------------------------------------------------------------------
  [^/]*                    any character except: '/' (0 or more times
                           (matching the most amount possible))
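As a quick check of the pattern explained above, here is a minimal sketch applied to a small DataFrame. Only the first URL comes from the question; the other two rows are made-up sample data. regex=True is passed explicitly because pandas 2.0 and later treat the pattern as a literal string by default.

```python
import pandas as pd

# Sample data: the first URL is from the question, the others are hypothetical.
df = pd.DataFrame({
    "URL": [
        "https://www.website.co.uk/product/?commcodeABBB/home-page/",
        "https://www.website.co.uk/product/?commcodeXYZ12345/a/b/",
        "https://www.website.co.uk/product/home-page/",  # no tracking code
    ]
})

# Remove "/?commcode" plus everything up to (but not including) the next "/".
# The trailing "/" is left in place, so the replacement is an empty string.
df["URL"] = df["URL"].str.replace(r"/\?commcode[^/]*", "", regex=True)

print(df["URL"].tolist())
```

Because [^/]* stops at the next slash, the length of the code does not matter, and URLs without a tracking code are left unchanged.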

You can do:

'\/\?commcode[A-Za-z0-9]*'

to specify exactly which characters are allowed in the code. The match stops at the first character outside that set.
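The two patterns behave the same on the example URL but can differ when a code contains other characters. The sketch below compares them with Python's re module; the hyphenated code is a hypothetical case, not something stated in the question.

```python
import re

# Negated class: everything up to the next "/".
NEGATED = re.compile(r"/\?commcode[^/]*")
# Explicit class: only ASCII letters and digits.
ALNUM = re.compile(r"/\?commcode[A-Za-z0-9]*")

plain = "https://www.website.co.uk/product/?commcodeABBB/home-page/"
# Hypothetical URL whose tracking code contains a hyphen.
hyphenated = "https://www.website.co.uk/product/?commcodeAB-12/home-page/"

plain_negated = NEGATED.sub("", plain)
plain_alnum = ALNUM.sub("", plain)
hyph_negated = NEGATED.sub("", hyphenated)
hyph_alnum = ALNUM.sub("", hyphenated)  # stops at "-", leaving "-12" behind

for result in (plain_negated, plain_alnum, hyph_negated, hyph_alnum):
    print(result)
```

If the codes are known to be purely alphanumeric, either pattern works; if not, the negated class is the safer choice.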
