
Remove part of a URL string with regex in a column of a pandas DataFrame

I need to clean up some URLs to remove the unique tracking codes so that, in reporting, they can be counted as a group rather than as thousands of individual pages.

The code to remove is in the middle of the URL and varies in length.

An example URL is:

https://www.website.co.uk/product/?commcodeABBB/home-page/

I am trying to get this:

https://www.website.co.uk/product/home-page/

I have similar code working for removing the end of a URL string:

df["URL"] = df["URL"].str.replace(r'\/id.*', '/', regex=True)

I have tried to modify it for my new scenario:

df["URL"] = df["URL"].str.replace(r'\/\?commcode.{0,5}', '/', regex=True)

In this scenario the regex \/\?commcode.{0,5} does select /?commcodeABBB/; however, the length of the code string in my URLs varies, so it won't work on everything.

I cannot work out how to write it so that it takes everything from ?commcode up to and including the next /. I looked at \w and \W for the 'in-between' part; however, \w doesn't match /, only alphanumeric characters.

I have read many other posts about similar issues, but nothing I can find quite addresses this. I cannot use code that counts from the start or end of the string, because the length changes, as does the number of / in the URL, so I cannot use a 'between the 2nd and 3rd /' method.

Any ideas, please?

Use

df["URL"] = df["URL"].str.replace(r'/\?commcode[^/]*', '', regex=True)


Explanation

--------------------------------------------------------------------------------
  /                        '/'
--------------------------------------------------------------------------------
  \?                       '?'
--------------------------------------------------------------------------------
  commcode                 'commcode'
--------------------------------------------------------------------------------
  [^/]*                    any character except: '/' (0 or more times
                           (matching the most amount possible))
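
The pattern can be checked against codes of different lengths without a DataFrame. The following is a minimal sketch using Python's built-in re module, which uses the same regex syntax that pandas applies when regex=True. The helper name strip_commcode and the second and third URLs are made up for illustration and are not from the question.

```python
import re

# Same pattern as above: '/?commcode' followed by any run of non-slash characters.
PATTERN = re.compile(r'/\?commcode[^/]*')


def strip_commcode(url):
    """Remove the '/?commcode<code>' segment from a URL, if present (hypothetical helper)."""
    return PATTERN.sub('', url)


urls = [
    "https://www.website.co.uk/product/?commcodeABBB/home-page/",        # example from the question
    "https://www.website.co.uk/product/?commcodeA1B2C3D4E5/home-page/",  # hypothetical longer code
    "https://www.website.co.uk/product/home-page/",                      # hypothetical URL with no code
]

cleaned = [strip_commcode(u) for u in urls]
for u in cleaned:
    print(u)
```

Because [^/]* stops at the first /, the slash before home-page is kept, so the replacement string can be empty rather than '/'.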

You can do:

'\/\?commcode[A-Za-z0-9]*'

to specify which character groups you want included.
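
The explicit character class and the negated class [^/]* behave the same on purely alphanumeric codes, but differ once a code contains other characters. The sketch below illustrates this with Python's re module. The URL with a hyphenated code is a hypothetical example, not taken from the question.

```python
import re

# Explicit character class: only letters and digits count as part of the code.
alnum_pattern = re.compile(r'/\?commcode[A-Za-z0-9]*')
# Negated class: everything up to (but not including) the next '/' counts.
not_slash_pattern = re.compile(r'/\?commcode[^/]*')

plain = "https://www.website.co.uk/product/?commcodeABBB/home-page/"
hyphenated = "https://www.website.co.uk/product/?commcodeAB-12/home-page/"  # hypothetical code with a hyphen

# Both patterns give the same result for an alphanumeric code.
plain_alnum = alnum_pattern.sub('', plain)
plain_not_slash = not_slash_pattern.sub('', plain)

# With a hyphen, the explicit class stops early and leaves '-12' behind.
hyph_alnum = alnum_pattern.sub('', hyphenated)
hyph_not_slash = not_slash_pattern.sub('', hyphenated)

print(plain_alnum)
print(plain_not_slash)
print(hyph_alnum)
print(hyph_not_slash)
```

If the tracking codes are known to be strictly alphanumeric, the explicit class is the stricter choice; otherwise [^/]* is more forgiving.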
