
Regex deleting word and finding word

I would like to find and delete the https part in a sentence. I use re.search("^https://t.co/.*[a-zA-Z]", data) and the result is:

match='https://xx.x/ekGSeJufuH 7 jalan indonesia yang pa

match='https://xx.x/okbymT3g'

But I want to match only https://xx.x/ekGSeJufuH and delete it while keeping the rest of the sentence. Is there something wrong with my regex?

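.* matches any character, including whitespace (everything except a newline, by default). Because it is greedy, it runs to the end of the line and only backtracks as far as the last letter, so the match continues past the end of the URL.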
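A small sketch of that behaviour. The sample string is an assumption modeled on the output in the question, with the t.co host taken from the asker's pattern; `\S*` is shown only as a contrast, since it stops at the first whitespace character.

```python
import re

# Sample text modeled on the question; the exact string is an assumption.
data = "https://t.co/ekGSeJufuH 7 jalan indonesia yang pa"

# Greedy `.*` runs to the end of the line, then backtracks only until the
# final character of the match is a letter, so the words after the URL
# are swallowed as well.
greedy = re.search(r"^https://t\.co/.*[a-zA-Z]", data)

# `\S*` matches non-whitespace only, so it stops where the URL ends.
url_only = re.search(r"^https://t\.co/\S*", data)

print(greedy.group(0))    # the whole sentence
print(url_only.group(0))  # just the URL
```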

An easier way is:

  1. find a sentence starting with 'https://',
  2. find the first whitespace (' ') in the sentence,
  3. delete the substring before the whitespace.

I think this works because a URL cannot contain whitespace.
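The three steps can be sketched without a regex at all. This is only an illustration: the function name `drop_leading_url` is hypothetical, and it assumes the URL is always the first token of the sentence.

```python
def drop_leading_url(sentence: str) -> str:
    """Remove a leading https:// URL and return the rest of the sentence."""
    # Step 1: only act on sentences that start with the URL scheme.
    if not sentence.startswith("https://"):
        return sentence
    # Step 2: split once at the first run of whitespace.
    parts = sentence.split(None, 1)
    # Step 3: keep whatever follows the URL (nothing if the URL stands alone).
    return parts[1] if len(parts) > 1 else ""


print(drop_leading_url("https://xx.x/ekGSeJufuH 7 jalan indonesia yang pa"))
print(drop_leading_url("https://xx.x/okbymT3g"))
```

Splitting with `maxsplit=1` avoids scanning the rest of the sentence and leaves its internal spacing untouched.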

From what I understand, you just want to exclude the "https://" from the string. If so, this may be the regular expression you're looking for:

r"https://(.*)"

Using the above regular expression with the addresses you provided:

>>> import re
>>> regex = re.compile(r"https://(.*)")
>>> regex.search("https://xx.x/ekGSeJufuH 7 jalan indonesia yang pa").group(1)
'xx.x/ekGSeJufuH 7 jalan indonesia yang pa'
>>> regex.search("https://xx.x/okbymT3g").group(1)
'xx.x/okbymT3g'

If there are more criteria for the regular expression that I missed, just comment on my answer and I'll update it accordingly.

I tried tweaking it a little and solved it with:

re.search(r"^https://t.co/\S*", txt)
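Note that re.search only locates the URL; to actually delete it, the same pattern can be handed to re.sub. A minimal sketch, assuming the URL sits at the start of the text. The helper name `strip_url`, the escaped dot, and the trailing `\s*` (which also removes the space after the URL) are additions for illustration, not part of the original answer.

```python
import re

# Pattern from the answer above, with the dot escaped. The trailing `\s*`
# is an added assumption so that no leading space is left behind.
URL_AT_START = re.compile(r"^https://t\.co/\S*\s*")


def strip_url(txt: str) -> str:
    """Return txt with a leading t.co URL (and following whitespace) removed."""
    return URL_AT_START.sub("", txt)


print(strip_url("https://t.co/ekGSeJufuH 7 jalan indonesia yang pa"))
print(strip_url("https://t.co/okbymT3g"))
```

Text that does not start with such a URL is returned unchanged, because the pattern simply fails to match.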
