How to match everything after a certain word
Hello, I am trying to match everything after "http:" and get rid of it.
Example strings:
New species of fish found at Arkansas http: //t.co/E218nP6DZd
A new fish discovered in Arkansas ( PIGFISH ) http: //t.co/qqoMmHVItg
Expected result:
New species of fish found at Arkansas
A new fish discovered in Arkansas ( PIGFISH )
Thanks :)
A different way to approach this is to split the string on your target word and return the first part.
my_string="New species of fish found at Arkansas http://example"
print(my_string.split("http",1)[0])
#New species of fish found at Arkansas
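The split approach above leaves a trailing space and does not say what happens when the word is missing. A minimal sketch of the same idea using str.partition, which returns the whole string as the first element when the separator is absent; the helper name strip_after and the marker parameter are hypothetical, not from the answer:

```python
def strip_after(text, marker="http"):
    """Return the part of text before the first marker, minus trailing whitespace."""
    # partition() splits on the first occurrence only; if marker is absent,
    # head is the whole string and the other two parts are empty.
    head, _sep, _tail = text.partition(marker)
    return head.rstrip()


print(strip_after("New species of fish found at Arkansas http: //t.co/E218nP6DZd"))
print(strip_after("No link in this one"))
```

Whether to strip the trailing whitespace is a choice; drop the rstrip() call to keep the text exactly as it appeared before the marker.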
You can call the index() function on your string, which returns the index of the first occurrence of the passed-in substring. You can use this to directly slice the part you want:
s = "New species of fish found at Arkansas http: //example.com/E218nP6DZd"
s[:s.index('http')]
# 'New species of fish found at Arkansas '
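One caveat with index(): it raises ValueError when the substring is not found. A small sketch, assuming you would rather get the original string back in that case, using str.find, which returns -1 instead of raising; cut_at is a hypothetical helper name:

```python
def cut_at(s, marker="http"):
    """Slice s up to the first marker; return s unchanged if marker is absent."""
    pos = s.find(marker)  # -1 when marker does not occur in s
    return s if pos == -1 else s[:pos]


print(repr(cut_at("New species of fish found at Arkansas http: //example.com/E218nP6DZd")))
print(repr(cut_at("No link here")))
```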
You need a regex that captures what comes before the http. You can use search/match and print the capturing group, or use findall; either way you end up with the same text.
import re

values = ["New species of fish found at Arkansas http: //urlshorten",
          "A new fish discovered in Arkansas ( PIGFISH ) http: //urlshorten"]
reg = re.compile("(.*)http")
for value in values:
    txt = reg.findall(value)  # list containing the captured text
    print(txt)
    txt = reg.search(value)  # or match
    print(txt.groups())  # tuple containing the captured text
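Note that (.*) is greedy, so if a string contains more than one "http" the capture runs up to the last one. A short sketch of the difference with the non-greedy form (.*?); the input string here is a made-up example with two links, not one from the question:

```python
import re

# Hypothetical input with two links.
text = "New fish http: //first.example and more http: //second.example"

greedy = re.search(r"(.*)http", text).group(1)   # backtracks to the LAST "http"
lazy = re.search(r"(.*?)http", text).group(1)    # stops at the FIRST "http"

print(repr(greedy))  # 'New fish http: //first.example and more '
print(repr(lazy))    # 'New fish '
```

For the single-link strings in the question both forms give the same capture.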
import re
web_string = 'A new fish discovered in Arkansas ( PIGFISH ) http: //website.com/qqoMmHVItg'
match_group = re.match(r'(.*\( PIGFISH \)) (http.*$)', web_string)  # raw string so \( and \) reach the regex engine unchanged
no_http_string = match_group[1]
print(no_http_string)
This should yield:
A new fish discovered in Arkansas ( PIGFISH )
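The pattern above is tied to the literal "( PIGFISH )" text. Since the goal is to get rid of everything from "http" onward, a more general sketch is to delete that part with re.sub rather than capture what precedes it. The pattern and the helper name drop_link are assumptions for illustration, and this assumes each string is a single line:

```python
import re

# Optional whitespace, then "http", then the rest of the line.
LINK_TAIL = re.compile(r"\s*http.*")


def drop_link(text):
    """Remove the first 'http' and everything after it on the line."""
    return LINK_TAIL.sub("", text)


print(drop_link("A new fish discovered in Arkansas ( PIGFISH ) http: //t.co/qqoMmHVItg"))
print(drop_link("New species of fish found at Arkansas http: //t.co/E218nP6DZd"))
```

Strings without "http" pass through unchanged, because sub() only replaces where the pattern matches.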
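You can always use a regex to match a URL.
import re
text = "New species of fish found at Arkansas http: //t.co/E218nP6DZd"  # example string from the question
if re.search("http", text):  # search is a function of the re module, not a str method
    pass  # code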
As azro said, it is easier to capture what is before "http:" and ignore the rest.
Here is a regex I tried using the re package. It captures ( ... ) any alphanumeric \w or whitespace \s characters at the start of the string, while the text "http" and any number of characters of any type afterwards .* are not included in the captured group.
([\w\s]*)http.*
[\w\s]* matches any number of alphanumerics or spaces
() includes that in a capture group
http.* matches the exact text "http" and any number of any character afterwards.
Here is the Python code I ran on your string:
>>> s = "New species of fish found at Arkansas https://twitter.com/oliviadodson_/status/445043948969398272/photo/1"
>>> import re
>>> pat = re.compile( r'([\w\s]*)http.*' )
>>> m = pat.search( s ); print(m)
>>> m.group(1)
'New species of fish found at Arkansas '
This only works on a single line of text at a time (.* does not match the newline at the end). You can modify it to fit your exact use case, for example by including punctuation in the capture. Use a for loop to iterate through a paragraph.
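A rough sketch of both suggestions together: the character class is widened with a few punctuation marks so that the "( PIGFISH )" string is captured, and a for loop applies the pattern line by line. The exact set of punctuation, the lazy quantifier, and the variable names are assumptions; adjust the class to whatever your text contains.

```python
import re

# Word characters, whitespace and some punctuation, captured lazily so the
# whitespace right before "http" is left out of the group.
pat = re.compile(r"([\w\s().,!?'-]*?)\s*http.*")

paragraph = """New species of fish found at Arkansas http: //t.co/E218nP6DZd
A new fish discovered in Arkansas ( PIGFISH ) http: //t.co/qqoMmHVItg"""

cleaned = []
for line in paragraph.splitlines():
    m = pat.match(line)
    # Keep the line as-is when it has no "http" or contains a character
    # outside the class before it.
    cleaned.append(m.group(1) if m else line)

for line in cleaned:
    print(line)
```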