How to match everything after a certain word

Hello, I am trying to match everything after "http:" and remove it.

Example strings:

New species of fish found at Arkansas http: //t.co/E218nP6DZd

A new fish discovered in Arkansas ( PIGFISH ) http: //t.co/qqoMmHVItg

Expected result:

New species of fish found at Arkansas

A new fish discovered in Arkansas ( PIGFISH )

Thanks :)

A different way to approach this is to split the string on your target word and return the first part.

my_string="New species of fish found at Arkansas http://example"
print(my_string.split("http",1)[0])
#New species of fish found at Arkansas 
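As a rough sketch of the same split-based idea applied to both strings from the question, the snippet below also strips the trailing space left in front of the link. The helper name strip_after and its marker parameter are made up for illustration and are not part of the original answer.

```python
tweets = [
    "New species of fish found at Arkansas http: //t.co/E218nP6DZd",
    "A new fish discovered in Arkansas ( PIGFISH ) http: //t.co/qqoMmHVItg",
]


def strip_after(text, marker="http"):
    """Hypothetical helper: keep the text before the first marker."""
    # split(marker, 1) cuts at the first occurrence only; [0] is the part before it.
    # If marker is absent, split returns the whole string as the only element.
    return text.split(marker, 1)[0].rstrip()


cleaned = [strip_after(t) for t in tweets]
for line in cleaned:
    print(line)
```

Because split returns the whole string when the marker is missing, strings without a link pass through unchanged (apart from the trailing-whitespace strip).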

You can call the index() method on your string, which returns the index of the first occurrence of the given substring (and raises ValueError if the substring is not found). You can use this index to slice out the part you want:

s = "New species of fish found at Arkansas http: //example.com/E218nP6DZd"

s[:s.index('http')]
# 'New species of fish found at Arkansas '
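If some strings might not contain "http" at all, a possible variation is str.find(), which returns -1 instead of raising an exception. This is only a sketch; the function name before_marker is an assumption made for the example.

```python
def before_marker(text, marker="http"):
    """Hypothetical helper: slice off everything from marker onwards."""
    pos = text.find(marker)  # find() returns -1 when marker is absent
    if pos == -1:
        return text  # nothing to remove
    return text[:pos]


s = "New species of fish found at Arkansas http: //example.com/E218nP6DZd"
result = before_marker(s)
print(repr(result))
```

Checking for -1 explicitly matters here: slicing with s[:-1] would silently drop the last character instead of leaving the string untouched.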

You need a regex that captures what comes before the http. You can use search/match and print the capturing group, or use findall; either way you end up with the same text.

import re

values = ["New species of fish found at Arkansas http: //urlshorten",
          "A new fish discovered in Arkansas ( PIGFISH ) http: //urlshorten"]

reg = re.compile("(.*)http")
for value in values:
    txt = reg.findall(value)  # list containing the captured prefix
    print(txt)

    txt = reg.search(value)  # or match
    print(txt.groups())  # tuple containing the captured prefix

import re

web_string = 'A new fish discovered in Arkansas ( PIGFISH ) http: //website.com/qqoMmHVItg'
# Raw string so that \( and \) reach the regex engine as escaped parentheses
match_group = re.match(r'(.*\( PIGFISH \)) (http.*$)', web_string)

no_http_string = match_group[1]
print(no_http_string)

This should print:

A new fish discovered in Arkansas ( PIGFISH )
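The pattern above is tied to the literal text "( PIGFISH )". A more general variant, sketched here under the assumption that the link always runs to the end of the string, is to delete the tail with re.sub. The names URL_TAIL and drop_url_tail are invented for this example.

```python
import re

# Optional whitespace, then "http", then everything up to the end of the line.
URL_TAIL = re.compile(r"\s*http.*$")


def drop_url_tail(text):
    """Hypothetical helper: remove "http..." and the whitespace before it."""
    return URL_TAIL.sub("", text)


web_string = "A new fish discovered in Arkansas ( PIGFISH ) http: //website.com/qqoMmHVItg"
no_http_string = drop_url_tail(web_string)
print(no_http_string)
```

Since sub leaves the string alone when nothing matches, text without a link needs no special handling.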

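You can always use a regex to check whether the text contains a URL.

import re

text = "New species of fish found at Arkansas http: //t.co/E218nP6DZd"
if re.search("http", text):
    pass  # code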

As azro said, it is easier to capture what comes before "http" and ignore the rest.

Here is a regex, written for the re package, that captures ( ... ) any run of word characters \w or whitespace \s leading up to "http"; the text "http" and everything after it (.*) are matched but not included in the captured group.

([\w\s]*)http.*

[\w\s]* matches any number of word characters (letters, digits, underscore) or whitespace characters

() puts that in a capture group

http.* matches the exact text "http" and any number of any characters after it, up to the end of the line.

Here is the Python code I ran on your string:

>>> s = "New species of fish found at Arkansas https://twitter.com/oliviadodson_/status/445043948969398272/photo/1"
>>> import re
>>> pat = re.compile(r'([\w\s]*)http.*')
>>> m = pat.search(s)
>>> m.group(1)
'New species of fish found at Arkansas '

This only works on one line of text at a time, because . does not match newlines by default. You can modify it to fit your exact use case, for example by including punctuation in the capture. Use a for loop to iterate over the lines of a paragraph.
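The two suggestions above (widening the capture and looping over lines) could look roughly like the following. This is a sketch, not the answerer's code: the pattern wide_pat, the sample paragraph, and the variable names are assumptions for illustration. It also shows why the narrower class needs widening: on the "( PIGFISH )" string, [\w\s]* cannot cross the parentheses.

```python
import re

narrow_pat = re.compile(r"([\w\s]*)http.*")  # pattern from the answer
wide_pat = re.compile(r"(.*?)\s*http.*")     # assumed variant: any characters, lazily

paragraph = (
    "New species of fish found at Arkansas http: //t.co/E218nP6DZd\n"
    "A new fish discovered in Arkansas ( PIGFISH ) http: //t.co/qqoMmHVItg\n"
    "No link on this line"
)

# The narrow class stops at ")" so only the space before "http" is captured.
pigfish_line = paragraph.splitlines()[1]
narrow_capture = narrow_pat.search(pigfish_line).group(1)

cleaned_lines = []
for line in paragraph.splitlines():
    m = wide_pat.match(line)
    # Keep the captured prefix when a link is found, else keep the line as-is.
    cleaned_lines.append(m.group(1) if m else line)

print("\n".join(cleaned_lines))
```

The lazy (.*?) followed by \s* means the captured text ends before the whitespace in front of "http", so no extra strip is needed.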
