How to match everything after a certain word

Question

Hello, I am trying to match everything from "http:" onward and remove it.

Example strings:

New species of fish found at Arkansas http: //t.co/E218nP6DZd

A new fish discovered in Arkansas ( PIGFISH ) http: //t.co/qqoMmHVItg

Expected result:

New species of fish found at Arkansas

A new fish discovered in Arkansas ( PIGFISH )

Thanks :)

Answer 1

A different way to approach this is to split the string on your target word and return the first part.

my_string = "New species of fish found at Arkansas http://example"
# Split once on the target word and keep the part before it.
print(my_string.split("http", 1)[0])
# New species of fish found at Arkansas
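
As a small extension of the split idea, here is a sketch that wraps it in a helper and also strips the trailing space left in front of the marker. The name strip_after and the default marker are illustrative assumptions, not something from the question.

```python
def strip_after(text, marker="http"):
    """Return the part of text before the first marker, without trailing whitespace."""
    # maxsplit=1 stops after the first occurrence; if the marker is absent,
    # split() returns the whole string as its only element.
    return text.split(marker, 1)[0].rstrip()


examples = [
    "New species of fish found at Arkansas http: //t.co/E218nP6DZd",
    "A new fish discovered in Arkansas ( PIGFISH ) http: //t.co/qqoMmHVItg",
    "No link in this one",
]
for example in examples:
    print(strip_after(example))
```

A string without the marker comes back unchanged (apart from trailing whitespace), so the helper should be safe to run over mixed input.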

Answer 2

You can call the index() method on your string, which returns the index of the first occurrence of the given substring (and raises ValueError if the substring is not found). You can use this index to slice out the part you want:

s = "New species of fish found at Arkansas http: //example.com/E218nP6DZd"

s[:s.index('http')]
# 'New species of fish found at Arkansas '
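
If some strings may not contain "http" at all, index() raises ValueError. One possible way around that, sketched here with the hypothetical helper name cut_at, is to use find(), which returns -1 when the substring is missing:

```python
def cut_at(text, marker="http"):
    """Slice text up to the first marker; return it unchanged if the marker is absent."""
    position = text.find(marker)  # find() returns -1 instead of raising
    if position == -1:
        return text
    # Drop the whitespace that sat between the sentence and the marker.
    return text[:position].rstrip()


print(cut_at("New species of fish found at Arkansas http: //t.co/E218nP6DZd"))
print(cut_at("No link in this one"))
```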

Answer 3

You need a regex that captures what comes before the "http". You can use search (or match) and read the capturing group, or use findall; either way you end up with the same text.

import re

values = ["New species of fish found at Arkansas http: //urlshorten",
          "A new fish discovered in Arkansas ( PIGFISH ) http: //urlshorten"]

reg = re.compile("(.*)http")
for value in values:
    txt = reg.findall(value)  # list with the captured text
    print(txt)

    txt = reg.search(value)  # or reg.match(value)
    print(txt.groups())  # tuple with the captured text
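
Since the goal is to get rid of the URL rather than to inspect a match object, a related option (not what the answer above does) is to delete the unwanted tail with re.sub. This is only a sketch; the function name remove_url_tail is an assumption made for illustration.

```python
import re

# Optional whitespace, then "http", then everything up to the end of the line.
URL_TAIL = re.compile(r"\s*http.*")


def remove_url_tail(text):
    """Delete the first 'http' and everything after it on the same line."""
    return URL_TAIL.sub("", text)


values = [
    "New species of fish found at Arkansas http: //t.co/E218nP6DZd",
    "A new fish discovered in Arkansas ( PIGFISH ) http: //t.co/qqoMmHVItg",
]
for value in values:
    print(remove_url_tail(value))
```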

Answer 4

import re

web_string = 'A new fish discovered in Arkansas ( PIGFISH ) http: //website.com/qqoMmHVItg'
# Raw string so that \( and \) reach the regex engine as escaped parentheses.
match_group = re.match(r'(.*\( PIGFISH \)) (http.*$)', web_string)

no_http_string = match_group[1]
print(no_http_string)

This should print:

A new fish discovered in Arkansas ( PIGFISH )

Answer 5

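You can always use a regex to check whether the string contains a URL.

import re

text = "New species of fish found at Arkansas http: //t.co/E218nP6DZd"
# Strings have no search() method; call re.search(pattern, string) instead.
if re.search("http", text):
    pass  # handle the string that contains a URL here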

Answer 6

As azro said, it is easier to capture what comes before "http:" and ignore the rest.

Here is a regex I tried with the re package. It captures ( ... ) any run of alphanumeric characters \w or whitespace \s at the start of the string; the text "http" and everything after it (.*) are not included in the captured group.

([\w\s]*)http.*

[\w\s]* matches any number of alphanumerics or spaces

() includes that in a capture group

http.* matches the exact text "http" and any number of any character afterwards.

Here is the Python code I ran on your string:

>>> import re
>>> s = "New species of fish found at Arkansas https://twitter.com/oliviadodson_/status/445043948969398272/photo/1"
>>> pat = re.compile(r'([\w\s]*)http.*')
>>> m = pat.search(s)
>>> m.group(1)
'New species of fish found at Arkansas '

This works on one line of text at a time, because . does not match newlines by default. You can adapt it to your exact use case, for example by allowing punctuation in the captured part. To process a whole paragraph, loop over its lines.
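
The adjustments mentioned above could look roughly like this. The sketch swaps the [\w\s]* character class for a lazy (.*?) so that punctuation such as "( PIGFISH )" is captured too, and it loops over the lines of a paragraph. The names PATTERN and clean_lines are assumptions made for this example.

```python
import re

# Lazy capture of everything before the first "http" on a line,
# leaving out the whitespace that separates the text from the URL.
PATTERN = re.compile(r"(.*?)\s*http.*")


def clean_lines(paragraph):
    """Return the lines of paragraph with any trailing 'http...' part removed."""
    cleaned = []
    for line in paragraph.splitlines():
        m = PATTERN.match(line)
        # Lines without "http" do not match and are kept as they are.
        cleaned.append(m.group(1) if m else line)
    return cleaned


paragraph = (
    "New species of fish found at Arkansas http: //t.co/E218nP6DZd\n"
    "A new fish discovered in Arkansas ( PIGFISH ) http: //t.co/qqoMmHVItg\n"
    "No link in this one"
)
for line in clean_lines(paragraph):
    print(line)
```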

6 answers

solution1
2 ACCEPTED 2020-01-01 22:00:10

solution2
2 2020-01-01 22:06:35

solution3
1 2020-01-01 21:55:58

solution4
1 2020-01-01 22:05:46

solution5
0 2020-01-01 21:53:24

solution6
0 2020-01-01 22:29:44