Hello, I am trying to match everything after "http:" in a string and remove it.
Example strings:
New species of fish found at Arkansas http: //t.co/E218nP6DZd
A new fish discovered in Arkansas ( PIGFISH ) http: //t.co/qqoMmHVItg
Expected result:
New species of fish found at Arkansas
A new fish discovered in Arkansas ( PIGFISH )
Thanks :)
A different way to approach this is to split the string on your target word and return the first part.
my_string="New species of fish found at Arkansas http://example"
print(my_string.split("http",1)[0])
#New species of fish found at Arkansas
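As a rough sketch, the same split can be applied to both strings from the question. The helper name `strip_after_http` is made up for illustration, and the `rstrip()` call is an addition that drops the space left in front of the link.

```python
def strip_after_http(text):
    # Hypothetical helper: keep everything before the first "http",
    # then drop the trailing whitespace left in front of the link.
    return text.split("http", 1)[0].rstrip()

tweets = [
    "New species of fish found at Arkansas http: //t.co/E218nP6DZd",
    "A new fish discovered in Arkansas ( PIGFISH ) http: //t.co/qqoMmHVItg",
]
cleaned = [strip_after_http(t) for t in tweets]
for line in cleaned:
    print(line)
```

A string that contains no "http" comes back unchanged apart from trailing whitespace, because split() then returns a one-element list.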
You can call the index() function on your string, which returns the index of the first occurrence of the given substring. You can use this to slice out the part you want directly:
s = "New species of fish found at Arkansas http: //example.com/E218nP6DZd"
s[:s.index('http')]
# 'New species of fish found at Arkansas '
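One thing to watch for: index() raises ValueError when the substring is missing. A minimal sketch of a variant that tolerates strings without a link, using str.find instead (the helper name `cut_at_http` is hypothetical):

```python
def cut_at_http(s):
    # Hypothetical helper: str.find returns -1 when "http" is missing,
    # whereas str.index would raise ValueError.
    pos = s.find("http")
    return s if pos == -1 else s[:pos]

with_link = cut_at_http("New species of fish found at Arkansas http: //example.com/E218nP6DZd")
without_link = cut_at_http("No link in this one")
print(repr(with_link))
print(repr(without_link))
```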
You need a regex that captures what comes before the http. You can use search/match and print the capturing group, or use findall; either way you end up with the same result.
import re

values = ["New species of fish found at Arkansas http: //urlshorten",
          "A new fish discovered in Arkansas ( PIGFISH ) http: //urlshorten"]
reg = re.compile(r"(.*)http")  # greedy: capture everything before the last "http"
for value in values:
    txt = reg.findall(value)   # list containing the captured text
    print(txt)
    txt = reg.search(value)    # or match
    print(txt.groups())        # tuple containing the captured text
import re
web_string = 'A new fish discovered in Arkansas ( PIGFISH ) http: //website.com/qqoMmHVItg'
match_group = re.match(r'(.*\( PIGFISH \)) (http.*$)', web_string)
no_http_string = match_group[1]
print(no_http_string)
This should print:
A new fish discovered in Arkansas ( PIGFISH )
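The pattern above is tied to the literal text "( PIGFISH )". A more general sketch, which swaps in a different technique, is to delete the link with re.sub rather than capture what precedes it. The helper name `drop_link` is made up for illustration, and the pattern assumes the link always runs to the end of the string.

```python
import re

def drop_link(text):
    # Remove optional whitespace, the literal "http", and the rest of the string.
    return re.sub(r"\s*http.*$", "", text)

examples = [
    "New species of fish found at Arkansas http: //t.co/E218nP6DZd",
    "A new fish discovered in Arkansas ( PIGFISH ) http: //t.co/qqoMmHVItg",
]
results = [drop_link(e) for e in examples]
for r in results:
    print(r)
```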
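You can always use a regex to check whether the text contains a URL.
import re
text = "New species of fish found at Arkansas http: //t.co/E218nP6DZd"
if re.search("http", text):
    pass  # code to run when "http" is found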
As azro said, it is easier to capture what is before "http:" and ignore the rest.
Here is a regex I tried using the re package. It captures ( ... ) any alphanumeric \w or whitespace \s characters at the start of the string, while the text "http" and any number of characters of any type after it (.*) are not included in the captured group.
([\w\s]*)http.*
[\w\s]* matches any number of alphanumerics or spaces
() includes that in a capture group
http.* matches the exact text "http" and any number of any character afterwards.
Here is the Python code I ran on your string:
>>> import re
>>> s = "New species of fish found at Arkansas https://twitter.com/oliviadodson_/status/445043948969398272/photo/1"
>>> pat = re.compile(r'([\w\s]*)http.*')
>>> m = pat.search(s)
>>> m.group(1)
'New species of fish found at Arkansas '
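This only works on one line of text at a time: "." does not match a newline, so http.* stops at the end of the line. Note also that [\w\s]* does not match punctuation, so for a string like the "( PIGFISH )" example the capture only starts after the last punctuation mark before "http". You can modify the pattern to fit your exact use case, for example by including punctuation in the capture. Use a for loop to iterate through the lines of a paragraph.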
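A minimal sketch of those two modifications, assuming a lazy (.*?) capture is acceptable in place of [\w\s]* so that punctuation is kept. The pattern, the helper name `clean_line`, and the third sample line are illustrative assumptions, not part of the answer above.

```python
import re

# Assumed pattern: lazily capture any characters (punctuation included)
# up to optional whitespace followed by the literal "http".
pattern = re.compile(r"(.*?)\s*http.*")

def clean_line(line):
    # Hypothetical helper: return the text before the link,
    # or the line unchanged when it contains no "http".
    m = pattern.match(line)
    return line if m is None else m.group(1)

paragraph = (
    "New species of fish found at Arkansas http: //t.co/E218nP6DZd\n"
    "A new fish discovered in Arkansas ( PIGFISH ) http: //t.co/qqoMmHVItg\n"
    "No link in this line"
)
# Process the paragraph one line at a time.
cleaned_lines = [clean_line(line) for line in paragraph.splitlines()]
print("\n".join(cleaned_lines))
```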