I have an HTML string:
I was surfing http://www.google.com, where I found my tweet,
check it out <a href="http://tinyurl.com/blah">http://tinyurl.com/blah</a>
<span>http://www.google.com</span>
and I want to convert it to this:
I was surfing <a href="http://www.google.com">http://www.google.com</a>, where I found my tweet,
check it out <a href="http://tinyurl.com/blah">http://tinyurl.com/blah</a>
<span><a href="http://www.google.com">http://www.google.com</a></span>
I tried this demo.
My Python code is:
import re
# First alternative consumes existing <a ...>...</a> elements; second matches URLs.
p = re.compile(r'<a\b[^>]*>.*?</a>|((ftp|http|https):\/\/(\w+:{0,1}\w*@)?(\S+)(:[0-9]+)?(\/|\/([\w#!:.?+=&%@!\-\/]))?)', re.MULTILINE)
test_str = "I was surfing http://www.google.com, where I found my tweet, check it out <a href=\"http://tinyurl.com/blah\">http://tinyurl.com/blah</a>"
for item in re.finditer(p, test_str):
    print(item.group(0))
Output:
>>> http://www.google.com,
>>> <a href="http://tinyurl.com/blah">http://tinyurl.com/blah</a>
I hope this can help you.
Code:
import re
# The leading [^<">] rejects URLs that directly follow a quote or a tag,
# i.e. URLs that are already inside an <a href="..."> element.
p = re.compile(r'''[^<">]((ftp|http|https):\/\/(\w+:{0,1}\w*@)?(\S+)(:[0-9]+)?(\/|\/([\w#!:.?+=&%@!\-\/]))?)[^< ,"'>]''', re.MULTILINE)
test_str = "I was surfing http://www.google.com, where I found my tweet, check it out <a href=\"http://tinyurl.com/blah\">http://tinyurl.com/blah</a>"
for item in re.finditer(p, test_str):
    result = item.group(0)
    # Drop the leading space captured by the first character class.
    result = result.replace(' ', '')
    print(result)
    end_result = test_str.replace(result, '<a href="' + result + '">' + result + '</a>')
    print(end_result)
Output:
http://www.google.com
I was surfing <a href="http://www.google.com">http://www.google.com</a>, where I found my tweet, check it out <a href="http://tinyurl.com/blah">http://tinyurl.com/blah</a>
You could make the regex more complex, but as mikus suggested, it seems easier to do the following:
for item in re.finditer(p, test_str):
    result = item.group(0)
    # Skip matches that contain an existing <a ...> tag.
    if "<a " not in result.lower():
        print(result)
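To go from filtering matches to actually producing the converted string, the same check can be moved into a replacement callback for re.sub. The following is only a minimal sketch, not a definitive solution: the names URL_OR_ANCHOR, wrap_bare_url and linkify are made up for illustration, the URL part of the pattern is a simplified assumption rather than the pattern from the question, and it assumes existing anchors are not nested.

```python
import re

# Alternation: an existing <a ...>...</a> element is consumed whole (group 1
# stays empty); otherwise a bare URL is captured in group 1.
# The URL part is a simplified assumption, not the question's full pattern.
URL_OR_ANCHOR = re.compile(
    r'<a\b[^>]*>.*?</a>|((?:ftp|https?)://[^\s<>"\']+)',
    re.IGNORECASE | re.DOTALL)


def wrap_bare_url(match):
    url = match.group(1)
    if url is None:
        # The match is an existing anchor: return it unchanged.
        return match.group(0)
    # Keep trailing punctuation such as "," or "." outside the link.
    stripped = url.rstrip('.,;:!?')
    trailing = url[len(stripped):]
    return '<a href="%s">%s</a>%s' % (stripped, stripped, trailing)


def linkify(html):
    return URL_OR_ANCHOR.sub(wrap_bare_url, html)


test_str = ('I was surfing http://www.google.com, where I found my tweet,\n'
            'check it out <a href="http://tinyurl.com/blah">http://tinyurl.com/blah</a>\n'
            '<span>http://www.google.com</span>')
print(linkify(test_str))
```

Because the anchor alternative comes first, a URL that is already inside a link is never seen by the second alternative, so no separate "<a " check is needed in the callback.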
Ok, I think I finally found what you're looking for. The basic idea is to match an optional <a href prefix followed by a URL. If the <a href prefix is present, leave the match unchanged; if it is not, wrap the URL in a link. Here is the code:
import re
test_str = """I was surfing http://www.google.com, where I found my tweet,
check it out <a href="http://tinyurl.com/blah">http://tinyurl.com/blah</a>
<span>http://www.google.com</span>
"""
def repl_func(matchObj):
    href_tag, url = matchObj.groups()
    if href_tag:
        # Since it has an href tag, this isn't what we want to change,
        # so return the whole match.
        return matchObj.group(0)
    else:
        return '<a href="%s">%s</a>' % (url, url)
pattern = re.compile(
    r'((?:<a href[^>]+>)|(?:<a href="))?'
    r'((?:https?):(?:(?://)|(?:\\\\))+'
    r"(?:[\w\d:#@%/;$()~_?\+\-=\\\.&](?:#!)?)*)",
    flags=re.IGNORECASE)
result = re.sub(pattern, repl_func, test_str)
print(result)
Output:
I was surfing <a href="http://www.google.com">http://www.google.com</a>, where I found my tweet,
check it out <a href="http://tinyurl.com/blah">http://tinyurl.com/blah</a>
<span><a href="http://www.google.com">http://www.google.com</a></span>
The main idea is from https://stackoverflow.com/a/3580700/5100564 . I also borrowed from https://stackoverflow.com/a/6718696/5100564 .