Find all HTML and non-HTML encoded URLs in string

I would like to find all URLs in a string. I found various solutions on StackOverflow that vary depending on the content of the string.

For example, supposing my string contained HTML, this answer recommends using either BeautifulSoup or lxml.

On the other hand, if my string contained only a plain URL without HTML tags, this answer recommends using a regular expression.

I wasn't able to find a good solution given that my string contains both an HTML-encoded URL and a plain URL. Here is some example code:

import lxml.html

example_data = """<a href="http://www.some-random-domain.com/abc123/def.html">Click Me!</a>
http://www.another-random-domain.com/xyz.html"""
dom = lxml.html.fromstring(example_data)
for link in dom.xpath('//a/@href'):
    print("Found Link: ", link)

As expected, this results in:

Found Link:  http://www.some-random-domain.com/abc123/def.html

I also tried the twitter-text-python library that @Yannisp mentioned, but it doesn't seem to extract both URLs:

>>> from ttp.ttp import Parser
>>> p = Parser()
>>> r = p.parse(example_data)
>>> r.urls
['http://www.another-random-domain.com/xyz.html']

What is the best approach for extracting both kinds of URLs from a string containing a mix of HTML and non-HTML encoded data? Is there a good module that already does this? Or am I forced to combine regex with BeautifulSoup / lxml?

I upvoted because it triggered my curiosity. There seems to be a library called twitter-text-python that parses Twitter posts to detect both URLs and hrefs. Otherwise, I would go with the combination of regex + lxml.
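A minimal sketch of that regex + lxml combination might look like the following. The `find_urls` name and the regex pattern are illustrative assumptions, not from the original answer:

```python
import re
import lxml.html

def find_urls(data):
    """Collect <a href> URLs with lxml, then regex-match bare URLs in the text."""
    dom = lxml.html.fromstring(data)
    urls = set(dom.xpath('//a/@href'))
    # text_content() drops the markup, so the regex only sees plain text
    # and cannot double-match URLs that live inside href attributes.
    urls.update(re.findall(r"https?://[^\s\"'<>]+", dom.text_content()))
    return sorted(urls)
```

Running it on the question's `example_data` should yield both URLs, deduplicated and sorted.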

You could use a regular expression to find all URLs:

import re
urls = re.findall(r"(https?://[\w/$\-_.+!*'()]+)", example_data)

It includes alphanumerics, '/', and the other characters allowed in a URL.
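Applied to the `example_data` string from the question, this pattern finds both URLs, since nothing stops it from matching inside an href attribute. A sketch (the pattern is written as a raw string here to avoid invalid string-escape warnings in Python 3):

```python
import re

example_data = """<a href="http://www.some-random-domain.com/abc123/def.html">Click Me!</a>
http://www.another-random-domain.com/xyz.html"""

# The closing double quote after the href value is not in the character
# class, so the first match stops there; whitespace ends the second match.
urls = re.findall(r"(https?://[\w/$\-_.+!*'()]+)", example_data)
```

Note that this approach relies on the URL-terminating character (quote, whitespace, `<`) not being in the character class.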

Based on the answer by @YannisP, I was able to come up with this solution:

import lxml.html  
from ttp.ttp import Parser

def extract_urls(data):
    urls = set()
    # First extract HTML-encoded URLs
    dom = lxml.html.fromstring(data)
    for link in dom.xpath('//a/@href'):
        urls.add(link)
    # Next, extract URLs from plain text
    parser = Parser()
    results = parser.parse(data)
    for url in results.urls:
        urls.add(url)
    return list(urls)

This results in:

>>> example_data
'<a href="http://www.some-random-domain.com/abc123/def.html">Click Me!</a>\nhttp://www.another-random-domain.com/xyz.html'
>>> urls = extract_urls(example_data)
>>> print(urls)
['http://www.another-random-domain.com/xyz.html', 'http://www.some-random-domain.com/abc123/def.html']

I'm not sure how well this will work on other URLs, but it seems to work for what I need it to do.
