
How can I get href links from HTML using Python?

import urllib2

website = "WEBSITE"
openwebsite = urllib2.urlopen(website)
html = openwebsite.read()

print html

So far so good.

But I want only the href links from the plain-text HTML. How can I solve this problem?

Try with BeautifulSoup:

from BeautifulSoup import BeautifulSoup
import urllib2
import re

html_page = urllib2.urlopen("http://www.yourwebsite.com")
soup = BeautifulSoup(html_page)
for link in soup.findAll('a'):
    print link.get('href')

In case you just want links starting with http://, you should use:

soup.findAll('a', attrs={'href': re.compile("^http://")})
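
For example, to collect those matches into a list (a short sketch reusing the soup and re objects from the snippet above; http_links is just an illustrative name):

# Gather only hrefs that start with http:// into a list
http_links = [link['href'] for link in soup.findAll('a', attrs={'href': re.compile("^http://")})]
print http_links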

In Python 3 with BS4 it should be:

from bs4 import BeautifulSoup
import urllib.request

html_page = urllib.request.urlopen("http://www.yourwebsite.com")
soup = BeautifulSoup(html_page, "html.parser")
for link in soup.findAll('a'):
    print(link.get('href'))

You can use the HTMLParser module.

The code would probably look something like this:

from HTMLParser import HTMLParser

class MyHTMLParser(HTMLParser):

    def handle_starttag(self, tag, attrs):
        # Only parse the 'anchor' tag.
        if tag == "a":
            # Check the list of defined attributes.
            for name, value in attrs:
                # If href is defined, print it.
                if name == "href":
                    print name, "=", value


parser = MyHTMLParser()
parser.feed(your_html_string)

Note: The HTMLParser module has been renamed to html.parser in Python 3.0. The 2to3 tool will automatically adapt imports when converting your sources to 3.0.
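
For reference, here is a minimal Python 3 sketch of the same parser (assuming, as above, that your_html_string holds the page source):

from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):

    def handle_starttag(self, tag, attrs):
        # Only parse the 'anchor' tag.
        if tag == "a":
            # Check the list of defined attributes.
            for name, value in attrs:
                # If href is defined, print it.
                if name == "href":
                    print(name, "=", value)


parser = MyHTMLParser()
parser.feed(your_html_string)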

Look at using the Beautiful Soup HTML parsing library.

http://www.crummy.com/software/BeautifulSoup/

You will do something like this:

import BeautifulSoup
soup = BeautifulSoup.BeautifulSoup(html)
for link in soup.findAll("a"):
    print link.get("href")

Using BS4 for this specific task seems overkill.

Try instead:

import re
import urllib2

website = urllib2.urlopen('http://10.123.123.5/foo_images/Repo/')
html = website.read()
files = re.findall('href="(.*tgz|.*tar.gz)"', html)
print sorted(files)

I found this nifty piece of code on http://www.pythonforbeginners.com/code/regular-expression-re-findall and it works quite well for me.

I tested it only on my scenario of extracting a list of files from a web folder that exposes the files and folders in it (a plain directory listing), and I got a sorted list of the files and folders under the URL.

My answer probably sucks compared to the real gurus out there, but using some simple math, string slicing, find, and urllib, this little script will create a list containing link elements. I tested it on Google and my output seems right. Hope it helps!

import urllib

test = urllib.urlopen("http://www.google.com").read()
sane = 0
needlestack = []
while sane == 0:
  curpos = test.find("href")  # locate the next href attribute
  if curpos >= 0:
    testlen = len(test)
    test = test[curpos:testlen]  # drop everything before it
    curpos = test.find('"')
    testlen = len(test)
    test = test[curpos+1:testlen]  # skip past the opening quote
    curpos = test.find('"')
    needle = test[0:curpos]  # the value between the quotes
    if needle.startswith(("http", "www")):  # startswith accepts a tuple of prefixes
        needlestack.append(needle)
  else:
    sane = 1  # no more href occurrences; stop
for item in needlestack:
  print item

Here's a lazy version of @stephen's answer:

import html.parser
import itertools
import urllib.request

class LinkParser(html.parser.HTMLParser):
    def reset(self):
        super().reset()
        self.links = iter([])

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for (name, value) in attrs:
                if name == 'href':
                    self.links = itertools.chain(self.links, [value])


def gen_links(stream, parser):
    encoding = stream.headers.get_content_charset() or 'UTF-8'
    for line in stream:
        parser.feed(line.decode(encoding))
        yield from parser.links

Use it like so:

>>> parser = LinkParser()
>>> stream = urllib.request.urlopen('http://stackoverflow.com/questions/3075550')
>>> links = gen_links(stream, parser)
>>> next(links)
'//stackoverflow.com'
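
Because gen_links yields links lazily, you only parse as much of the page as you consume. For example, itertools.islice pulls the next few links on demand (a usage sketch continuing the session above):

>>> import itertools
>>> next_five = list(itertools.islice(links, 5))  # parses only as much HTML as needed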

Using requests with BeautifulSoup and Python 3:

import requests 
from bs4 import BeautifulSoup


page = requests.get('http://www.website.com')
bs = BeautifulSoup(page.content, features='lxml')
for link in bs.findAll('a'):
    print(link.get('href'))

This is a late answer, but it will work for users of recent Python versions:

from bs4 import BeautifulSoup
import requests 


html_page = requests.get('http://www.example.com').text

soup = BeautifulSoup(html_page, "lxml")
for link in soup.findAll('a'):
    print(link.get('href'))

Don't forget to install the requests and BeautifulSoup packages, as well as lxml. Use .text along with get, otherwise it will throw an exception.

" lxml " is used to remove that warning of which parser to be used. lxml ”用于删除要使用哪个解析器的警告。 You can also use " html.parser " whichever fits your case.您也可以使用“ html.parser ”适合您的情况。

This answer is similar to the other requests and BeautifulSoup answers, but uses a list comprehension.

Because find_all() is the most popular method in the Beautiful Soup search API, you can use soup("a") as a shortcut for soup.findAll("a") combined with a list comprehension:

import requests
from bs4 import BeautifulSoup

URL = "http://www.yourwebsite.com"
page = requests.get(URL)
soup = BeautifulSoup(page.content, features='lxml')
# Find links
all_links = [link.get("href") for link in soup("a")]
# Only external links
ext_links = [link.get("href") for link in soup("a") if link.get("href") and "http" in link.get("href")]

https://www.crummy.com/software/BeautifulSoup/bs4/doc/#calling-a-tag-is-like-calling-find-all
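
If you also want relative hrefs (for example /questions/123) as absolute URLs, urllib.parse.urljoin can resolve them against the page URL; a short sketch reusing URL and soup from above:

from urllib.parse import urljoin

# Resolve every href, relative or absolute, against the page URL (illustrative)
abs_links = [urljoin(URL, link.get("href")) for link in soup("a") if link.get("href")]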

Simplest way for me:

from urlextract import URLExtract
import requests

url = "http://sample.com/samplepage/"
req = requests.get(url)
text = req.text
# or if you already have the html source:
# text = "This is html for ex <a href='http://google.com/'>Google</a> <a href='http://yahoo.com/'>Yahoo</a>"
text = text.replace(' ', '').replace('=','')
extractor = URLExtract()
print(extractor.find_urls(text))

Output:

['http://google.com/', 'http://yahoo.com/']
