
Python – Extract certain links from website

I want to extract certain links from a website.

To extract all links, I tried:

import urllib.request
from bs4 import BeautifulSoup

url = 'http://pdok.bundestag.de/index.php?qsafe=&aload=off&q=kleine+anfrage&x=0&y=0&df=22.10.2013&dt=13.01.2016'
# Download the raw HTML exactly as the server sends it
uh = urllib.request.urlopen(url)
data = uh.read()
soup = BeautifulSoup(data, 'html.parser')

# Print every <a> tag found in the downloaded HTML
for href in soup.find_all('a'):
    print(href)

Now I get a list of links, but for some reason I don't get the important links that are in tbody. I also tried using ElementTree, but I get an error just reading the link, because it uses some invalid symbols or so (?). Any help is much appreciated! :)

urllib fetches the site's HTML exactly as the server sends it and does not execute JavaScript. The links you are trying to scrape in the tbody are rendered by JavaScript, so they never load and are missing from the HTML you parse.
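To see the effect without touching the real site, here is a small offline sketch for Python 3. The page in STATIC_HTML is made up for illustration (it is not the Bundestag page), and the standard library's html.parser stands in for BeautifulSoup so that nothing needs to be installed. The names LinkCollector and extract_links are hypothetical helpers.

```python
from html.parser import HTMLParser

# Hypothetical page: the <tbody> is empty in the HTML the server sends,
# and a script fills it in later, which only happens inside a browser.
STATIC_HTML = """
<html><body>
  <a href="/help">Help</a>
  <table><tbody id="results"></tbody></table>
  <script>
    document.getElementById("results").innerHTML =
      '<tr><td><a href="/doc/1.pdf">Document 1</a></td></tr>';
  </script>
</body></html>
"""


class LinkCollector(HTMLParser):
    """Collect the href of every <a> start tag the parser encounters."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)


def extract_links(html):
    """Return the hrefs present in the markup itself (no script is run)."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# The parser treats the script body as plain text, so the link that the
# script would insert into <tbody> is never seen.
print(extract_links(STATIC_HTML))  # ['/help']
```

Only the link written directly in the markup is found. The one the script would add to the tbody does not exist until a browser runs that script.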

You can replicate this behaviour by turning JavaScript off in your browser and visiting the website. If you scrape frequently, you may wish to install a browser plugin that lets you turn JavaScript on and off quickly.

To scrape websites that load HTML content with JavaScript, you may wish to explore browser automation options such as selenium.
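A minimal sketch of the selenium route in Python 3 might look like the following. It is not a tested solution for this particular site. It assumes selenium 4 and a local Chrome install, and it assumes the wanted links match the CSS selector "tbody a", which is a guess based on the question. fetch_rendered_html, TbodyLinkCollector and links_in_tbody are hypothetical helper names. The rendered page is parsed with the standard library's html.parser, though BeautifulSoup would work just as well on driver.page_source.

```python
from html.parser import HTMLParser


def fetch_rendered_html(url, wait_css="tbody a", timeout=15):
    """Load url in headless Chrome and return the DOM after JavaScript ran."""
    # Imported here so the parsing helper below works without selenium.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Block until at least one matching link exists (assumed selector).
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, wait_css))
        )
        return driver.page_source
    finally:
        driver.quit()


class TbodyLinkCollector(HTMLParser):
    """Collect hrefs of <a> tags that appear inside a <tbody>."""

    def __init__(self):
        super().__init__()
        self.depth = 0  # how many <tbody> elements are currently open
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "tbody":
            self.depth += 1
        elif tag == "a" and self.depth > 0:
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "tbody" and self.depth > 0:
            self.depth -= 1


def links_in_tbody(html):
    """Return the hrefs of all links located inside <tbody> elements."""
    parser = TbodyLinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Usage (needs selenium and Chrome; url is the address from the question):
# html = fetch_rendered_html(url)
# for link in links_in_tbody(html):
#     print(link)
```

Waiting for the selector matters here: reading page_source immediately after get() can still return the page before the script has filled in the table.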


 