
Python – Extract certain links from website

I want to extract certain links from a website.

To extract all links, I tried:

import urllib.request
from bs4 import BeautifulSoup

url = 'http://pdok.bundestag.de/index.php?qsafe=&aload=off&q=kleine+anfrage&x=0&y=0&df=22.10.2013&dt=13.01.2016'
# Fetch the raw HTML; no JavaScript is executed at this step.
with urllib.request.urlopen(url) as uh:
    data = uh.read()
soup = BeautifulSoup(data, 'html.parser')

# Print every <a> tag found in the downloaded HTML.
for href in soup.find_all('a'):
    print(href)

Now I get a list of links, but for some reason I don't get the important links that are in tbody. I also tried using ElementTree, but I get an error just reading the link, apparently because the page contains some invalid symbols (?). Any help is much appreciated! :)

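urllib only downloads the HTML that the server sends. It does not execute JavaScript, so you get the page as it looks with JavaScript turned off. The links you are trying to scrape in the tbody are inserted by JavaScript, so they never appear in the HTML that urllib returns.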
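To illustrate the point, here is a minimal sketch using only the standard library. The page in `STATIC_HTML` is made up for this example (it is not the real Bundestag markup), and `LinkCollector` is a hypothetical helper name. The parser only sees the link that is present in the served HTML; the one the script would add in a browser is never found.

```python
from html.parser import HTMLParser

# Hypothetical page: the <tbody> is empty in the served HTML and would only
# be filled in by the script once a browser actually runs it.
STATIC_HTML = """
<html><body>
<a href="/help">Help</a>
<table><tbody id="results"></tbody></table>
<script>
  var a = document.createElement("a");
  a.href = "/doc/1.pdf";
  document.getElementById("results").appendChild(a);
</script>
</body></html>
"""


class LinkCollector(HTMLParser):
    """Collect the href of every <a> start tag the parser sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)


collector = LinkCollector()
collector.feed(STATIC_HTML)
print(collector.links)  # → ['/help']
```

The `/doc/1.pdf` link exists only after the script has run, which a plain HTTP download never does.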

You can replicate this behaviour by turning JavaScript off in your browser and visiting the website. If you scrape frequently, you may wish to download a browser plugin which allows you to turn JavaScript on and off quickly.

To scrape websites that load HTML content with JavaScript, you may wish to explore browser automation options such as Selenium.
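A rough sketch of what that could look like with Selenium 4 is below. It assumes Selenium is installed and that Firefox with a matching driver is available. The CSS selector `tbody a` is a guess based on the question, and `clean_hrefs` and `fetch_tbody_links` are hypothetical helper names, so treat this as a starting point rather than a tested solution for this particular site.

```python
def clean_hrefs(hrefs):
    """Drop empty values and duplicates while keeping the original order."""
    seen = set()
    result = []
    for href in hrefs:
        if href and href not in seen:
            seen.add(href)
            result.append(href)
    return result


def fetch_tbody_links(url, timeout=10):
    """Load url in a real browser and return the links found inside <tbody>."""
    # Imported here so clean_hrefs stays usable without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Firefox()  # assumption: Firefox and its driver are set up
    try:
        driver.get(url)
        # Wait until JavaScript has inserted at least one link into a table body.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "tbody a"))
        )
        anchors = driver.find_elements(By.CSS_SELECTOR, "tbody a")
        return clean_hrefs(a.get_attribute("href") for a in anchors)
    finally:
        # Always close the browser, even if the wait times out.
        driver.quit()


# Example call (opens a browser window, so it is left commented out);
# `url` is the address from the question:
# for link in fetch_tbody_links(url):
#     print(link)
```

The explicit wait matters because `driver.get` can return before the script has finished filling the table.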
