How to web scrape a website with a table that has to load first?

Question

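I am trying to scrape a website with Python 2.7. The page contains a table that has to load first. When I scrape it, all I get is "Loading" or "Sorry, we don't have any information about it", because the table has not loaded yet.

I have read several articles and tried their code, but nothing worked.

My code: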

    import urllib2, sys
    from BeautifulSoup import BeautifulSoup
    import json

    site = "https://www.flightradar24.com/data/airports/bud/arrivals"
    hdr = {'User-Agent': 'Mozilla/5.0'}
    req = urllib2.Request(site, headers=hdr)
    page = urllib2.urlopen(req)
    soup = BeautifulSoup(page)

    nev = soup.find('h1', attrs={'class': 'airport-name'})
    print nev

    table = soup.find('div', {"class": "row cnt-schedule-table"})
    print table

    import urllib2
    from bs4 import BeautifulSoup
    import json

    # new url
    url = 'https://www.flightradar24.com/data/airports/bud/arrivals'

    # read all data
    page = urllib2.urlopen(url).read()

    # convert json text to python dictionary
    data = json.loads(page)

    print(data['row cnt-schedule-table'])

Answer 1

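I ran into this problem too. You can use the Python selenium package. We need to wait for the table to load, so I used time.sleep(), but that is not the proper way to do it. You can use the wait.until("element") method instead. Please find sample code below.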

    from bs4 import BeautifulSoup
    from selenium import webdriver
    import time

    # Ask the site for English pages
    profile = webdriver.FirefoxProfile()
    profile.set_preference("intl.accept_languages", "en-us")
    driver = webdriver.Firefox(firefox_profile=profile)

    driver.get("https://www.flightradar24.com/data/airports/bud/arrivals")
    # Crude fixed pause so the table has time to load
    time.sleep(10)

    # Parse the HTML as rendered by the browser
    html_source = driver.page_source
    soup = BeautifulSoup(html_source, "html.parser")
    print(soup)
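
The answer notes that a fixed time.sleep() is not the proper way to wait and that a wait-until approach is better. Selenium provides WebDriverWait(driver, timeout).until(...) for this. The block below is only a minimal standard-library sketch of the same polling idea, so it can run without a browser. The names wait_until, WaitTimeout, rendered_html and is_loaded are hypothetical helpers introduced for illustration and are not part of Selenium.

```python
import time


class WaitTimeout(Exception):
    """Raised when the condition never becomes truthy within the timeout."""


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises WaitTimeout once `timeout` seconds pass.
    """
    deadline = time.time() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.time() >= deadline:
            raise WaitTimeout("condition not met within %.1f s" % timeout)
        time.sleep(poll_interval)


def rendered_html(driver, url, is_loaded, timeout=10.0, poll_interval=0.5):
    """Open `url` and return driver.page_source once is_loaded(driver) is truthy.

    `driver` is assumed to be any object with a get(url) method and a
    page_source attribute, such as a Selenium WebDriver.
    """
    driver.get(url)
    wait_until(lambda: is_loaded(driver), timeout=timeout,
               poll_interval=poll_interval)
    return driver.page_source
```

With a real Selenium driver, is_loaded could be something like `lambda d: d.find_elements(By.CSS_SELECTOR, "div.cnt-schedule-table")`, which returns an empty (falsy) list until the element exists. The selector is taken from the class name in the question and is an assumption about the rendered page. Unlike a fixed sleep, this returns as soon as the table appears and fails loudly if it never does.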

Reference link:

Selenium waitForElement


Solution 1
1 Accepted 2017-07-25 07:15:36
