What is wrong with my web scraper code (python3.4)

I am trying to scrape a table from a website. The script runs, but nothing is written to my output file. Where am I going wrong?

Code:

from bs4 import BeautifulSoup

import urllib.request

f = open('nbapro.txt','w')
errorFile = open('nbaerror.txt','w')

page = urllib.request.urlopen('http://www.numberfire.com/nba/fantasy/full-fantasy-basketball-projections')

content = page.read()
soup =  BeautifulSoup(content)

tableStats = soup.find('table', {'class': 'data-table xsmall'})
for row in tableStats.findAll('tr')[2:]:
 col = row.findAll('td')

 try: 
    name = col[0].a.string.strip()
    f.write(name+'\n')
 except Exception as e:
    errorFile.write (str(e) + '******'+ str(col) + '\n')
    pass

f.close()
errorFile.close()

The problem is that the table data you are trying to scrape is filled in by JavaScript running in the browser. urllib is not a browser, so it cannot execute JavaScript, and the HTML it downloads never has that data filled in.
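One way to check this diagnosis is to count the data cells that actually appear in the static HTML. The following is a minimal sketch using only the standard library. SAMPLE_HTML is a made-up stand-in for the downloaded page (the real numberFire markup may differ), and TableCellCounter and count_table_cells are hypothetical helper names.

```python
from html.parser import HTMLParser

# Hypothetical stand-in for the raw HTML that urllib would download:
# the table shell is present, but its rows are filled in later by a script.
SAMPLE_HTML = """
<html><body>
<table class="data-table xsmall">
  <thead><tr><th>Player</th></tr></thead>
  <tbody></tbody>
</table>
<script>var NF_DATA = {"players": {"1": {"name": "Sam", "last_name": "Sample"}}};</script>
</body></html>
"""


class TableCellCounter(HTMLParser):
    """Counts <td> cells that appear inside any <table> element."""

    def __init__(self):
        super().__init__()
        self.table_depth = 0
        self.cell_count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.table_depth += 1
        elif tag == "td" and self.table_depth > 0:
            self.cell_count += 1

    def handle_endtag(self, tag):
        if tag == "table" and self.table_depth > 0:
            self.table_depth -= 1


def count_table_cells(html):
    """Return the number of <td> cells found inside tables in `html`."""
    counter = TableCellCounter()
    counter.feed(html)
    counter.close()
    return counter.cell_count


print(count_table_cells(SAMPLE_HTML))
```

If the count for the real page is zero, the loop in the question has no data rows to iterate over, which would explain why it finishes without writing anything and without raising an error.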

If you want to solve it with urllib and BeautifulSoup, you have to extract the JSON object from the script tag and load it with json.loads(). Here is an example that prints the player names:

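import json
import re
import urllib.request

from bs4 import BeautifulSoup

URL = 'http://www.numberfire.com/nba/fantasy/full-fantasy-basketball-projections'

# Name the parser explicitly so bs4 does not have to guess one.
soup = BeautifulSoup(urllib.request.urlopen(URL), 'html.parser')

# Find the <script> tag whose contents mention NF_DATA.
script = soup.find('script', string=lambda x: x and 'NF_DATA' in x).string
# Capture everything between "NF_DATA = " and the first semicolon after it.
data = re.search(r'NF_DATA = (.*?);', script).group(1)
data = json.loads(data)

# data['players'] maps player ids to dicts holding the name fields.
for player_id, player in data['players'].items():
    print(player['name'] + ' ' + player['last_name'])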
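The non-greedy pattern above stops at the first semicolon, so it would cut the JSON short if any string value happened to contain one. A possible alternative, sketched here under assumptions, is to let json.JSONDecoder().raw_decode() parse one JSON value starting right after the marker and ignore whatever follows. SAMPLE_HTML is an invented fragment (not the real page), and extract_json_after is a hypothetical helper name.

```python
import json
import re

# Hypothetical page fragment; the real numberFire markup may differ.
# Note the semicolon inside the last_name string.
SAMPLE_HTML = """
<script>
var NF_DATA = {"players": {"7": {"name": "Sam", "last_name": "Sample; Jr."}}};
var OTHER = 1;
</script>
"""


def extract_json_after(marker, text):
    """Return the JSON value that immediately follows `marker` in `text`."""
    # Match the marker plus any whitespace, because raw_decode expects the
    # JSON value to begin exactly at the index it is given.
    match = re.search(re.escape(marker) + r"\s*", text)
    if match is None:
        raise ValueError("marker not found: " + marker)
    # raw_decode parses a single JSON value starting at the given index
    # and returns it together with the index where that value ended.
    value, _end = json.JSONDecoder().raw_decode(text, match.end())
    return value


data = extract_json_after("NF_DATA =", SAMPLE_HTML)
for player_id, player in data["players"].items():
    print(player["name"] + " " + player["last_name"])
```

With real input, the text passed in would be the contents of the script tag found by BeautifulSoup rather than a hard-coded string.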
