Python Web Scraping Question

Question

I am using Python to scrape AAPL's stock price from Yahoo Finance, but the program always returns []. I would appreciate it if someone could point out why the program is not working. Here is my code:

import urllib
import re
htmlfile=urllib.urlopen("https://ca.finance.yahoo.com/q?s=AAPL&ql=0")
htmltext=htmlfile.read()
regex='<span id=\"yfs_l84_aapl\" class="">(.+?)</span>'
pattern=re.compile(regex)
price=re.findall(pattern,htmltext)
print price

The original source looks like this:

<span id="yfs_l84_aapl" class>112.31</span>

Here I just want the price, 112.31. When I copy and paste the source, 'class' changes to 'class=""'. I also tried this code:

regex='<span id=\"yfs_l84_aapl\" class="">(.+?)</span>'

But it does not work either.

Answer 1

Well, the good news is that you are getting the data. You were nearly there. I would recommend that you work out your regex problems in a tool that helps, e.g. regex101.

Anyway, here is your working regex:

regex='<span id="yfs_l84_aapl">(\d*\.\d\d)'

You are collecting only digits, so don't use a general catch-all; be specific where you can. This pattern matches multiple digits, a literal decimal point, and two more digits.
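To try the pattern without hitting the network, here is a minimal sketch that runs it against a hard-coded snippet. The sample string and the variable names are assumptions for illustration, and the live Yahoo page may not look like this. It is written for Python 3.

```python
import re

# Sample markup based on the span shown in this thread; the real page may differ (assumption).
sample_html = '<span id="yfs_l84_aapl">112.31</span>'

# Digits, a literal decimal point, then exactly two more digits.
pattern = re.compile(r'<span id="yfs_l84_aapl">(\d*\.\d\d)')

prices = pattern.findall(sample_html)
print(prices)
```

Using a raw string (the r prefix) keeps the backslashes in \d and \. from being interpreted by Python before the regex engine sees them.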

Answer 2

When I went to the Yahoo site you provided, I saw a span tag without a class attribute.

<span id="yfs_l84_aapl">112.31</span>

Not sure what you are trying to do with "class". Without it, I get 112.31:

import urllib
import re
htmlfile=urllib.urlopen("https://ca.finance.yahoo.com/q?s=AAPL&ql=0")
htmltext=htmlfile.read()
regex='<span id=\"yfs_l84_aapl\">(.+?)</span>'
pattern=re.compile(regex)
price=re.findall(pattern,htmltext)
print price
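Since the question saw the tag as 'class' in one place and 'class=""' in another, one possible way to avoid depending on the exact attributes is to let the pattern skip anything between the id and the closing '>'. The sketch below is an assumption-laden illustration, not what the page is guaranteed to serve: the three sample strings are hypothetical variants of the tag, and it is written for Python 3.

```python
import re

# Hypothetical variants of the same tag: no class, bare class, empty class.
variants = [
    '<span id="yfs_l84_aapl">112.31</span>',
    '<span id="yfs_l84_aapl" class>112.31</span>',
    '<span id="yfs_l84_aapl" class="">112.31</span>',
]

# [^>]* skips any extra attributes before the '>' that ends the opening tag.
pattern = re.compile(r'<span id="yfs_l84_aapl"[^>]*>(.+?)</span>')

results = [pattern.findall(html) for html in variants]
print(results)
```

All three variants yield the same captured text, so the pattern no longer cares how the class attribute is written.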

Answer 3

I am using BeautifulSoup to get the text from the span tag:

import urllib
from BeautifulSoup import BeautifulSoup

response =urllib.urlopen("https://ca.finance.yahoo.com/q?s=AAPL&ql=0")
html = response.read()
soup = BeautifulSoup(html)
# find all the spans that have id = 'yfs_l84_aapl'
target = soup.findAll('span',{'id':"yfs_l84_aapl"})
# target is a list 
print(target[0].string)
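If installing BeautifulSoup is not an option, the same idea (parse the HTML, then pick the span by id) can be sketched with the standard library's html.parser. This is a swapped-in technique, not the BeautifulSoup code above; the class name SpanTextById and the sample string are hypothetical, and the sketch is written for Python 3 and does not handle nested spans.

```python
from html.parser import HTMLParser


class SpanTextById(HTMLParser):
    """Collect the text inside <span> tags whose id equals target_id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.inside = False  # True while we are inside a matching span
        self.texts = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; a bare attribute has value None.
        if tag == "span" and dict(attrs).get("id") == self.target_id:
            self.inside = True
            self.texts.append("")

    def handle_endtag(self, tag):
        if tag == "span":
            self.inside = False

    def handle_data(self, data):
        if self.inside:
            self.texts[-1] += data


# Sample markup for illustration only; the real page may differ (assumption).
sample_html = '<div><span id="yfs_l84_aapl" class>112.31</span></div>'
parser = SpanTextById("yfs_l84_aapl")
parser.feed(sample_html)
parser.close()
print(parser.texts)
```

Like the findAll call above, this returns a list, so the first match is parser.texts[0].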

3 answers

Answer 1: score 5, accepted, 2015-09-09 00:51:38
Answer 2: score 2, 2015-09-09 00:58:47
Answer 3: score 1, 2015-09-09 01:33:35
