
Python Web Scraping Problems

I am using Python to scrape AAPL's stock price from Yahoo Finance, but the program always returns []. I would appreciate it if someone could point out why the program is not working. Here is my code:

import urllib
import re
htmlfile=urllib.urlopen("https://ca.finance.yahoo.com/q?s=AAPL&ql=0")
htmltext=htmlfile.read()
regex='<span id=\"yfs_l84_aapl\" class="">(.+?)</span>'
pattern=re.compile(regex)
price=re.findall(pattern,htmltext)
print price

The original page source looks like this:

<span id="yfs_l84_aapl" class>112.31</span>

Here I just want the price, 112.31. When I copy and paste the source, 'class' changes to 'class=""'. I also tried this code:

regex='<span id=\"yfs_l84_aapl\" class="">(.+?)</span>'

But it does not work either.

Well, the good news is that you are getting the data. You were nearly there. I would recommend that you work out your regex problems in a tool that helps, e.g. regex101.

Anyway, here is your working regex:

regex=r'<span id="yfs_l84_aapl">(\d*\.\d\d)'

You are collecting only digits, so don't use a catch-all capture; be specific where you can. This pattern matches any number of digits, then a literal decimal point, then exactly two more digits.

When I went to the Yahoo page you provided, I saw a span tag without a class attribute.
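To illustrate the difference between a catch-all capture and a specific one, here is a minimal offline sketch. It assumes Python 3 and runs against a hard-coded copy of the span rather than the live page; the `noisy` string is a hypothetical example of non-numeric content, not something taken from Yahoo.

```python
import re

# Sample markup copied from this thread; the live page may differ.
html = '<span id="yfs_l84_aapl">112.31</span>'

# Catch-all capture: any characters (non-greedy) up to the closing tag.
general = re.compile(r'<span id="yfs_l84_aapl">(.+?)</span>')
# Specific capture: digits, a literal dot, then exactly two digits.
specific = re.compile(r'<span id="yfs_l84_aapl">(\d*\.\d\d)')

general_matches = general.findall(html)
specific_matches = specific.findall(html)
print(general_matches, specific_matches)

# Hypothetical markup where the span holds non-numeric text.
noisy = '<span id="yfs_l84_aapl">N/A</span>'
noisy_general = general.findall(noisy)    # the catch-all accepts it
noisy_specific = specific.findall(noisy)  # the specific pattern rejects it
print(noisy_general, noisy_specific)
```

The specific pattern fails loudly (an empty list) when the span does not contain a price, which is usually easier to handle than silently capturing unexpected text.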

<span id="yfs_l84_aapl">112.31</span>

I'm not sure what you are trying to do with "class". Without it, I get 112.31:

import urllib
import re
htmlfile=urllib.urlopen("https://ca.finance.yahoo.com/q?s=AAPL&ql=0")
htmltext=htmlfile.read()
regex='<span id=\"yfs_l84_aapl\">(.+?)</span>'
pattern=re.compile(regex)
price=re.findall(pattern,htmltext)
print price
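The reason the original pattern returns [] can be reproduced without any network access. This is a sketch assuming Python 3 and the markup quoted in this thread; the `tolerant` pattern is my own suggestion (not from the original posts) for accepting the span whether or not extra attributes are present.

```python
import re

# Markup as seen in the served page: the span has no class attribute.
served = '<span id="yfs_l84_aapl">112.31</span>'
# Markup as it appears after copying from browser dev tools.
copied = '<span id="yfs_l84_aapl" class="">112.31</span>'

with_class = re.compile(r'<span id="yfs_l84_aapl" class="">(.+?)</span>')
without_class = re.compile(r'<span id="yfs_l84_aapl">(.+?)</span>')
# Assumption: allow any further attributes before the closing '>'.
tolerant = re.compile(r'<span id="yfs_l84_aapl"[^>]*>(.+?)</span>')

strict_on_served = with_class.findall(served)     # no match: class="" is absent
plain_on_served = without_class.findall(served)   # matches the served markup
tolerant_on_served = tolerant.findall(served)
tolerant_on_copied = tolerant.findall(copied)
print(strict_on_served, plain_on_served, tolerant_on_served, tolerant_on_copied)
```

A regex must match the text the server actually sends, character for character, so it is worth printing the downloaded HTML and checking the tag there instead of relying on what the browser inspector shows.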

I am using BeautifulSoup to get the text from the span tag:

import urllib
from BeautifulSoup import BeautifulSoup

response =urllib.urlopen("https://ca.finance.yahoo.com/q?s=AAPL&ql=0")
html = response.read()
soup = BeautifulSoup(html)
# find all the spans that have id = 'yfs_l84_aapl'
target = soup.findAll('span',{'id':"yfs_l84_aapl"})
# target is a list; the first element is the span we want
print(target[0].string)
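If installing a third-party parser is not an option, the same idea (parse the HTML and look the span up by id instead of matching raw text) can be sketched with the standard library's `html.parser`. This is not BeautifulSoup; it swaps in a different parser, assumes Python 3, and uses a hard-coded sample string. The class name `SpanTextParser` is hypothetical.

```python
from html.parser import HTMLParser


class SpanTextParser(HTMLParser):
    """Collect the text inside the first <span> with a given id.

    Simplification: nested <span> tags inside the target are not handled.
    """

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.inside = False   # True while we are inside the target span
        self.done = False     # True once the target span has been closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "span" and not self.done and dict(attrs).get("id") == self.target_id:
            self.inside = True

    def handle_endtag(self, tag):
        if tag == "span" and self.inside:
            self.inside = False
            self.done = True

    def handle_data(self, data):
        if self.inside:
            self.chunks.append(data)


# Sample markup; extra attributes such as class="" do not affect the lookup.
html = '<div><span id="yfs_l84_aapl" class="">112.31</span></div>'
parser = SpanTextParser("yfs_l84_aapl")
parser.feed(html)
parser.close()
price_text = "".join(parser.chunks)
print(price_text)
```

Because the lookup is by attribute value and not by exact tag text, it keeps working whether or not the span carries a class attribute.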
