Web scraping with regular expressions

Question

I've hit a wall trying to work out why this code does not work, even though it's the same code as in the online tutorial Python Web Scraping Tutorial 5 (Network Requests). I also tried running the code in an online Python interpreter.

import urllib
import re

htmltext = urllib.urlopen("https://www.google.com/finance?q=AAPL")

regex = '<span id="ref_[^.]*_l">(.+?)</span>'
pattern = re.compile(regex)
results = re.findall(pattern,htmltext)
results

I get:

re.pyc in findall(pattern, string, flags)
175 
176     Empty matches are included in the result."""
--> 177     return _compile(pattern, flags).findall(string)
178 
179 if sys.hexversion >= 0x02020000:

TypeError: expected string or buffer

Expected result(s):

112.71

Any help is appreciated. I tried calling "read()" on the URL, but that didn't work. According to the documentation, even empty matches should be included in the result. Thanks

Answer 1

If you follow the tutorial until the end :)

% python2                                                                                                     
>>> import urllib
>>> data = urllib.urlopen('https://www.google.com/finance/getprices?q=AAPL&x=NASD&i=10&p=25m&f=c&auto=1').read()
>>> print data.split()[-1]
112.71

Never use regex to scrape web pages.

I simplified the code so that fetching the last array element is easier.
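To illustrate how split()[-1] picks out the latest value, here is a minimal offline sketch. The response body below is made up for illustration (the real layout of the getprices output is not shown in this thread), and last_price is a hypothetical helper name, not part of the tutorial.

```python
# Made-up response body; the header lines and values are illustrative only
# and are not taken from the real service.
sample_body = "COLUMNS=CLOSE\nDATA=\n112.65\n112.70\n112.71\n"


def last_price(body):
    """Return the last whitespace-separated token of the response body."""
    tokens = body.split()  # splits on any run of spaces or newlines
    if not tokens:
        raise ValueError("empty response body")
    return tokens[-1]


print(last_price(sample_body))
```

Because split() with no arguments discards the trailing newline, the last token is the final value line rather than an empty string.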

Answer 2

The problem is that you are not actually reading the HTML from the request.

htmltext = urllib.urlopen("https://www.google.com/finance?q=AAPL").read()
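The difference can be reproduced without a network connection. The sketch below is only an illustration under stated assumptions: FakeResponse is a hypothetical stand-in for the file-like object that urlopen returns, and the sample markup (including the id ref_22144_l) is invented for the example.

```python
import re


class FakeResponse(object):
    """Hypothetical stand-in for the object returned by urlopen()."""

    def __init__(self, body):
        self._body = body

    def read(self):
        # Like a real response, the body is only available through read().
        return self._body


# Invented sample markup; the span id is an assumption for illustration.
response = FakeResponse(b'<div><span id="ref_22144_l">112.71</span></div>')

pattern = re.compile(r'<span id="ref_[^.]*_l">(.+?)</span>')

# Passing the response object itself reproduces the TypeError from the question.
try:
    pattern.findall(response)
    raised_type_error = False
except TypeError:
    raised_type_error = True

# Reading (and decoding) the body first gives findall an actual string.
htmltext = response.read().decode("utf-8")
results = pattern.findall(htmltext)

print(raised_type_error, results)
```

The regex functions accept only strings (or byte buffers), so the response object must be turned into text with read() before matching.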

Answer 1
1 2016-09-24 12:11:31

Answer 2
0 2016-09-24 09:56:23
