Web scraping using regex
I'm running into a wall trying to understand why this code does not work, even though it's the same code as in an online tutorial, Python Web Scraping Tutorial 5 (Network Requests). I also tried running the code in an online Python interpreter.
import urllib
import re
htmltext = urllib.urlopen("https://www.google.com/finance?q=AAPL")
regex = '<span id="ref_[^.]*_l">(.+?)</span>'
pattern = re.compile(regex)
results = re.findall(pattern,htmltext)
results
I get:
re.pyc in findall(pattern, string, flags)
175
176 Empty matches are included in the result."""
--> 177 return _compile(pattern, flags).findall(string)
178
179 if sys.hexversion >= 0x02020000:
TypeError: expected string or buffer
Expected result(s):
112.71
Help appreciated. I tried using "read()" on the URL, but that didn't work. According to the documentation, even empty results should be included. Thanks
If you follow the tutorial until the end :) :
% python2
>>> import urllib
>>> data = urllib.urlopen('https://www.google.com/finance/getprices?q=AAPL&x=NASD&i=10&p=25m&f=c&auto=1').read()
>>> print data.split()[-1]
112.71
Never use regex to web scrape.
I simplified the code that fetches the last array element.
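The problem is that you are not actually reading the HTML from the request. urlopen returns a file-like response object, not a string, which is why re.findall raises "expected string or buffer". Call read() on it: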
htmltext = urllib.urlopen("https://www.google.com/finance?q=AAPL").read()
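Once htmltext is a plain string, the pattern from the question can be applied to it. Here is a minimal sketch of that step, run against a made-up HTML snippet rather than the live page, since I can't vouch for what Google Finance returns today. The function name extract_prices and the sample markup are my own, purely for illustration. Note that on Python 3, urlopen lives in urllib.request and read() returns bytes, so you would need to decode the result before matching it with a str pattern.

```python
import re

# Same pattern as in the question, compiled once up front.
PRICE_PATTERN = re.compile(r'<span id="ref_[^.]*_l">(.+?)</span>')


def extract_prices(html_text):
    """Return every captured price from an HTML string.

    html_text must be a string (i.e. the result of .read()),
    not the response object returned by urlopen.
    """
    return PRICE_PATTERN.findall(html_text)


# Hypothetical markup, invented for this example only.
sample_html = '<div><span id="ref_22144_l">112.71</span></div>'

print(extract_prices(sample_html))
```

Passing the response object itself instead of a string to extract_prices reproduces the TypeError from the question.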