
Web scraping using regex

I can't figure out why this code doesn't work, even though it's the same code as in the online tutorial Python Web Scraping Tutorial 5 (Network Requests). I also tried running the code in an online Python interpreter.

import urllib
import re

htmltext = urllib.urlopen("https://www.google.com/finance?q=AAPL")

regex = '<span id="ref_[^.]*_l">(.+?)</span>'
pattern = re.compile(regex)
results = re.findall(pattern,htmltext)
results

I get:

re.pyc in findall(pattern, string, flags)
175 
176     Empty matches are included in the result."""
--> 177     return _compile(pattern, flags).findall(string)
178 
179 if sys.hexversion >= 0x02020000:

TypeError: expected string or buffer 

Expected result(s):

112.71

Help appreciated. I tried calling read() on the URL, but that didn't work. According to the documentation, even empty matches should be included in the result. Thanks

If you follow the tutorial to the end :) :

% python2
>>> import urllib
>>> data = urllib.urlopen('https://www.google.com/finance/getprices?q=AAPL&x=NASD&i=10&p=25m&f=c&auto=1').read()
>>> print data.split()[-1]
112.71

Never use regex to web scrape

I simplified the code that fetches the last array element.
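
If you do need to pull a value out of HTML, the advice above points toward an HTML parser rather than a regex. Here is a minimal sketch using the standard library's html.parser (Python 3; in Python 2 the module is called HTMLParser). The sample markup and the RefSpanParser class are made up for illustration, since the real Google Finance page is not shown in this thread, and the sketch does not handle nested spans.

```python
from html.parser import HTMLParser


class RefSpanParser(HTMLParser):
    """Collect the text of <span> elements whose id looks like ref_..._l."""

    def __init__(self):
        super().__init__()
        self._inside = False  # True while we are inside a matching <span>
        self.values = []

    def handle_starttag(self, tag, attrs):
        span_id = dict(attrs).get("id") or ""
        if tag == "span" and span_id.startswith("ref_") and span_id.endswith("_l"):
            self._inside = True
            self.values.append("")

    def handle_data(self, data):
        if self._inside:
            self.values[-1] += data

    def handle_endtag(self, tag):
        if tag == "span":
            self._inside = False


# Hypothetical sample markup, not the real page.
SAMPLE_HTML = '<div><span id="ref_22144_l">112.71</span><span id="other">x</span></div>'

parser = RefSpanParser()
parser.feed(SAMPLE_HTML)
parser.close()
print(parser.values)
```

The parser only looks at tag names and attributes, so it keeps working if the page changes attribute order or whitespace, which is where a regex usually breaks.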

The problem is that you are not actually reading the HTML from the request.

htmltext = urllib.urlopen("https://www.google.com/finance?q=AAPL").read()
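
To see why the missing read() produces the TypeError, here is a small sketch that needs no network access. It is written for Python 3, and io.BytesIO stands in for the file-like object that urlopen() returns; the sample markup is invented for illustration. The regex is the one from the question.

```python
import io
import re

# Hypothetical sample markup, not the real page.
SAMPLE_HTML = b'<div><span id="ref_22144_l">112.71</span></div>'

pattern = re.compile(r'<span id="ref_[^.]*_l">(.+?)</span>')

# A file-like object, standing in for the response returned by urlopen().
response = io.BytesIO(SAMPLE_HTML)

error = None
try:
    # Passing the response object itself: re expects a string, not a file.
    re.findall(pattern, response)
except TypeError as exc:
    error = str(exc)
print("TypeError:", error)

# Reading the body first gives re.findall the text it needs.
html = response.read().decode("utf-8")
results = re.findall(pattern, html)
print(results)
```

In Python 3 the body comes back as bytes, so it has to be decoded (or matched with a bytes pattern); in Python 2, as in the question, read() already returns a str.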


 