Web-Scraping with Python on Canopy

I'm having trouble with this code, which is supposed to print the stock prices of the four companies listed. It runs without errors, but it prints empty brackets where the stock prices should be, and I can't see why.

import urllib2
import re

symbolslist = ["aapl","spy","goog","nflx"]
i = 0

while i<len(symbolslist):
    url = "http://money.cnn.com/quote/quote.html?symb=' +symbolslist[i] + '"
    htmlfile = urllib2.urlopen(url)
    htmltext = htmlfile.read()
    regex = '<span stream='+symbolslist[i]+' streamformat="ToHundredth" streamfeed="SunGard">(.+?)</span>'
    pattern = re.compile(regex)
    price = re.findall(pattern,htmltext)
    print "the price of", symbolslist[i], " is ", price
    i+=1

Because you never pass the variable into the URL:

 url = "http://money.cnn.com/quote/quote.html?symb=' +symbolslist[i] + '"
                                                         ^^^^^
                                                      a string not the list element

Use str.format:

url = "http://money.cnn.com/quote/quote.html?symb={}".format(symbolslist[i])
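
To make the difference concrete, here is a minimal sketch that builds both versions of the URL without making any network request. It reuses the question's symbolslist; the names broken_url and fixed_url are made up for illustration.

```python
symbolslist = ["aapl", "spy", "goog", "nflx"]
i = 0

# Question's version: the single quotes sit inside a double-quoted string,
# so "+symbolslist[i] +" is literal text rather than concatenation.
broken_url = "http://money.cnn.com/quote/quote.html?symb=' +symbolslist[i] + '"

# Fixed version: str.format substitutes the list element into the {} placeholder.
fixed_url = "http://money.cnn.com/quote/quote.html?symb={}".format(symbolslist[i])

print(broken_url)
print(fixed_url)
```

The first URL still contains the text symbolslist[i], so every iteration requests the same meaningless symbol; the second ends in symb=aapl.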

You can also iterate directly over the list, so the while loop is unnecessary. Don't parse HTML with a regex; use an HTML parser such as bs4. Your regex is also wrong: the page has no stream="aapl" attribute (or the equivalent for the other symbols). What you want is the span where streamformat="ToHundredth" and streamfeed="SunGard":

import urllib2
from bs4 import BeautifulSoup

symbolslist = ["aapl","spy","goog","nflx"]


for symbol in symbolslist:
    url = "http://money.cnn.com/quote/quote.html?symb={}".format(symbol)
    htmlfile = urllib2.urlopen(url)
    soup = BeautifulSoup(htmlfile.read(), "html.parser")
    price = soup.find("span",streamformat="ToHundredth", streamfeed="SunGard").text
    print "the price of {} is {}".format(symbol,price)

Running the code in an IPython session gives:

In [1]: import urllib2

In [2]: from bs4 import BeautifulSoup

In [3]: symbols_list = ["aapl", "spy", "goog", "nflx"]

In [4]: for symbol in symbols_list:
   ...:         url = "http://money.cnn.com/quote/quote.html?symb={}".format(symbol)
   ...:         htmlfile = urllib2.urlopen(url)
   ...:         soup = BeautifulSoup(htmlfile.read(), "html.parser")
   ...:         price = soup.find("span",streamformat="ToHundredth", streamfeed="SunGard").text
   ...:         print "the price of {} is {}".format(symbol,price)
   ...:     
the price of aapl is 115.57
the price of spy is 215.28
the price of goog is 771.76
the price of nflx is 97.34

This prints the prices you wanted.
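
If you want to check the matching logic without hitting the site, the following is a rough offline sketch. It is written for Python 3 and swaps in the standard library's html.parser instead of bs4 so that it needs no extra install. The sample markup is hypothetical: only the streamformat and streamfeed attributes come from the answer above, while the stream value and the surrounding tag are invented. PriceParser and extract_price are illustrative names, not part of any library.

```python
import re
from html.parser import HTMLParser

# Hypothetical markup: the "stream" value and the <td> wrapper are invented.
SAMPLE_HTML = (
    '<td><span stream="last_123" streamformat="ToHundredth" '
    'streamfeed="SunGard">115.57</span></td>'
)


class PriceParser(HTMLParser):
    """Collects the text of the first span carrying both target attributes."""

    def __init__(self):
        super().__init__()
        self.in_target = False
        self.price = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if (tag == "span" and self.price is None
                and attrs.get("streamformat") == "ToHundredth"
                and attrs.get("streamfeed") == "SunGard"):
            self.in_target = True

    def handle_data(self, data):
        # Only text directly inside the matching span is recorded.
        if self.in_target:
            self.price = data.strip()
            self.in_target = False

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_target = False


def extract_price(html):
    """Return the price text, or None if no matching span exists."""
    parser = PriceParser()
    parser.feed(html)
    return parser.price


# The question's pattern expects an unquoted stream=aapl attribute,
# which does not occur in the sample, so findall returns an empty list.
question_regex = ('<span stream=' + "aapl" +
                  ' streamformat="ToHundredth" streamfeed="SunGard">(.+?)</span>')
print(re.findall(question_regex, SAMPLE_HTML))
print(extract_price(SAMPLE_HTML))
```

On this sample the regex from the question finds nothing, which is the same empty-brackets symptom described above, while matching on the two attributes picks out the price text.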
