Yahoo Finance: incomplete retrieval of index components

Question

I am trying to write a Python function to retrieve the list of components of an index. Say I want to look at the FTSE 100 (^FTSE): I'd like to get all of its components (100 of them), or even more information.

I can get more information about a component just by adding flags (see this).

However, given the index, I can only retrieve the first 51 components (the first page of this: http://finance.yahoo.com/q/cp?s=%5EFTSE&c=0 ).

My function is:

import urllib  # Python 2; on Python 3 urlopen lives in urllib.request

at = '%40'  # URL-encoded '@'
def getListComponents(symbol):
    url = 'http://finance.yahoo.com/d/quotes.csv?s=%s%s&c=1&f=s' % (at, symbol)
    return urllib.urlopen(url).read().strip().strip('"')

Output example: 
'AAL.L"\r\n"ABF.L"\r\n"ADM.L"\r\n"ADN.L"\r\n"AGK.L"\r\n"AMEC.L"\r\n"ANTO.L"\r\n"ARM.L"\r\n"AV.L"\r\n"AZN.L"\r\n"BA.L"\r\n"BAB.L"\r\n"BARC.L"\r\n"BATS.L"\r\n"BG.L"\r\n"BLND.L"\r\n"BLT.L"\r\n"BNZL.L"\r\n"BP.L"\r\n"BRBY.L"\r\n"BSY.L"\r\n"BT-A.L"\r\n"CCL.L"\r\n"CNA.L"\r\n"CPG.L"\r\n"CPI.L"\r\n"CRDA.L"\r\n"CRH.L"\r\n"CSCG.L"\r\n"DGE.L"\r\n"ENRC.L"\r\n"EVR.L"\r\n"EXPN.L"\r\n"FRES.L"\r\n"GFS.L"\r\n"GKN.L"\r\n"GLEN.L"\r\n"GSK.L"\r\n"HL.L"\r\n"HMSO.L"\r\n"HSBA.L"\r\n"IAG.L"\r\n"IHG.L"\r\n"IMI.L"\r\n"IMT.L"\r\n"ITRK.L"\r\n"ITV.L"\r\n"JMAT.L"\r\n"KAZ.L"\r\n"KGF.L"\r\n"LAND.L'

This way, parsing the components' titles is very easy.
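As a small illustration of that parsing step, the raw response can be split into a list of tickers. This is only a sketch in Python 3: `parse_components` is a hypothetical helper name, and the sample string is a shortened copy of the output shown above.

```python
def parse_components(raw):
    """Split the quoted, CRLF-separated response into a list of tickers."""
    # Each line looks like "AAL.L"; drop surrounding whitespace and quotes.
    return [line.strip().strip('"') for line in raw.splitlines() if line.strip()]


# Shortened sample in the same shape as the CSV response.
sample = '"AAL.L"\r\n"ABF.L"\r\n"ADM.L"\r\n'
print(parse_components(sample))
```

The helper also accepts the string returned by the function above, where the outermost quotes have already been stripped.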

How can I get the remaining 49 components? Bear in mind that the number of components not retrieved would be even larger if I were looking at the FTSE 250 or a bigger index.

THE NO ANSWER:

So I did some research, tried many combinations of flags, and found and read this thread of comments: code.google.com/p/yahoo-finance-managed/wiki/csvQuotesDownload. I concluded that it is not possible to download all the components of an index as CSV.

If you have or had the same problem, then just use BeautifulSoup. You may not like this approach, but there is no other way.

Solution to most of my problems

Answer 1

If you're doing it that way, there's a little link at the top of the table that says "Last", which gives you the last page number: http://finance.yahoo.com/q/cp?s=%5EFTSE&c=2 (from your example). Split that out to create a range(number) to loop over, and request the pages in the same way you're doing at the moment.

Open the initial page
Extract the link via lxml.html or BeautifulSoup
Parse out the last page number
Loop over the number of pages, retrieving each one
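The steps above can be sketched roughly as follows, in Python 3. To stay self-contained the sketch uses the standard library's `html.parser` instead of lxml.html or BeautifulSoup, and it runs against a made-up HTML fragment rather than the live page. `LastLinkFinder` and `page_urls` are hypothetical names, and the assumption that the link text is exactly "Last" and that the page number sits in the `c` query parameter comes only from the example URL above.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse, parse_qs


class LastLinkFinder(HTMLParser):
    """Record the href of the first <a> whose text is 'Last'."""

    def __init__(self):
        super().__init__()
        self._href = None      # href of the <a> currently open, if any
        self.last_href = None  # href of the "Last" link once found

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self._href = dict(attrs).get('href')

    def handle_data(self, data):
        if self._href and self.last_href is None and data.strip().lower() == 'last':
            self.last_href = self._href

    def handle_endtag(self, tag):
        if tag == 'a':
            self._href = None


def page_urls(first_page_html, symbol):
    """Return one URL per page, from c=0 up to the page the 'Last' link points at."""
    finder = LastLinkFinder()
    finder.feed(first_page_html)
    last = 0  # fall back to a single page when no "Last" link is found
    if finder.last_href:
        query = parse_qs(urlparse(finder.last_href).query)
        last = int(query.get('c', ['0'])[0])
    base = 'http://finance.yahoo.com/q/cp?s=%5E' + symbol
    return [base + '&c=' + str(page) for page in range(last + 1)]


# Hypothetical fragment standing in for the navigation links on the first page.
sample = ('<a href="/q/cp?s=%5EFTSE&amp;c=1">Next</a> | '
          '<a href="/q/cp?s=%5EFTSE&amp;c=2">Last</a>')
for url in page_urls(sample, 'FTSE'):
    print(url)
```

Each returned URL would then be fetched and parsed in turn, in the same way as the first page.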

On a side note, I'm pretty sure Yahoo! must have an API for some of this?

Answer 2

I am new to Python and finding my feet.

I was looking for a solution to the same problem, but ended up writing my own. My code is inefficient, lengthy and ugly, but it works and I will only use it rarely. I look forward to learning from someone wiser.

def getIndexComponents(symbol):
    # Retrieve the component list of an equity index
    # from Yahoo Finance, if available.
    import requests

    marker = '</a></b></td><td'  # HTML that immediately follows each ticker
    components = []

    # Request pages c=0 .. c=11; 12 pages is a fixed upper bound.
    for p in range(12):
        url = 'http://finance.yahoo.com/q/cp?s=%5E' + symbol
        if p > 0:
            url = url + '&c=' + str(p)
        # .text is a string on both Python 2 and 3 (.content is bytes on 3)
        text = requests.get(url).text

        # Grab the 10 characters before each marker...
        componentSubset = [text[n-10:n] for n in range(len(text)) if text.startswith(marker, n)]
        # ...and keep only what follows the first '>' in that window.
        for comp in componentSubset:
            components.append(comp[comp.index('>') + 1:])

    # A set drops any duplicates collected across pages.
    return set(components)

Seems to work:

getIndexComponents('FTSE')
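The extraction step can be tried offline on a sample string. The sketch below (Python 3) is a variant rather than the code above: instead of a fixed 10-character window it searches backwards for the nearest '>' before each marker. `extract_tickers` is a hypothetical name, and the sample rows are invented to match the shape the marker implies, not copied from the real page.

```python
MARKER = '</a></b></td><td'


def extract_tickers(text):
    """Return the text between the nearest preceding '>' and each MARKER."""
    tickers = []
    pos = text.find(MARKER)
    while pos != -1:
        # Character just after the '>' that closes the opening <a ...> tag.
        start = text.rfind('>', 0, pos) + 1
        tickers.append(text[start:pos])
        pos = text.find(MARKER, pos + 1)
    return tickers


# Hypothetical table rows; company names are placeholders.
sample = (
    '<tr><td><b><a href="/q?s=AAL.L">AAL.L</a></b></td><td>Company A</td></tr>'
    '<tr><td><b><a href="/q?s=BT-A.L">BT-A.L</a></b></td><td>Company B</td></tr>'
)
print(extract_tickers(sample))
```

Searching backwards avoids depending on how long a ticker is, at the cost of still relying on the exact markup around it.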


2 answers

Answer 1
5 2012-11-16 17:23:50

Answer 2
1 2016-01-04 14:47:37
