
Pandas read_html() missing columns

I am using the following read_html() call to read a table (behind a paywall):

import pandas as pd

df = pd.read_html('http://markets.ft.com/data/equities/tearsheet/' +
                  'financials?s=BAG:LSE&subView=BalanceSheet&periodType=a')[0]

It parses fine, except that the last two columns are missing. I am using a recent version of Anaconda (Python 3.5, pandas 0.18.1, html5lib, BeautifulSoup4).

The start of the output looks like this:

                Fiscal data as of Jan 30 2016  2016    2015    2014
                                      ASSETS   NaN     NaN     NaN
             Cash And Short Term Investments  6.80      25      13
                      Total Receivables, Net    50      49      45
                             Total Inventory    16      17      16

(The full output is too large to display.)

The start of the HTML looks like this:

<table class="mod-ui-table">
            <thead>
                <tr>
                    <th class="mod-ui-table__header--text">Fiscal data as of Jan 30 2016</th>
                    <th>2016</th>
                    <th class="mod-ui-hide-xsmall">2015</th>
                    <th class="mod-ui-hide-xsmall">2014</th>
                    <th class="mod-ui-hide-xsmall">2013</th>
                    <th class="mod-ui-hide-xsmall">2012</th>
                </tr>
            </thead>
            <tr class="mod-ui-table__row--section-header">
                <th colspan="6">ASSETS</th>
            </tr>
            <tr class="mod-ui-table__row--striped">
                <th class="mod-ui-table__header--row-label">Cash And Short Term Investments</th>
                <td>6.80</td>
                <td class="mod-ui-hide-xsmall">25</td>
                <td class="mod-ui-hide-xsmall">13</td>
                <td class="mod-ui-hide-xsmall">0.91</td>
                <td class="mod-ui-hide-xsmall">8.29</td>
            </tr>
            <tr>
                <th class="mod-ui-table__header--row-label">Total Receivables, Net</th>
                <td>50</td>
                <td class="mod-ui-hide-xsmall">49</td>
                <td class="mod-ui-hide-xsmall">45</td>
                <td class="mod-ui-hide-xsmall">42</td>
                <td class="mod-ui-hide-xsmall">37</td>
            </tr>

The end of the HTML looks like this:

<tr class="mod-ui-table__row--highlight">
                    <th class="mod-ui-table__header--row-label">Total liabilities &amp; shareholders&#39; equity</th>
                    <td>269</td>
                    <td class="mod-ui-hide-xsmall">255</td>
                    <td class="mod-ui-hide-xsmall">227</td>
                    <td class="mod-ui-hide-xsmall">215</td>
                    <td class="mod-ui-hide-xsmall">196</td>
                </tr>
                <tr class="mod-ui-table__row--striped">
                    <th class="mod-ui-table__header--row-label">Total common shares outstanding</th>
                    <td>117</td>
                    <td class="mod-ui-hide-xsmall">117</td>
                    <td class="mod-ui-hide-xsmall">117</td>
                    <td class="mod-ui-hide-xsmall">117</td>
                    <td class="mod-ui-hide-xsmall">117</td>
                </tr>
                <tr>
                    <th class="mod-ui-table__header--row-label">Treasury shares - common primary issue</th>
                    <td>0</td>
                    <td class="mod-ui-hide-xsmall">0</td>
                    <td class="mod-ui-hide-xsmall">0</td>
                    <td class="mod-ui-hide-xsmall">0</td>
                    <td class="mod-ui-hide-xsmall">--</td>
                </tr>
            </table>

If it's not immediately obvious what might be wrong, I'd be grateful for some hints on how to start stepping through the read_html() code to find the source of the problem. I am pretty much a novice at Python/pdb at the moment.

It turns out that if you are not logged into the FT website, you only get three years of data, so the page that read_html() fetches simply does not contain the last two columns.
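One way to confirm this kind of cause is to count the header cells in the HTML that the script itself receives, rather than in the page the browser shows. The following is a minimal sketch using only the standard library; `HeaderCellCounter` and `count_header_cells` are names made up for illustration, and the sample markup is the header from the question.

```python
from html.parser import HTMLParser


class HeaderCellCounter(HTMLParser):
    """Count the <th> cells inside the first <thead> of a document."""

    def __init__(self):
        super().__init__()
        self.in_thead = False
        self.done = False
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if tag == "thead":
            self.in_thead = True
        elif tag == "th" and self.in_thead:
            self.count += 1

    def handle_endtag(self, tag):
        # Stop counting once the first <thead> has been closed.
        if tag == "thead" and self.in_thead:
            self.in_thead = False
            self.done = True


def count_header_cells(html_text):
    """Return the number of header cells in the first <thead>."""
    parser = HeaderCellCounter()
    parser.feed(html_text)
    parser.close()
    return parser.count


# Header row as seen in a logged-in browser session (from the question).
sample = """
<table class="mod-ui-table">
  <thead>
    <tr>
      <th class="mod-ui-table__header--text">Fiscal data as of Jan 30 2016</th>
      <th>2016</th>
      <th class="mod-ui-hide-xsmall">2015</th>
      <th class="mod-ui-hide-xsmall">2014</th>
      <th class="mod-ui-hide-xsmall">2013</th>
      <th class="mod-ui-hide-xsmall">2012</th>
    </tr>
  </thead>
</table>
"""

print(count_header_cells(sample))
```

Running the same function on the text the script downloads, and comparing the result with len(df.columns), shows whether columns are lost in parsing or were never sent by the server.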

So I am now working out how to log into the FT website from the script (perhaps using Twill).
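As a rough illustration of the general approach (not the FT site's actual login flow, which I have not verified), a script can keep cookies across requests, post the credentials once, and then fetch the table page with the same cookie-aware opener. The login URL and the form field names below are hypothetical placeholders; `make_cookie_opener` and `build_login_request` are helper names invented for this sketch.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def make_cookie_opener():
    """Build an opener that stores and resends cookies between requests."""
    jar = http.cookiejar.CookieJar()
    return urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))


def build_login_request(login_url, username, password):
    """Build a POST request carrying form-encoded credentials.

    The field names 'username' and 'password' are assumptions; a real
    site may use different names or require extra hidden fields.
    """
    form = urllib.parse.urlencode({"username": username, "password": password})
    # Supplying data makes urllib send the request as a POST.
    return urllib.request.Request(login_url, data=form.encode("utf-8"))


# Intended use (not executed here; the login URL is a placeholder):
#
#   opener = make_cookie_opener()
#   opener.open(build_login_request("https://example.com/login", "me", "secret"))
#   html_text = opener.open(table_url).read().decode("utf-8")
#   df = pd.read_html(io.StringIO(html_text))[0]
```

Passing the downloaded text to read_html() instead of the URL means pandas parses the logged-in version of the page. Recent pandas versions expect literal HTML to be wrapped in io.StringIO; older ones such as 0.18.1 also accept the string directly.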

There is a related question here.
