Pandas read_html() missing columns

I am using the following read_html() call to read a table (behind a paywall):

import pandas as pd

df = pd.read_html('http://markets.ft.com/data/equities/tearsheet/' +
                  'financials?s=BAG:LSE&subView=BalanceSheet&periodType=a')[0]

It parses fine, except that the last two columns (2013 and 2012) are missing. I am using a recent version of Anaconda (Python 3.5, pandas 0.18.1, html5lib, BeautifulSoup4).

The start of the output looks like this:

                Fiscal data as of Jan 30 2016  2016    2015    2014
                                      ASSETS   NaN     NaN     NaN
             Cash And Short Term Investments  6.80      25      13
                      Total Receivables, Net    50      49      45
                             Total Inventory    16      17      16

(too large to display it all)

The start of the HTML looks like this:

<table class="mod-ui-table">
            <thead>
                <tr>
                    <th class="mod-ui-table__header--text">Fiscal data as of Jan 30 2016</th>
                    <th>2016</th>
                    <th class="mod-ui-hide-xsmall">2015</th>
                    <th class="mod-ui-hide-xsmall">2014</th>
                    <th class="mod-ui-hide-xsmall">2013</th>
                    <th class="mod-ui-hide-xsmall">2012</th>
                </tr>
            </thead>
            <tr class="mod-ui-table__row--section-header">
                <th colspan="6">ASSETS</th>
            </tr>
            <tr class="mod-ui-table__row--striped">
                <th class="mod-ui-table__header--row-label">Cash And Short Term Investments</th>
                <td>6.80</td>
                <td class="mod-ui-hide-xsmall">25</td>
                <td class="mod-ui-hide-xsmall">13</td>
                <td class="mod-ui-hide-xsmall">0.91</td>
                <td class="mod-ui-hide-xsmall">8.29</td>
            </tr>
            <tr>
                <th class="mod-ui-table__header--row-label">Total Receivables, Net</th>
                <td>50</td>
                <td class="mod-ui-hide-xsmall">49</td>
                <td class="mod-ui-hide-xsmall">45</td>
                <td class="mod-ui-hide-xsmall">42</td>
                <td class="mod-ui-hide-xsmall">37</td>
            </tr>

The end of the HTML looks like this:

<tr class="mod-ui-table__row--highlight">
                    <th class="mod-ui-table__header--row-label">Total liabilities &amp; shareholders&#39; equity</th>
                    <td>269</td>
                    <td class="mod-ui-hide-xsmall">255</td>
                    <td class="mod-ui-hide-xsmall">227</td>
                    <td class="mod-ui-hide-xsmall">215</td>
                    <td class="mod-ui-hide-xsmall">196</td>
                </tr>
                <tr class="mod-ui-table__row--striped">
                    <th class="mod-ui-table__header--row-label">Total common shares outstanding</th>
                    <td>117</td>
                    <td class="mod-ui-hide-xsmall">117</td>
                    <td class="mod-ui-hide-xsmall">117</td>
                    <td class="mod-ui-hide-xsmall">117</td>
                    <td class="mod-ui-hide-xsmall">117</td>
                </tr>
                <tr>
                    <th class="mod-ui-table__header--row-label">Treasury shares - common primary issue</th>
                    <td>0</td>
                    <td class="mod-ui-hide-xsmall">0</td>
                    <td class="mod-ui-hide-xsmall">0</td>
                    <td class="mod-ui-hide-xsmall">0</td>
                    <td class="mod-ui-hide-xsmall">--</td>
                </tr>
            </table>

If it's not immediately obvious what might be wrong, I'd be grateful for some hints on how to start stepping through the read_html() code to find the source of the problem. I am fairly new to Python and pdb at the moment.

It turns out that if you are not logged into the FT website, the server only returns three years of data, so read_html() is correctly parsing the HTML it receives.
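One way to confirm this kind of cause without stepping through read_html() is to count the header cells in the HTML that the script itself receives and compare that with what the browser shows. The following is a minimal sketch using only the standard library; the class and function names (HeaderCellCollector, header_cells) are made up for illustration, and the sample string is the header row quoted in the question.

```python
from html.parser import HTMLParser


class HeaderCellCollector(HTMLParser):
    """Collect the text of every <th> inside the first <thead> only."""

    def __init__(self):
        super().__init__()
        self.in_thead = False
        self.in_th = False
        self.done = False      # set once the first </thead> has been seen
        self.headers = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if tag == "thead":
            self.in_thead = True
        elif tag == "th" and self.in_thead:
            self.in_th = True
            self.headers.append("")

    def handle_endtag(self, tag):
        if tag == "thead" and self.in_thead:
            self.in_thead = False
            self.done = True
        elif tag == "th":
            self.in_th = False

    def handle_data(self, data):
        # Only text inside a header cell of the first <thead> is kept.
        if self.in_th and self.in_thead:
            self.headers[-1] += data.strip()


def header_cells(html):
    """Return the column headers found in the first <thead> of html."""
    parser = HeaderCellCollector()
    parser.feed(html)
    return parser.headers


# Header row as quoted in the question (the view with all five years).
sample = """
<table class="mod-ui-table">
  <thead>
    <tr>
      <th class="mod-ui-table__header--text">Fiscal data as of Jan 30 2016</th>
      <th>2016</th>
      <th class="mod-ui-hide-xsmall">2015</th>
      <th class="mod-ui-hide-xsmall">2014</th>
      <th class="mod-ui-hide-xsmall">2013</th>
      <th class="mod-ui-hide-xsmall">2012</th>
    </tr>
  </thead>
</table>
"""

print(len(header_cells(sample)), header_cells(sample))
```

Running header_cells() on the text the script actually downloads, rather than on HTML copied from the browser, shows whether the missing columns were ever sent. If the count is already lower there, the parser is not at fault.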

So I am now proceeding to work out how to log into the FT website (perhaps using Twill).
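As an alternative to Twill, a cookie-aware opener from the standard library may be enough, assuming the site accepts a plain form POST for login. This is only a sketch: LOGIN_URL and the two form field names are hypothetical placeholders that would have to be read from the real login form, and a real login flow may also need hidden tokens or JavaScript that this does not handle.

```python
import http.cookiejar
import urllib.parse
import urllib.request

# Hypothetical placeholders: the real login URL and form field names are not
# given in this thread and must be taken from the site's login form.
LOGIN_URL = "https://example.com/login"
USERNAME_FIELD = "username"
PASSWORD_FIELD = "password"


def make_opener():
    """Return (opener, jar); the opener stores and resends cookies."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar)
    )
    return opener, jar


def build_login_request(username, password):
    """Build a form-encoded POST request carrying the credentials."""
    body = urllib.parse.urlencode(
        {USERNAME_FIELD: username, PASSWORD_FIELD: password}
    ).encode("utf-8")
    return urllib.request.Request(LOGIN_URL, data=body, method="POST")


def fetch_logged_in(url, username, password):
    """Log in, then fetch url with the session cookies and return its HTML."""
    opener, _ = make_opener()
    # The login response sets the session cookies in the opener's jar.
    opener.open(build_login_request(username, password)).close()
    with opener.open(url) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")
```

The text returned by fetch_logged_in() can then be handed to pd.read_html() (wrapped in io.StringIO on recent pandas versions) instead of passing the URL directly, so that the parse runs on the logged-in page.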

There is a related question here
