
Shaping pandas read_html results into simpler structure

I was hoping someone could advise me how to create a pandas DataFrame that contains only the text from the second column, without the first two rows or the left column. The solution needs to cope with multiple similar tables.

I had thought that pd.read_html(LOTable.prettify(), skiprows=2, flavor='bs4'), which creates a list of DataFrames from the HTML (skipping the first two rows), would be the way to go, but the resulting data structure was too confusing for this novice to understand or manipulate into a simpler structure.

Would others have a way of working with the resulting structure, or recommend alternative ways of refining the data, so that I end up with one column containing just the text I need?

Sample Table

<table cellpadding="5" cellspacing="0" class="borders" width="100%">
    <tr>
     <th colspan="2">
      Learning Outcomes
     </th>
    </tr>
    <tr>
     <td class="info" colspan="2">
      On successful completion of this module the learner will be able to:
     </td>
    </tr>
    <tr>
     <td style="width:10%;">
      LO1
     </td>
     <td>
      Demonstrate an awareness of the important role of Financial Accounting information as an input into the decision making process.
     </td>
    </tr>
    <tr>
     <td style="width:10%;">
      LO2
     </td>
     <td>
      Display an understanding of the fundamental accounting concepts, principles and conventions that underpin the preparation of Financial statements.
     </td>
    </tr>
    <tr>
     <td style="width:10%;">
      LO3
     </td>
     <td>
      Understand the various formats in which  information in relation to transactions or events is recorded and classified.
     </td>
    </tr>
    <tr>
     <td style="width:10%;">
      LO4
     </td>
     <td>
      Apply a knowledge of accounting concepts,conventions and techniques such as double entry to the  posting of  recorded information to the T accounts in the Nominal Ledger.
     </td>
    </tr>
    <tr>
     <td style="width:10%;">
      LO5
     </td>
     <td>
      Prepare and present the financial statements of a Sole Trader  in prescribed format from a Trial Balance  accompanies by notes with additional information.
     </td>
    </tr>
   </table> 

First option
Use iloc

pd.read_html returns a list of DataFrames, so pick the first table with [0] and then let iloc drop the first column:

pd.read_html(LOTable.prettify(), skiprows=2, flavor='bs4')[0].iloc[:, 1:]

explanation

...iloc[:, 1:]
#       ^   ^
#       |    \
# says to    says to take columns
# take all   from position 1 onward
# rows

You could take just that single column (as a Series rather than a DataFrame) with

pd.read_html(LOTable.prettify(), skiprows=2, flavor='bs4')[0].iloc[:, 1]
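The difference between the two slices is easy to check on a small hand-built frame. This is only an illustrative sketch: the two-row DataFrame below is a made-up stand-in for one parsed table, not output from read_html.

```python
import pandas as pd

# Hypothetical stand-in for one parsed table: label column, then text column.
df = pd.DataFrame([
    ["LO1", "first outcome"],
    ["LO2", "second outcome"],
])

as_frame = df.iloc[:, 1:]   # slice of columns -> still a DataFrame
as_series = df.iloc[:, 1]   # single column position -> a Series

print(type(as_frame).__name__, as_frame.shape)
print(type(as_series).__name__, as_series.tolist())
```

Use the slice form when later code expects a DataFrame, and the single-position form when a plain column of values is enough.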

Working code that I ran

import pandas as pd

htm = """<table cellpadding="5" cellspacing="0" class="borders" width="100%">
    <tr>
     <th colspan="2">
      Learning Outcomes
     </th>
    </tr>
    <tr>
     <td class="info" colspan="2">
      On successful completion of this module the learner will be able to:
     </td>
    </tr>
    <tr>
     <td style="width:10%;">
      LO1
     </td>
     <td>
      Demonstrate an awareness of the important role of Financial Accounting information as an input into the decision making process.
     </td>
    </tr>
    <tr>
     <td style="width:10%;">
      LO2
     </td>
     <td>
      Display an understanding of the fundamental accounting concepts, principles and conventions that underpin the preparation of Financial statements.
     </td>
    </tr>
    <tr>
     <td style="width:10%;">
      LO3
     </td>
     <td>
      Understand the various formats in which  information in relation to transactions or events is recorded and classified.
     </td>
    </tr>
    <tr>
     <td style="width:10%;">
      LO4
     </td>
     <td>
      Apply a knowledge of accounting concepts,conventions and techniques such as double entry to the  posting of  recorded information to the T accounts in the Nominal Ledger.
     </td>
    </tr>
    <tr>
     <td style="width:10%;">
      LO5
     </td>
     <td>
      Prepare and present the financial statements of a Sole Trader  in prescribed format from a Trial Balance  accompanies by notes with additional information.
     </td>
    </tr>
   </table> """

pd.read_html(htm,skiprows=2, flavor='bs4')[0].iloc[:, 1:]
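For the "multiple similar tables" part of the question, one possible approach is to parse every table and stack the text columns into a single column. The sketch below makes a few assumptions that are not in the original answer: the two shortened tables are stand-ins for the real pages, the helper name outcome_text is made up, and instead of skiprows=2 it keeps only rows whose first cell looks like "LO" followed by a number, which avoids depending on how a given pandas version treats the header row. It relies on the default parser, so lxml (or bs4 with html5lib) must be installed.

```python
from io import StringIO

import pandas as pd

# Two shortened stand-in tables with the same layout as the sample table.
html = """
<table>
  <tr><th colspan="2">Learning Outcomes</th></tr>
  <tr><td colspan="2">On successful completion the learner will be able to:</td></tr>
  <tr><td>LO1</td><td>Demonstrate an awareness of accounting information.</td></tr>
  <tr><td>LO2</td><td>Display an understanding of accounting concepts.</td></tr>
</table>
<table>
  <tr><th colspan="2">Learning Outcomes</th></tr>
  <tr><td colspan="2">On successful completion the learner will be able to:</td></tr>
  <tr><td>LO1</td><td>Prepare a trial balance.</td></tr>
</table>
"""


def outcome_text(table):
    """Return the second column for rows labelled LO1, LO2, ... (hypothetical helper)."""
    labels = table.iloc[:, 0].astype(str).str.strip()
    rows = table[labels.str.match(r"LO\d+$")]
    return rows.iloc[:, 1]


# read_html returns one DataFrame per <table>; StringIO avoids passing raw HTML text.
tables = pd.read_html(StringIO(html))
outcomes = pd.concat([outcome_text(t) for t in tables], ignore_index=True)
result = outcomes.rename("outcome").to_frame()
print(result)
```

If the label column does not follow the "LO" pattern on other pages, the regular expression in outcome_text would need adjusting.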
