
Shaping pandas read_html results into simpler structure

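I am hoping someone can advise me on how to create a pandas DataFrame that contains just the 2nd column, without the text in rows 1 and 2 or the left column. The solution needs to be able to handle multiple similar tables.

I had thought that pd.read_html(LOTable.prettify(),skiprows=2, flavor='bs4'), which creates a list of DataFrames from the HTML (skipping 2 rows), would be the way to do it, but the resulting data structure is too confusing for this newbie to understand or manipulate into a simpler structure.

Does anyone have a way of dealing with the resulting structure, or can anyone recommend an alternative approach to refining the data, so that I end up with 1 column containing just the text I need?

Sample table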

<table cellpadding="5" cellspacing="0" class="borders" width="100%">
    <tr>
     <th colspan="2">
      Learning Outcomes
     </th>
    </tr>
    <tr>
     <td class="info" colspan="2">
      On successful completion of this module the learner will be able to:
     </td>
    </tr>
    <tr>
     <td style="width:10%;">
      LO1
     </td>
     <td>
      Demonstrate an awareness of the important role of Financial Accounting information as an input into the decision making process.
     </td>
    </tr>
    <tr>
     <td style="width:10%;">
      LO2
     </td>
     <td>
      Display an understanding of the fundamental accounting concepts, principles and conventions that underpin the preparation of Financial statements.
     </td>
    </tr>
    <tr>
     <td style="width:10%;">
      LO3
     </td>
     <td>
      Understand the various formats in which  information in relation to transactions or events is recorded and classified.
     </td>
    </tr>
    <tr>
     <td style="width:10%;">
      LO4
     </td>
     <td>
      Apply a knowledge of accounting concepts,conventions and techniques such as double entry to the  posting of  recorded information to the T accounts in the Nominal Ledger.
     </td>
    </tr>
    <tr>
     <td style="width:10%;">
      LO5
     </td>
     <td>
      Prepare and present the financial statements of a Sole Trader  in prescribed format from a Trial Balance  accompanies by notes with additional information.
     </td>
    </tr>
   </table> 

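First option
Use iloc

This should work by having iloc drop the first column. Note that read_html returns a list of DataFrames, so select a table with [0] before calling iloc.

pd.read_html(LOTable.prettify(),skiprows=2, flavor='bs4')[0].iloc[:, 1:]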

Explanation

...iloc[:, 1:]
#       ^   ^
#       |    \
# says to    says to take columns
# take all   from position 1 onward
# rows       (positions start at 0)

You can also take just the single column, which gives a Series instead of a DataFrame

pd.read_html(LOTable.prettify(),skiprows=2, flavor='bs4')[0].iloc[:, 1]

Working code I ran
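The difference between the two iloc forms can be checked on a small hand-built frame. This is only an illustrative sketch: the two-row DataFrame below is a hypothetical stand-in for one parsed table, not output from read_html.

```python
import pandas as pd

# Hypothetical stand-in for one parsed table:
# outcome labels in column 0, outcome text in column 1.
df = pd.DataFrame([["LO1", "first outcome"], ["LO2", "second outcome"]])

as_frame = df.iloc[:, 1:]   # slice of columns -> still a DataFrame
as_series = df.iloc[:, 1]   # single column position -> a Series

print(type(as_frame).__name__, as_frame.shape)
print(type(as_series).__name__, as_series.shape)
```

The slice form is convenient when more columns may follow later; the single-position form is handier when only the text itself is wanted, for example to call tolist() on it.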

import pandas as pd

htm = """<table cellpadding="5" cellspacing="0" class="borders" width="100%">
    <tr>
     <th colspan="2">
      Learning Outcomes
     </th>
    </tr>
    <tr>
     <td class="info" colspan="2">
      On successful completion of this module the learner will be able to:
     </td>
    </tr>
    <tr>
     <td style="width:10%;">
      LO1
     </td>
     <td>
      Demonstrate an awareness of the important role of Financial Accounting information as an input into the decision making process.
     </td>
    </tr>
    <tr>
     <td style="width:10%;">
      LO2
     </td>
     <td>
      Display an understanding of the fundamental accounting concepts, principles and conventions that underpin the preparation of Financial statements.
     </td>
    </tr>
    <tr>
     <td style="width:10%;">
      LO3
     </td>
     <td>
      Understand the various formats in which  information in relation to transactions or events is recorded and classified.
     </td>
    </tr>
    <tr>
     <td style="width:10%;">
      LO4
     </td>
     <td>
      Apply a knowledge of accounting concepts,conventions and techniques such as double entry to the  posting of  recorded information to the T accounts in the Nominal Ledger.
     </td>
    </tr>
    <tr>
     <td style="width:10%;">
      LO5
     </td>
     <td>
      Prepare and present the financial statements of a Sole Trader  in prescribed format from a Trial Balance  accompanies by notes with additional information.
     </td>
    </tr>
   </table> """

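# read_html returns a list of DataFrames: [0] picks the first table,
# and iloc[:, 1:] keeps every column after the label column
pd.read_html(htm,skiprows=2, flavor='bs4')[0].iloc[:, 1:]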
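Since the question asks for something that handles multiple similar tables, here is one possible sketch of a helper that loops over every table read_html finds and stacks the text column into a single Series. It is a variation on the answer above, not the same method: instead of skiprows it keeps only the rows whose first cell looks like an outcome label, because how skipped rows interact with header detection may differ between pandas versions. The function name outcome_text, the LO\d+ label pattern, and the shortened sample markup are all assumptions made for illustration.

```python
from io import StringIO

import pandas as pd


def outcome_text(html, flavor="bs4"):
    """Collect the second-column text of every outcome row in every table.

    Hypothetical helper: assumes labels such as LO1, LO2 sit in the first column.
    """
    # StringIO wraps the markup as a file-like object, which read_html accepts.
    tables = pd.read_html(StringIO(html), flavor=flavor)
    parts = []
    for table in tables:
        labels = table.iloc[:, 0].astype(str)
        # Keep only rows whose first cell looks like "LO1", "LO2", ...
        is_outcome = labels.str.match(r"LO\d+$")
        parts.append(table.loc[is_outcome].iloc[:, 1])
    # One flat Series with a fresh 0..n-1 index
    return pd.concat(parts, ignore_index=True).rename("outcome")


# Shortened, made-up markup with the same layout as the sample table
sample = """
<table>
 <tr><th colspan="2">Learning Outcomes</th></tr>
 <tr><td colspan="2">On successful completion the learner will be able to:</td></tr>
 <tr><td>LO1</td><td>First outcome.</td></tr>
 <tr><td>LO2</td><td>Second outcome.</td></tr>
</table>
<table>
 <tr><th colspan="2">Learning Outcomes</th></tr>
 <tr><td colspan="2">On successful completion the learner will be able to:</td></tr>
 <tr><td>LO1</td><td>Third outcome.</td></tr>
</table>
"""

print(outcome_text(sample))
```

If the labels in the real pages follow a different pattern, the regular expression is the only part that should need changing. As in the original call, flavor='bs4' requires both the bs4 and html5lib packages to be installed.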
