Parsing a table with BeautifulSoup in Python

Question

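I want to iterate over each row and capture the value of td.text. The problem is that the table has no class, and all the td elements share the same class name. I want to go through each row and get the following output:

First row) "AMERICANS SOCCER CLUB", "B11EB-AMERICANS-B11EB-WARZALA", "Cameron Coya", "Player 228004", "2016-09-10", "player persistently infringes the laws of the game", "C" (new line)

Second row) "AVIATORS SOCCER CLUB", "G12DB-AVIATORS-G12DB-REYNGOUDT", "Saskia Reyes", "Player 224463", "2016-09-11", "player/sub guilty of unsporting behavior", "C" (new line)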

<div style="overflow:auto; border:1px #cccccc solid;">
<table cellspacing="0" cellpadding="3" align="left" border="0" width="100%">
    <tbody>
        <tr class="tblHeading">
            <td colspan="7">AMERICANS SOCCER CLUB</td>
        </tr>
        <tr bgcolor="#CCE4F1">
            <td colspan="7">B11EB - AMERICANS-B11EB-WARZALA</td> 
        </tr>
        <tr bgcolor="#FFFFFF">
            <td width="19%" class="tdUnderLine"> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Cameron Coya                                       </td>
            <td width="19%" class="tdUnderLine">
                Rozel, Max
            </td>
            <td width="06%" class="tdUnderLine"> 
            09-11-2016
            </td>
            <td width="05%" class="tdUnderLine" align="center">         
                <a href="http://www.ncsanj.com/gameRefReportPrint.cfm?gid=228004" target="_blank">228004</a>    
            </td>
            <td width="16%" class="tdUnderLine" align="center"> 
                09/10/16 02:15 PM   
            </td>
            <td width="30%" class="tdUnderLine">                player persistently infringes the laws of the game   </td>
            <td class="tdUnderLine">                Cautioned    </td>
        </tr>
        <tr class="tblHeading">
            <td colspan="7">AVIATORS SOCCER CLUB</td>
        </tr>
        <tr bgcolor="#CCE4F1">
            <td colspan="7">G12DB - AVIATORS-G12DB-REYNGOUDT</td> 
        </tr>
        <tr bgcolor="#FBFBFB">
            <td width="19%" class="tdUnderLine"> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Saskia Reyes                                       </td>
            <td width="19%" class="tdUnderLine">
                HollaenderNardelli, Eric
            </td>
            <td width="06%" class="tdUnderLine"> 
            09-11-2016
            </td>
            <td width="05%" class="tdUnderLine" align="center">         

                <a href="http://www.ncsanj.com/gameRefReportPrint.cfm?gid=224463" target="_blank">224463</a>    
            </td>
            <td width="16%" class="tdUnderLine" align="center"> 
                09/11/16 06:45 PM   
            </td>
            <td width="30%" class="tdUnderLine">                player/sub guilty of unsporting behavior     </td>
            <td class="tdUnderLine">                Cautioned    </td>
        </tr>
        <tr class="tblHeading">
            <td colspan="7">BERGENFIELD SOCCER CLUB</td>
        </tr>
        <tr bgcolor="#CCE4F1">
            <td colspan="7">B11CW - BERGENFIELD-B11CW-NARVAEZ</td> 
        </tr>
        <tr bgcolor="#FFFFFF">
            <td width="19%" class="tdUnderLine"> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Christian Latorre                                  </td>
            <td width="19%" class="tdUnderLine">
                Coyle, Kevin
            </td>
            <td width="06%" class="tdUnderLine"> 
            09-10-2016
            </td>
            <td width="05%" class="tdUnderLine" align="center">         

                <a href="http://www.ncsanj.com/gameRefReportPrint.cfm?gid=226294" target="_blank">226294</a>    
            </td>
            <td width="16%" class="tdUnderLine" align="center"> 

                09/10/16 11:00 AM   

            </td>
            <td width="30%" class="tdUnderLine">                player persistently infringes the laws of the game   </td>
            <td class="tdUnderLine">                Cautioned    </td>
        </tr>

I tried the following code.

import requests
from bs4 import BeautifulSoup
import re
try:
    import urllib.request as urllib2
except ImportError:
    import urllib2

url = r"G:\Freelancer\NC Soccer\Northern Counties Soccer Association ©.html"
page = open(url, encoding="utf8")
soup = BeautifulSoup(page.read(),"html.parser")

#tableList = soup.findAll("table")

for tr in soup.find_all("tr"):
    for td in tr.find_all("td"):
        print(td.text.strip())

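But obviously this returns the text of every td, so I cannot identify a particular column or tell where a new record starts. I would like to know:

1) How to identify each column (since the class names are all the same), as well as the headings (I would appreciate it if you could provide code for this)

2) How to identify a new record in this kind of structure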

Answer 1

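If the data really is structured like a table, there is a good chance you can read it straight into pandas with pd.read_table(). Note that it accepts a URL in the filepath_or_buffer parameter. http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_table.html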
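Note that pandas.read_table parses delimited text, whereas pandas.read_html is the function aimed at HTML markup. A minimal sketch of the latter, assuming pandas and an HTML parser backend such as lxml are installed; the SAMPLE markup is a shortened, hypothetical stand-in for the question's table (two cells per row instead of seven):

```python
import io

import pandas as pd

# Shortened stand-in for the question's table (assumption: two columns only).
SAMPLE = """
<table>
  <tr class="tblHeading"><td colspan="2">AMERICANS SOCCER CLUB</td></tr>
  <tr bgcolor="#CCE4F1"><td colspan="2">B11EB - AMERICANS-B11EB-WARZALA</td></tr>
  <tr bgcolor="#FFFFFF"><td>Cameron Coya</td><td>Cautioned</td></tr>
</table>
"""

# read_html returns a list with one DataFrame per <table> it finds.
tables = pd.read_html(io.StringIO(SAMPLE))
df = tables[0]
print(df)
```

The club and team rows come back as ordinary rows, with the colspan text repeated in each column, so the rows would still need to be grouped into records afterwards.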

Answer 2

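count = 0
string = ""
for td in soup.find_all("td"):
    string += "\"" + td.text.strip() + "\","
    count += 1
    # Each record spans 9 td cells: club, team, and 7 data cells.
    if count % 9 == 0:
        print(string[:-1] + "\n\n")  # string[:-1] removes the trailing ","
        string = ""

Because the table is not well formed, we simply work with the td elements directly instead of going into each row and then into each row's td elements, which would complicate things. I just used a string here; you could append the data to a list of lists and process it for later use.
Hope this solves your problem.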
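The list-of-lists variant could be sketched as follows. The helper name chunk_cells is hypothetical, the cells list stands in for the stripped td texts, and the record size of 9 assumes every record has one club cell, one team cell, and seven data cells, as in the question's HTML.

```python
def chunk_cells(cells, size=9):
    """Split a flat list of cell texts into records of `size` cells each."""
    # Trailing cells that do not fill a whole record are dropped.
    usable = len(cells) - len(cells) % size
    return [cells[i:i + size] for i in range(0, usable, size)]


# Hypothetical stand-in for [td.text.strip() for td in soup.find_all("td")].
cells = [
    "AMERICANS SOCCER CLUB", "B11EB - AMERICANS-B11EB-WARZALA",
    "Cameron Coya", "Rozel, Max", "09-11-2016", "228004",
    "09/10/16 02:15 PM",
    "player persistently infringes the laws of the game", "Cautioned",
    "AVIATORS SOCCER CLUB", "G12DB - AVIATORS-G12DB-REYNGOUDT",
    "Saskia Reyes", "HollaenderNardelli, Eric", "09-11-2016", "224463",
    "09/11/16 06:45 PM",
    "player/sub guilty of unsporting behavior", "Cautioned",
]

records = chunk_cells(cells)
for record in records:
    # Index 0 is the club, 2 the player, 5 the game id.
    print(record[0], "|", record[2], "|", record[5])
```

Keeping each record as a list makes it possible to pick columns by position, which addresses the "all td elements have the same class" problem without relying on class names.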

Answer 3

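There seems to be a pattern: after every 7 tr(s), a new record begins. So what you can do is keep a counter starting at 1; when it reaches 7, print a new line and reset the counter.

counter = 1
for tr in soup.find_all("tr"):
    for td in tr.find_all("td"):
        print(td.text.strip())  # process each cell here
    counter = counter + 1
    if counter == 7:
        print("\n")
        counter = 1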
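Counting rows only works if every record has the same number of rows. An alternative sketch, assuming the tblHeading class from the question's HTML always marks the first row of a record, starts a new record whenever such a row appears. The function name group_rows and the shortened SAMPLE markup are illustrative only.

```python
from bs4 import BeautifulSoup

# Shortened sample modelled on the question's HTML (assumption: real pages
# follow the same heading / team / data row layout).
SAMPLE = """
<table><tbody>
<tr class="tblHeading"><td colspan="7">AMERICANS SOCCER CLUB</td></tr>
<tr bgcolor="#CCE4F1"><td colspan="7">B11EB - AMERICANS-B11EB-WARZALA</td></tr>
<tr bgcolor="#FFFFFF"><td class="tdUnderLine"> Cameron Coya </td>
    <td class="tdUnderLine"> Cautioned </td></tr>
<tr class="tblHeading"><td colspan="7">AVIATORS SOCCER CLUB</td></tr>
<tr bgcolor="#CCE4F1"><td colspan="7">G12DB - AVIATORS-G12DB-REYNGOUDT</td></tr>
<tr bgcolor="#FBFBFB"><td class="tdUnderLine"> Saskia Reyes </td>
    <td class="tdUnderLine"> Cautioned </td></tr>
</tbody></table>
"""


def group_rows(html):
    """Return one list of cell texts per record."""
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for tr in soup.find_all("tr"):
        # "class" is multi-valued in BeautifulSoup, so tr.get returns a list.
        if "tblHeading" in tr.get("class", []):
            records.append([])  # a heading row opens a new record
        if not records:
            continue  # skip rows that appear before the first heading
        records[-1].extend(td.get_text(strip=True) for td in tr.find_all("td"))
    return records


for record in group_rows(SAMPLE):
    print(record)
```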

Answer 4

from __future__ import print_function
import re
import datetime
from bs4 import BeautifulSoup

soup = ""
with open("/tmp/a.html") as page:
   soup = BeautifulSoup(page.read(),"html.parser")

table = soup.find('div', {'style': 'overflow:auto; border:1px #cccccc solid;'}).find('table')

trs = table.find_all('tr')

table_dict = {}
game = ""
section = ""

for tr in trs:
    if tr.has_attr('class'):
        game = tr.text.strip('\n')
    if tr.has_attr('bgcolor'):
        if tr['bgcolor'] == '#CCE4F1':
            section = tr.text.strip('\n')
        else:
            tds = tr.find_all('td')
            extracted_text = [re.sub(r'([^\x00-\x7F])+','', td.text) for td in tds]
            extracted_text = [x.strip() for x in extracted_text]
            extracted_text = list(filter(lambda x: len(x) > 2, extracted_text))
            extracted_text.pop(1)
            extracted_text[2] = "Player " + extracted_text[2]
            extracted_text[3] = datetime.datetime.strptime(extracted_text[3], '%m/%d/%y %I:%M %p').strftime("%Y-%m-%d")
            extracted_text = ['"' + x + '"' for x in [game, section] + extracted_text]
            print(','.join(extracted_text))

And when run:

$ python a.py

"AMERICANS SOCCER CLUB","B11EB - AMERICANS-B11EB-WARZALA","Cameron Coya","Player 228004","2016-09-10","player persistently infringes the laws of the game","C"
"AVIATORS SOCCER CLUB","G12DB - AVIATORS-G12DB-REYNGOUDT","Saskia Reyes","Player 224463","2016-09-11","player/sub guilty of unsporting behavior","C"
"BERGENFIELD SOCCER CLUB","B11CW - BERGENFIELD-B11CW-NARVAEZ","Christian Latorre","Player 226294","2016-09-10","player persistently infringes the laws of the game","C"

Following further discussion with the OP, the input is https://paste.fedoraproject.org/428111/87928814/raw/ and the output after running the code above is https://paste.fedoraproject.org/428110/38792211/raw/
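The manual quoting and ','.join(...) in the code of this answer could also be delegated to the standard csv module, which additionally escapes any double quotes inside a field. A minimal sketch, assuming the rows have already been extracted into lists of strings; the to_csv_text name is hypothetical and the rows data is copied from the output shown above.

```python
import csv
import io


def to_csv_text(rows):
    """Serialize rows to CSV text with every field quoted."""
    buffer = io.StringIO()
    # QUOTE_ALL wraps every field in double quotes, like the manual version.
    writer = csv.writer(buffer, quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


rows = [
    ["AMERICANS SOCCER CLUB", "B11EB - AMERICANS-B11EB-WARZALA",
     "Cameron Coya", "Player 228004", "2016-09-10",
     "player persistently infringes the laws of the game", "C"],
    ["AVIATORS SOCCER CLUB", "G12DB - AVIATORS-G12DB-REYNGOUDT",
     "Saskia Reyes", "Player 224463", "2016-09-11",
     "player/sub guilty of unsporting behavior", "C"],
]

print(to_csv_text(rows), end="")
```

To write a file instead, the same writer can be attached to a file opened with newline="" in place of the StringIO buffer.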


Answer 1
1 2016-09-14 05:02:18

Answer 2
1 2016-09-14 06:47:34

Answer 3
0 2016-09-14 06:26:40

Answer 4
0 accepted 2016-09-14 06:37:41
