How to parse a table with rowspan and colspan

Question

First, I have read "Parsing a table with rowspan and colspan". I even answered that question. Please read this before marking it as a duplicate.

<table border="1">
  <tr>
    <th>A</th>
    <th>B</th>
  </tr>
  <tr>
    <td rowspan="2">C</td>
    <td rowspan="1">D</td>
  </tr>
  <tr>
    <td>E</td>
    <td>F</td>
  </tr>
  <tr>
    <td>G</td>
    <td>H</td>
  </tr>
</table>

This renders as:

+---+---+---+
| A | B |   |
+---+---+   |
|   | D |   |
+ C +---+---+
|   | E | F |
+---+---+---+
| G | H |   |
+---+---+---+

<table border="1">
  <tr>
    <th>A</th>
    <th>B</th>
  </tr>
  <tr>
    <td rowspan="2">C</td>
    <td rowspan="2">D</td>
  </tr>
  <tr>
    <td>E</td>
    <td>F</td>
  </tr>
  <tr>
    <td>G</td>
    <td>H</td>
  </tr>
</table>

However, this one renders as follows:

+---+---+-------+
| A | B |       |
+---+---+-------+
|   |   |       |
| C | D +---+---+
|   |   | E | F |
+---+---+---+---+
| G | H |       |
+---+---+---+---+

The code in my previous answer can only parse tables that have all of their columns defined in the first row.

from bs4 import BeautifulSoup


def table_to_2d(table_tag):
    rows = table_tag("tr")
    cols = rows[0](["td", "th"])
    table = [[None] * len(cols) for _ in range(len(rows))]
    for row_i, row in enumerate(rows):
        for col_i, col in enumerate(row(["td", "th"])):
            insert(table, row_i, col_i, col)
    return table


def insert(table, row, col, element):
    if row >= len(table) or col >= len(table[row]):
        return
    if table[row][col] is None:
        value = element.get_text()
        table[row][col] = value
        if element.has_attr("colspan"):
            span = int(element["colspan"])
            for i in range(1, span):
                table[row][col+i] = value
        if element.has_attr("rowspan"):
            span = int(element["rowspan"])
            for i in range(1, span):
                table[row+i][col] = value
    else:
        insert(table, row, col + 1, element)

soup = BeautifulSoup('''
    <table>
        <tr><th>1</th><th>2</th><th>5</th></tr>
        <tr><td rowspan="2">3</td><td colspan="2">4</td></tr>
        <tr><td>6</td><td>7</td></tr>
    </table>''', 'html.parser')
print(table_to_2d(soup.table))

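My question is how to parse a table into a 2D array that represents exactly how it is rendered in a browser. An explanation of how browsers render tables would also be fine.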

Answer 1

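You can't just count td or th cells, no. You have to scan across the table to get the number of columns on each row, adding to that count any rowspans still active from a preceding row.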

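In a different scenario, parsing a table with rowspans, I tracked rowspan counts per column number to ensure that data from different cells ended up in the correct column. A similar technique can be used here.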

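First, count columns; keep only the highest number. Keep a list of rowspan numbers of 2 or greater and subtract 1 from each for every row you process. That way you know how many 'extra' columns there are on each row. Take the highest column count to build your output matrix.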
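
The counting pass can be sketched without BeautifulSoup. This is a minimal sketch under assumptions, not the code from this answer: `count_columns` and the `(rowspan, colspan)` tuple input are hypothetical stand-ins for parsed cells, every colspan is counted in full, and "0" spans are not handled.

```python
def count_columns(rows):
    """Return the widest row, counting columns held by earlier rowspans.

    rows: list of rows; each row is a list of (rowspan, colspan) pairs.
    """
    pending = []  # one entry per column still covered by a rowspan from above
    width = 0
    for cells in rows:
        # columns this row needs: its own cells plus columns held from above
        own = sum(colspan for _, colspan in cells)
        width = max(width, own + len(pending))
        # register this row's spans, one entry per column the cell covers
        for rowspan, colspan in cells:
            pending += [rowspan] * colspan
        # this row is done: drop finished spans, count down the rest
        pending = [n - 1 for n in pending if n > 1]
    return width


# The two tables from the question, reduced to (rowspan, colspan) pairs.
table1 = [[(1, 1), (1, 1)], [(2, 1), (1, 1)], [(1, 1), (1, 1)], [(1, 1), (1, 1)]]
table2 = [[(1, 1), (1, 1)], [(2, 1), (2, 1)], [(1, 1), (1, 1)], [(1, 1), (1, 1)]]
print(count_columns(table1))  # 3
print(count_columns(table2))  # 4
```

Both tables have only two cells in their first row, which is why sizing the matrix from the first row alone falls short.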

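Next, loop over the rows and cells again, this time tracking rowspans in a dictionary that maps column number to active count. Again, carry over anything with a value of 2 or more to the next row. Then shift column numbers to account for any rowspans that are active; the first td in a row is actually the second if a rowspan is active on column 0, and so on.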
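
The column shift for a single row can be illustrated in isolation. A small sketch, assuming a hypothetical helper `place_row` that takes the colspan of each cell in the row and a dictionary of columns still covered by rowspans from earlier rows:

```python
def place_row(colspans, active):
    """Return the starting column of each cell in one row.

    colspans: colspan of each cell, in document order.
    active:   dict mapping column number -> rows still covered by a
              rowspan that started in an earlier row.
    """
    starts = []
    col = 0
    for colspan in colspans:
        # skip every column that a rowspan from above still occupies
        while active.get(col, 0):
            col += 1
        starts.append(col)
        col += colspan  # the next cell begins after this cell's columns
    return starts


print(place_row([1, 1], {}))            # [0, 1]
print(place_row([1, 1], {0: 1}))        # [1, 2]
print(place_row([1, 1], {0: 1, 1: 1}))  # [2, 3]
```

The second and third calls correspond to the E and F row in the two tables from the question.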

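Your code copies the value of spanned columns and rows into the output repeatedly; I achieved the same by looping over the colspan and rowspan numbers of a given cell (each defaulting to 1) to copy the value multiple times. I'm ignoring overlapping cells; the HTML table specifications state that overlapping cells are an error and that it is up to the user agent to resolve conflicts. In the code below, colspan trumps rowspan cells.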

from itertools import product

def table_to_2d(table_tag):
    rowspans = []  # track pending rowspans
    rows = table_tag.find_all('tr')

    # first scan, see how many columns we need
    colcount = 0
    for r, row in enumerate(rows):
        cells = row.find_all(['td', 'th'], recursive=False)
        # count columns (including spanned).
        # add active rowspans from preceding rows
        # we *ignore* the colspan value on the last cell, to prevent
        # creating 'phantom' columns with no actual cells, only extended
        # colspans. This is achieved by hardcoding the last cell width as 1. 
        # a colspan of 0 means “fill until the end” but can really only apply
        # to the last cell; ignore it elsewhere. 
        colcount = max(
            colcount,
            sum(int(c.get('colspan', 1)) or 1 for c in cells[:-1]) + len(cells[-1:]) + len(rowspans))
        # update rowspan bookkeeping; 0 is a span to the bottom. 
        rowspans += [int(c.get('rowspan', 1)) or len(rows) - r for c in cells]
        rowspans = [s - 1 for s in rowspans if s > 1]

    # it doesn't matter if there are still rowspan numbers 'active'; no extra
    # rows to show in the table means the larger than 1 rowspan numbers in the
    # last table row are ignored.

    # build an empty matrix for all possible cells
    table = [[None] * colcount for row in rows]

    # fill matrix from row data
    rowspans = {}  # track pending rowspans, column number mapping to count
    for row, row_elem in enumerate(rows):
        span_offset = 0  # how many columns are skipped due to row and colspans 
        for col, cell in enumerate(row_elem.find_all(['td', 'th'], recursive=False)):
            # adjust for preceding row and colspans
            col += span_offset
            while rowspans.get(col, 0):
                span_offset += 1
                col += 1

            # fill table data
            rowspan = rowspans[col] = int(cell.get('rowspan', 1)) or len(rows) - row
            colspan = int(cell.get('colspan', 1)) or colcount - col
            # next column is offset by the colspan
            span_offset += colspan - 1
            value = cell.get_text()
            for drow, dcol in product(range(rowspan), range(colspan)):
                try:
                    table[row + drow][col + dcol] = value
                    rowspans[col + dcol] = rowspan
                except IndexError:
                    # rowspan or colspan outside the confines of the table
                    pass

        # update rowspan bookkeeping
        rowspans = {c: s - 1 for c, s in rowspans.items() if s > 1}

    return table

This parses your sample table correctly:

>>> from pprint import pprint
>>> pprint(table_to_2d(soup.table), width=30)
[['1', '2', '5'],
 ['3', '4', '4'],
 ['3', '6', '7']]

and handles your other examples; the first table:

>>> table1 = BeautifulSoup('''
... <table border="1">
...   <tr>
...     <th>A</th>
...     <th>B</th>
...   </tr>
...   <tr>
...     <td rowspan="2">C</td>
...     <td rowspan="1">D</td>
...   </tr>
...   <tr>
...     <td>E</td>
...     <td>F</td>
...   </tr>
...   <tr>
...     <td>G</td>
...     <td>H</td>
...   </tr>
... </table>''', 'html.parser')
>>> pprint(table_to_2d(table1.table), width=30)
[['A', 'B', None],
 ['C', 'D', None],
 ['C', 'E', 'F'],
 ['G', 'H', None]]

and the second:

>>> table2 = BeautifulSoup('''
... <table border="1">
...   <tr>
...     <th>A</th>
...     <th>B</th>
...   </tr>
...   <tr>
...     <td rowspan="2">C</td>
...     <td rowspan="2">D</td>
...   </tr>
...   <tr>
...     <td>E</td>
...     <td>F</td>
...   </tr>
...   <tr>
...     <td>G</td>
...     <td>H</td>
...   </tr>
... </table>
... ''', 'html.parser')
>>> pprint(table_to_2d(table2.table), width=30)
[['A', 'B', None, None],
 ['C', 'D', None, None],
 ['C', 'D', 'E', 'F'],
 ['G', 'H', None, None]]

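Last but not least, the code correctly handles spans that extend beyond the actual table, as well as "0" spans (extending to the end), as in the following example: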

<table border="1">
  <tr>
    <td rowspan="3">A</td>
    <td rowspan="0">B</td>
    <td>C</td>
    <td colspan="2">D</td>
  </tr>
  <tr>
    <td colspan="0">E</td>
  </tr>
</table>

There are two rows of 4 cells, even though the rowspan and colspan values might suggest 3 rows and 5 columns:

+---+---+---+---+
|   |   | C | D |
| A | B +---+---+
|   |   |   E   |
+---+---+-------+

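Such overspanning is handled just like the browser would handle it; the excess is ignored, and the 0 spans extend to the remaining rows or columns: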

>>> span_demo = BeautifulSoup('''
... <table border="1">
...   <tr>
...     <td rowspan="3">A</td>
...     <td rowspan="0">B</td>
...     <td>C</td>
...     <td colspan="2">D</td>
...   </tr>
...   <tr>
...     <td colspan="0">E</td>
...   </tr>
... </table>''', 'html.parser')
>>> pprint(table_to_2d(span_demo.table), width=30)
[['A', 'B', 'C', 'D'],
 ['A', 'B', 'E', 'E']]
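
The "0" handling relies on the `int(...) or fallback` idiom used in the code above: `int("0")` is falsy, so the expression falls through to the number of remaining rows or columns. A minimal sketch, assuming a hypothetical helper `effective_span` and a plain dict standing in for a tag's attributes:

```python
def effective_span(attrs, name, remaining):
    """Resolve a span attribute the way the code above does.

    attrs:     dict of attribute strings (stand-in for a parsed tag).
    name:      'rowspan' or 'colspan'.
    remaining: rows or columns left, counting the current one.
    """
    # a missing attribute defaults to 1; a value of 0 is falsy, so the
    # `or` falls back to `remaining`
    return int(attrs.get(name, 1)) or remaining


print(effective_span({}, "rowspan", 2))                # 1
print(effective_span({"rowspan": "0"}, "rowspan", 2))  # 2
print(effective_span({"rowspan": "3"}, "rowspan", 2))  # 3
```

A value that is too large, like the 3 in the last call, is not clamped here; in the code above, the writes that fall outside the matrix are discarded by the `except IndexError` branch.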

Answer 2

It is important to note that Martijn Pieters' solution does not account for cells that have both rowspan and colspan attributes at the same time. For example:

<table border="1">
    <tr>
        <td rowspan="3" colspan="3">A</td>
        <td>B</td>
        <td>C</td>
        <td>D</td>
    </tr>
    <tr>
        <td colspan="3">E</td>
    </tr>
    <tr>
        <td colspan="1">E</td>
        <td>C</td>
        <td>C</td>
    </tr>
    <tr>
        <td colspan="1">E</td>
        <td>C</td>
        <td>C</td>
        <td>C</td>
        <td>C</td>
        <td>C</td>
    </tr>
</table>

This table renders as

+-----------+---+---+---+
| A         | B | C | D |
|           +---+---+---+
|           | E         |
|           +---+---+---+
|           | E | C | C |
+---+---+---+---+---+---+
| E | C | C | C | C | C |
+---+---+---+---+---+---+

But if we apply the function, we get

[['A', 'A', 'A', 'B', 'C', 'D'],
 ['A', 'E', 'E', 'E', None, None],
 ['A', 'E', 'C', 'C', None, None],
 ['E', 'C', 'C', 'C', 'C', 'C']]

There may be some edge cases, but extending the rowspan bookkeeping to every cell in the product of rowspan and colspan, i.e.

    for drow, dcol in product(range(rowspan), range(colspan)):
        try:
            table[row + drow][col + dcol] = value
            rowspans[col + dcol] = rowspan
        except IndexError:
            # rowspan or colspan outside the confines of the table
            pass

seems to work for the examples in this thread, and for the table above it outputs

[['A', 'A', 'A', 'B', 'C', 'D'],
 ['A', 'A', 'A', 'E', 'E', 'E'],
 ['A', 'A', 'A', 'E', 'C', 'C'],
 ['E', 'C', 'C', 'C', 'C', 'C']]
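
The effect of that bookkeeping can be checked without BeautifulSoup. The following is a minimal sketch under assumptions: `fill_grid` and the `(text, rowspan, colspan)` tuples are hypothetical stand-ins for the parsed cells, the column count is passed in rather than computed, and "0" spans are not handled.

```python
from itertools import product


def fill_grid(rows, width):
    """Expand cells with rowspan/colspan into a 2D list.

    rows:  list of rows; each row is a list of (text, rowspan, colspan).
    width: number of columns in the rendered table.
    """
    grid = [[None] * width for _ in rows]
    active = {}  # column -> rows (counting the current one) still covered
    for r, cells in enumerate(rows):
        col = 0
        for text, rowspan, colspan in cells:
            # skip columns still covered by a span from an earlier row
            while active.get(col, 0):
                col += 1
            for dr, dc in product(range(rowspan), range(colspan)):
                # ignore the part of a span that leaves the table
                if r + dr < len(rows) and col + dc < width:
                    grid[r + dr][col + dc] = text
                    # record the rowspan for every column the cell covers
                    active[col + dc] = rowspan
            col += colspan
        # this row is done: drop finished spans, count down the rest
        active = {c: n - 1 for c, n in active.items() if n > 1}
    return grid


cells = [
    [("A", 3, 3), ("B", 1, 1), ("C", 1, 1), ("D", 1, 1)],
    [("E", 1, 3)],
    [("E", 1, 1), ("C", 1, 1), ("C", 1, 1)],
    [("E", 1, 1)] + [("C", 1, 1)] * 5,
]
for line in fill_grid(cells, 6):
    print(line)
```

Because all three columns of A are recorded as active, the E cells in rows 1 and 2 start at column 3 instead of column 1.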

Answer 3

Use the usual traversal approach and just change the parser type to lxml.

soup = BeautifulSoup(resp.text, "lxml")

Now parse it the usual way.

3 solutions

Solution 1
12, accepted, 2018-01-25 20:09:43

Solution 2
2, 2018-08-30 10:29:53

Solution 3
0, 2019-03-13 08:31:21
