How can I extract key-value pairs into a dictionary when other tags are present?

Question

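I am trying to use Python to extract Key/Value{1,2} pairs from an HTML table and put them into a dictionary.

The table cells do not always have the same structure, which is why I am asking.

A minimal example: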

  <div class="grabme">
  <table>
     <tbody>

        <tr>
           <td colspan="2">
              <p class="1st 2nd 3rd">
                 Box Headline</p>
           </td>
        </tr>

        <tr>
           <td><strong>First Key</strong></td>
           <td><span>Value</span></td>
           <script>
           </script>
        </tr>

        <tr>
              <td><strong>2. Key</strong></td>
              <td><a>Value</a><br></td>
        </tr>

        <tr>
           <td><strong>3. Key</strong></td>
           <td>Value</td>
        </tr>

        <tr>
           <td><strong>4. Key</strong></td>
           <td>
           <a >Val 1</a>
              Val 2

              <script>
                    $(document).ready(function () {
                       $('.class').click(function (e) {
                          e.bla();
                          sel.bla('/bla/bla', {
                                bla: true
                             }
                          );
                       });
                    });
                 </script>
              </td>
        </tr>

        <tr>
              <td><strong>5. Key</strong></td>
              <td>
                 <i></i>
                 Value
              </td>
        </tr>

     </tbody>

     <tbody>
        <tr>
           <td colspan="2">
              <p class="">
                 Heading 2</p>
           </td>
        </tr>

        <tr>
           <td><strong>6. Key</strong></td>
           <td>Value</td>
        </tr>
     </tbody>
  </table>
  </div>

Getting the keys is easy:

keys = response.xpath('//div[@class="grabme"]/table/tbody/tr/td/strong/text()').extract()

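Unfortunately, I cannot get all the keys in the example this way, because Key 6 is in a new tbody. As a hack, I could fetch it separately and add it to the dict later.

Getting the values is much harder. My best shot is this: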

values = [remove_tags(w).strip() for w in response.xpath('//div[@class="grabme"]/table/tbody/tr/td[1]/text()').extract()]

Unfortunately, this does not work because of the extra HTML tags. If I could get all the values, I could put them into a dictionary:

dict = {first: second for first, second in zip(keys, values)}

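This part may also be tricky, because the example shows that Key 4 has 2 values. They could be combined into a single value with a separator, so that I can process them accordingly later.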
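The separator idea can be sketched in plain Python, assuming the values for each key have already been collected into a list. The dictionary contents and the `" | "` separator are made up for illustration and are not part of the original question.

```python
# Hypothetical per-key value lists, as they might come out of the table above.
raw = {
    "First Key": ["Value"],
    "4. Key": ["Val 1", "Val 2"],
}

SEPARATOR = " | "  # assumed separator; pick one that cannot occur in the data

# Collapse each list into a single string so every key maps to one value.
joined = {key: SEPARATOR.join(values) for key, values in raw.items()}

print(joined)
```

Splitting on the same separator later recovers the individual values.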

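How can I get the values in the example or, even better, is there a smarter way to get a dictionary of all the desired key-value pairs?

This attempt failed because the structure varies: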

cells = response.xpath('//div[@class="grabme"]/tbody/tr/td/text()').extract()
dict = {first: second for first, second in zip(cells[::2], cells[1::2])}
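One way to see why pairing alternate items of a flat list is fragile: as soon as a row contributes one cell or three text nodes instead of two, keys and values swap places. The list below is a hand-written assumption about what a flat extraction might return, not real output from the page.

```python
# Hypothetical flat list of text nodes: the heading row contributes a single
# item and "4. Key" contributes two values, which breaks the key/value rhythm.
cells = [
    "Box Headline",
    "First Key", "Value",
    "4. Key", "Val 1", "Val 2",
    "5. Key", "Value",
]

# Same pairing as in the attempt above: even positions as keys, odd as values.
pairs = dict(zip(cells[::2], cells[1::2]))

print(pairs)
```

Here the headline ends up as a key and "First Key" as its value, which suggests grouping by table row rather than by position.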

Answer 1

You can try this XPath to match both the keys and the values:

//div[@class="grabme"]//td/strong/text() | //div[@class="grabme"]//td[strong]/following-sibling::td//text()[normalize-space() and (parent::td or parent::a or parent::span)]

Or split it into two expressions:

//div[@class="grabme"]//td/strong/text()  # keys
//div[@class="grabme"]//td[strong]/following-sibling::td//text()[normalize-space() and (parent::td or parent::a or parent::span)]  # values
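As a rough illustration of what the value predicate filters, here is a plain-Python analogue (not XPath itself). `normalize-space()` is truthy only for text nodes that contain something other than whitespace, and the `parent::` tests keep text sitting directly in a `td`, `a`, or `span`, which excludes the `script` contents. The `(parent tag, text)` pairs are hand-written assumptions modelled on the "4. Key" row.

```python
# Hand-written (parent tag, text) pairs approximating the text nodes in the
# second cell of the "4. Key" row; they are not produced by a real parser.
text_nodes = [
    ("td", "\n   "),
    ("a", "Val 1"),
    ("td", "\n   Val 2\n\n   "),
    ("script", "$(document).ready(function () { ... });"),
    ("td", "\n"),
]

ALLOWED_PARENTS = {"td", "a", "span"}

# Keep nodes that are not whitespace-only and whose parent tag is allowed.
values = [
    text.strip()
    for parent, text in text_nodes
    if text.strip() and parent in ALLOWED_PARENTS
]

print(values)  # ['Val 1', 'Val 2']
```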

Update

items = {}
value_xpath = ('./td[strong]/following-sibling::td//text()'
               '[normalize-space() and (parent::td or parent::a or parent::span)]')
for row in response.xpath('//div[@class="grabme"]//tr[td[strong]]'):
    key = row.xpath('./td/strong/text()').extract_first()
    items[key] = [text.strip() for text in row.xpath(value_xpath).extract()]
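To try the row-by-row idea outside Scrapy, a standard-library sketch with `html.parser` follows. It swaps the XPath for a hand-rolled parser, so it is an analogue rather than the method above: `KeyValueRows` is a made-up class name, the HTML is a shortened copy of the example, and it keeps all non-script text instead of only text whose parent is `td`, `a`, or `span`.

```python
from html.parser import HTMLParser


class KeyValueRows(HTMLParser):
    """Collect {key: [values]} from rows that contain <strong>key</strong>."""

    def __init__(self):
        super().__init__()
        self.items = {}
        self.key = None        # key of the current row, once seen
        self.values = []       # value texts collected for the current row
        self.in_strong = False
        self.in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            # A new row starts: forget the previous key and values.
            self.key, self.values = None, []
        elif tag == "strong":
            self.in_strong = True
        elif tag == "script":
            self.in_script = True

    def handle_endtag(self, tag):
        if tag == "strong":
            self.in_strong = False
        elif tag == "script":
            self.in_script = False
        elif tag == "tr" and self.key is not None:
            # Rows without a <strong> key (the headings) are skipped.
            self.items[self.key] = self.values

    def handle_data(self, data):
        text = data.strip()
        if not text or self.in_script:
            return  # ignore whitespace-only text and script contents
        if self.in_strong:
            self.key = text
        elif self.key is not None:
            self.values.append(text)


html = """
<table>
  <tr><td colspan="2"><p>Box Headline</p></td></tr>
  <tr><td><strong>First Key</strong></td><td><span>Value</span></td></tr>
  <tr><td><strong>4. Key</strong></td>
      <td><a>Val 1</a> Val 2 <script>var x = 1;</script></td></tr>
</table>
"""

parser = KeyValueRows()
parser.feed(html)
parser.close()
print(parser.items)
```

Because each row is handled on its own, a key with two values or a heading row without a key cannot shift the pairing of the other rows.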

Solution 1
1 Accepted 2018-12-23 09:36:11
