How to find the structure of repeated elements in XML / HTML

Question

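I am currently trying to solve a programming problem: I want to find repeated structures in an arbitrary HTML page and retrieve the values of those elements.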

For example, I have an HTML page with repeated elements like this:

<html>
<body>
  <ul>
     <li>green</li>
     <li>orange</li>
     <li>red</li>
  </ul>
</body>
</html>

In this code, I want to detect that there is a repeated block (the 'li' items) and extract their values. Another HTML example:

<table>
   <tr>
      <td>1</td>
      <td>John</td>
   </tr>
   <tr>
      <td>2</td>
      <td>Simon</td>
   </tr>
</table>

In this example, I want to detect that the structure repeats and get the values [1, John] and [2, Simon] from it.

My question is: is there a simple algorithm for this kind of thing, or, if not, how would you approach it?

Answer 1

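A fairly basic Python program that detects the repeated tr-td-td tag sequences and the repeated td tags is shown below. It builds a signature for each element from its own tag plus the tags of its direct children, counts how often each signature occurs, and prints every element whose signature occurs more than once. With the second HTML example saved in the file xml.html, the program prints: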

tr.td.td 

td 1
td John
tr.td.td 

td 2
td Simon
Counter({'td': 4, 'tr.td.td': 2, 'table.tr.tr': 1})

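#!/usr/bin/env python3
# cElementTree and Element.getchildren() were removed in Python 3.9,
# so this uses ElementTree and plain iteration over child elements.
import xml.etree.ElementTree as ET
from collections import Counter

def sot(r, depth):
    # Build the signature: this tag followed by the tags of its direct children.
    tags = r.tag
    for e in r:
        tags += '.' + sot(e, depth+1)
    # The signature is stashed in r.tail (overwriting the original tail text).
    r.tail = tags
    cc[r.tail] += 1
    return r.tag

def tot(r, depth):
    # Print every element whose signature occurs more than once.
    if cc[r.tail] > 1:
        print(r.tail, r.text)
    for e in r:
        tot(e, depth+1)

cc = Counter()
p = ET.parse("xml.html")
sot(p.getroot(), 0)
tot(p.getroot(), 0)
print(cc)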
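
To get the values back as grouped rows such as [1, John] and [2, Simon], one possible variation is sketched below. It is not part of the program above: the helper names signature and repeated_groups are hypothetical, and it compares only siblings under the same parent instead of counting signatures across the whole document. It also assumes the input is well-formed XML, since ElementTree rejects markup with unclosed tags; real-world HTML would likely need a more tolerant parser.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

def signature(elem):
    """Return the tag followed by the tags of its direct children, e.g. 'tr.td.td'."""
    return '.'.join([elem.tag] + [child.tag for child in elem])

def repeated_groups(root):
    """Map each signature shared by two or more siblings to their text values."""
    groups = {}
    for parent in root.iter():
        # Bucket the children of this parent by structural signature.
        by_sig = defaultdict(list)
        for child in parent:
            by_sig[signature(child)].append(child)
        for sig, elems in by_sig.items():
            if len(elems) > 1:
                # One row per repeated element: its non-blank text fragments.
                rows = [[t.strip() for t in e.itertext() if t.strip()]
                        for e in elems]
                groups.setdefault(sig, []).extend(rows)
    return groups

TABLE = """<table>
   <tr>
      <td>1</td>
      <td>John</td>
   </tr>
   <tr>
      <td>2</td>
      <td>Simon</td>
   </tr>
</table>"""

groups = repeated_groups(ET.fromstring(TABLE))
for sig, rows in groups.items():
    print(sig, rows)
# tr.td.td [['1', 'John'], ['2', 'Simon']]
# td [['1'], ['John'], ['2'], ['Simon']]
```

Rows with the same signature are merged even when they come from different parents, which is why the four td cells end up in a single list. Keying the result on the parent element instead would keep them separate.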
