How to find the structure of repeating elements in XML/HTML

I'm currently trying to solve a programming problem: I want to find repeated structures in an arbitrary HTML page and retrieve the values of the elements that make them up.

For example, I've got an HTML page with repeating elements, like the following:

<html>
<body>
  <ul>
     <li>green</li>
     <li>orange</li>
     <li>red</li>
  </ul>
</body>
</html>

In this code, I'd like to detect that there's a repeating block (the 'li' items) and extract their values.

Another HTML example:

<table>
   <tr>
      <td>1</td>
      <td>John</td>
   </tr>
   <tr>
      <td>2</td>
      <td>Simon</td>
   </tr>
</table>

In this example, I'd like to detect that the 'tr' structure is repeated and get the values [1,John] and [2,Simon] from it.

My question is: is there a simple algorithm for this, or, if not, how would you approach it?

A rather rudimentary Python program that detects the duplicated tr-td-td tag sequence and the duplicated td tags is shown below. With your second HTML example saved in the file xml.html, the program prints out:

tr.td.td 

td 1
td John
tr.td.td 

td 2
td Simon
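Counter({'td': 4, 'tr.td.td': 2, 'table.tr.tr': 1})

#!/usr/bin/env python3
from collections import Counter
from xml.etree import ElementTree as ET  # cElementTree was removed in Python 3.9

def sot(r, depth):
    """Record r's signature (its tag plus its children's tags) and count it."""
    tags = r.tag
    for e in r:  # iterate children directly; getchildren() was removed in Python 3.9
        tags += '.' + sot(e, depth + 1)
    r.tail = tags  # reuse the tail attribute to hold the signature
    cc[r.tail] += 1
    return r.tag  # only the bare tag is returned, so a signature is one level deep

def tot(r, depth):
    """Print the signature and text of every element whose signature repeats."""
    if cc[r.tail] > 1:
        print(r.tail, r.text)
    for e in r:
        tot(e, depth + 1)

cc = Counter()  # signature -> number of elements with that signature
p = ET.parse("xml.html")  # the input must be well-formed XML
sot(p.getroot(), 0)
tot(p.getroot(), 0)
print(cc)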
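
The program above reports which signatures repeat, but it does not group the values into records such as [1,John] and [2,Simon], which is what the question asks for. One possible extension is sketched below. It assumes the input is well-formed XML, and it treats a structure as "repeated" when two or more siblings share the same nested tag signature. The names signature, leaf_values, repeated_records and TABLE are hypothetical helpers introduced for this sketch; they are not part of the original answer.

```python
from collections import defaultdict
from xml.etree import ElementTree as ET

def signature(elem):
    """Return a nested signature such as 'tr(td,td)' describing elem's subtree."""
    children = [signature(child) for child in elem]
    return elem.tag + ("(" + ",".join(children) + ")" if children else "")

def leaf_values(elem):
    """Collect the stripped text of every leaf element under elem, in document order."""
    if len(elem) == 0:
        return [(elem.text or "").strip()]
    values = []
    for child in elem:
        values.extend(leaf_values(child))
    return values

def repeated_records(root):
    """Map each signature shared by two or more siblings to their leaf values."""
    found = {}
    for parent in root.iter():
        # Group the children of this parent by their structural signature.
        groups = defaultdict(list)
        for child in parent:
            groups[signature(child)].append(leaf_values(child))
        # Keep only the signatures that occur more than once under this parent.
        for sig, rows in groups.items():
            if len(rows) > 1:
                found.setdefault(sig, []).extend(rows)
    return found

# Assumption: the second example from the question, inlined as a string.
TABLE = ("<table><tr><td>1</td><td>John</td></tr>"
         "<tr><td>2</td><td>Simon</td></tr></table>")

for sig, rows in repeated_records(ET.fromstring(TABLE)).items():
    print(sig, rows)
```

For the table example, the 'tr(td,td)' entry holds the two rows as lists of strings, and the 'td' entry holds the individual cells. Unlike the one-level signature used by sot, this signature is fully recursive, so two rows only match when their whole subtrees have the same shape. Real-world HTML is often not well-formed XML, so a page would usually need to go through an HTML parser before this approach applies.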
