Python list search, comparison and elimination of elements

I would like to get all elements that don't have a pair. This is a list of XML tags as read from top to bottom, with the angle brackets removed. I would like to find pairs (for example, the opening tag note and the closing tag /note), remove them from the list, and be left with only the tags that don't have pairs.

How do you iterate through the list, compare each tag with all the other tags, and say, for example: aha, I found another 'note' tag that starts with a forward slash?

Thanks.

Are there any other, better ideas for finding mismatched tags?

PS: I do want the order of the list to be preserved and, if possible, equality to be used when a tag is compared to another tag in the list. If the 'in' operator is used it won't work, because when a tag name is a single letter like 'a', the search will return all elements that contain a rather than only exact matches for 'a'.

tags = ['note', 'to', 'bbb', 'bbb', 'firstname', '/firstname', 'lastname', '/lastname', 'from', 'hello', 'hello', 'hello', 'hello', 'hello', 'l', '/from', '/to', 'elephant', 'll', 'from', '/from', 'a1', 'img', 'a2', 'from', 'from', '/from', '/from', '/a2', '/img', '/a1', 'heading', '/heading', 'body', '/body', '/note']
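On the concern about 'in' raised in the PS: the partial-match behaviour only appears when 'in' is applied to a string. Applied to a list, 'in' compares whole elements with ==. A minimal illustration, using a made-up `sample` list rather than the full list above:

```python
# 'sample' is a hypothetical short list used only for this illustration.
sample = ['a1', 'img', 'a2', '/a1']

# Membership in a list compares whole elements with ==, so 'a' is not found.
print('a' in sample)     # False
print('a1' in sample)    # True

# Membership in a string is a substring test; this is where partial matches come from.
print('a' in 'a1')       # True

# An explicit equality test keeps exact matches only.
exact = [t for t in sample if t == 'a1']
print(exact)             # ['a1']
```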

You could create a set of all the closing tags and then use that set to filter the tags.

>>> closing = set([t for t in tags if t.startswith("/")])
>>> [t for t in tags if "/" + t not in closing and t not in closing]
['bbb', 'bbb', 'hello', 'hello', 'hello', 'hello', 'hello', 'l', 'elephant', 'll']

Note, however, that this does not really respect "pairs" of tags; it only checks whether a "closing" variant of the same tag exists anywhere in the list. For instance, given tags = ["a", "a", "/a"] or tags = ["a", "/a", "a"], it removes both instances of a from the list, even though only one of them has a partner.
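If each closing tag should cancel exactly one opening tag, one possible variant is to count tags per name and drop only as many as can be paired. This is a minimal sketch, not part of the answer above: the function name `unpaired_by_count` is made up for illustration, and it ignores nesting (it pairs by count only, and drops the earliest occurrences first).

```python
from collections import Counter

def unpaired_by_count(tags):
    """Return tags left over after pairing each 'x' with one '/x', ignoring nesting."""
    closers = Counter(t[1:] for t in tags if t.startswith("/"))
    openers = Counter(t for t in tags if not t.startswith("/"))
    # Each name can form at most min(openers, closers) pairs.
    open_budget = {name: min(n, closers[name]) for name, n in openers.items()}
    close_budget = {name: min(n, openers[name]) for name, n in closers.items()}
    leftover = []
    for t in tags:
        if t.startswith("/"):
            budget, name = close_budget, t[1:]
        else:
            budget, name = open_budget, t
        if budget.get(name, 0) > 0:
            budget[name] -= 1      # this tag is part of a pair, drop it
        else:
            leftover.append(t)     # no partner left, keep it in original order
    return leftover

print(unpaired_by_count(["a", "a", "/a"]))   # ['a']
print(unpaired_by_count(["a", "/a", "a"]))   # ['a']
```

Comparisons use dictionary keys, so matching is by exact tag name and the original order of the remaining tags is preserved.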

The first part of the program collects all the tags into a list. This is essentially the problem of finding unmatched brackets. It can be solved by treating the list as a stack and recording which tags are faulty while iterating over it.

import re

def clean_attr(attr):
    attr_list = re.split(r'\s+', attr)
    if len(attr_list) == 1:
        return attr
    else:
        return attr_list[0] + '>'

line="""
<?xml version="1.0"?>
<catalog>
   <book id="bk101">
      <author>Gambardella, Matthew</author>
      <title>XML Developer's Guide</title>
      <genre>Computer</genre>
      <price>44.95</price>
      <publish_date>2000-10-01</publish_date>
      <description>An in-depth look at creating applications 
      with XML.</description>
   </book>
   <book id="bk102">
      <author>Ralls, Kim</author>
      <title>Midnight Rain</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2000-12-16</publish_date>
      <description>A former architect battles corporate zombies, 
      an evil sorceress, and her own childhood to become queen 
      of the world.</description>
   </book>
   <book id="bk103">
      <author>Corets, Eva</author>
      <title>Maeve Ascendant</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2000-11-17</publish_date>
      <description>After the collapse of a nanotechnology 
      society in England, the young survivors lay the 
      foundation for a new society.</description>
   </book>
   <book id="bk104">
      <author>Corets, Eva</author>
      <title>Oberon's Legacy</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2001-03-10</publish_date>
      <description>In post-apocalypse England, the mysterious 
      agent known only as Oberon helps to create a new life 
      for the inhabitants of London. Sequel to Maeve 
      Ascendant.</description>
   </book>
   <book id="bk105">
      <author>Corets, Eva</author>
      <title>The Sundered Grail</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2001-09-10</publish_date>
      <description>The two daughters of Maeve, half-sisters, 
      battle one another for control of England. Sequel to 
      Oberon's Legacy.</description>
   </book>
   <book id="bk106">
      <author>Randall, Cynthia</author>
      <title>Lover Birds</title>
      <genre>Romance</genre>
      <price>4.95</price>
      <publish_date>2000-09-02</publish_date>
      <description>When Carla meets Paul at an ornithology 
      conference, tempers fly as feathers get ruffled.</description>
   </book>
   <book id="bk107">
      <author>Thurman, Paula</author>
      <title>Splish Splash</title>
      <genre>Romance</genre>
      <price>4.95</price>
      <publish_date>2000-11-02</publish_date>
      <description>A deep sea diver finds true love twenty 
      thousand leagues beneath the sea.</description>
   </book>
   <book id="bk108">
      <author>Knorr, Stefan</author>
      <title>Creepy Crawlies</title>
      <genre>Horror</genre>
      <price>4.95</price>
      <publish_date>2000-12-06</publish_date>
      <description>An anthology of horror stories about roaches,
      centipedes, scorpions  and other insects.</description>
   </book>
   <book id="bk109">
      <author>Kress, Peter</author>
      <title>Paradox Lost</title>
      <genre>Science Fiction</genre>
      <price>6.95</price>
      <publish_date>2000-11-02</publish_date>
      <description>After an inadvertant trip through a Heisenberg
      Uncertainty Device, James Salway discovers the problems 
      of being quantum.</description>
   </book>
   <book id="bk110">
      <author>O'Brien, Tim</author>
      <title>Microsoft .NET: The Programming Bible</title>
      <genre>Computer</genre>
      <price>36.95</price>
      <publish_date>2000-12-09</publish_date>
      <description>Microsoft's .NET initiative is explored in 
      detail in this deep programmer's reference.</description>
   </book>
      <author>O'Brien, Tim</author>
      <title>MSXML3: A Comprehensive Guide</title>
      <genre>Computer</genre>
      <price>36.95</price>
      <publish_date>2000-12-01</publish_date>
      <description>The Microsoft MSXML3 parser is covered in 
      detail, with attention to XML DOM interfaces, XSLT processing, 
      SAX and more.</description>
   </book>
   <book id="bk112">
      <author>Galos, Mike</author>
      <title>Visual Studio 7: A Comprehensive Guide</title>
      <genre>Computer</genre>
      <price>49.95</price>
      <publish_date>2001-04-16</publish_date>
      <description>Microsoft Visual Studio 7 is explored in depth,
      looking at how Visual Basic, Visual C++, C#, and ASP+ are 
      integrated into a comprehensive development 
      environment.
   </book>
</catalog>

"""
# Opening tags (optionally with attributes) and closing tags, listed separately for inspection.
attr_open = re.findall(r'<[\w+\s=\"]+>', line)
attr_closed = re.findall(r'<\/\w+>', line)
# All tags in document order.
all_attrs = re.findall(r'<[\w+\s=\"]+>|<\/\w+>', line)

# Strip attributes so that '<book id="bk101">' becomes '<book>'.
all_attrs_cleaned = [clean_attr(attr) for attr in all_attrs]

list_as_stack = []  # opening tags still waiting for their closing tag
not_closed = []     # closing tags that did not match the tag on top of the stack

for an_attr in all_attrs_cleaned:
    if not an_attr.startswith('</'):
        list_as_stack.append(an_attr)
    elif list_as_stack and list_as_stack[-1][1:-1] == an_attr[2:-1]:
        # '<name>' on top of the stack matches '</name>': the pair is complete.
        list_as_stack.pop()
    else:
        # Empty stack or a different tag on top: this closing tag is unmatched.
        not_closed.append(an_attr)

print(list_as_stack)
print(not_closed)

In the program above, the first list (list_as_stack) tells you which tags were never closed, and the second list (not_closed) tells you which closing tags have no matching opening tag.
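The same stack idea can also be applied directly to the flat tags list from the question, without any regular expressions. The sketch below is one possible variant, with a made-up function name `unmatched_tags`. It assumes a lenient rule: a closing tag is paired with the nearest still-open tag of the same name, and anything opened in between is treated as never closed.

```python
def unmatched_tags(tags):
    """Return tags without a partner, in original order (lenient nesting rule)."""
    stack = []       # indices of opening tags not yet closed
    unmatched = []   # indices of tags that never found a partner
    for i, t in enumerate(tags):
        if not t.startswith("/"):
            stack.append(i)
            continue
        name = t[1:]
        # Look for the nearest still-open tag with exactly this name.
        for pos in range(len(stack) - 1, -1, -1):
            if tags[stack[pos]] == name:
                # Everything opened after it was never closed.
                unmatched.extend(stack[pos + 1:])
                del stack[pos:]
                break
        else:
            unmatched.append(i)   # closing tag with no opener at all
    unmatched.extend(stack)       # openers still waiting at the end
    return [tags[i] for i in sorted(unmatched)]

print(unmatched_tags(["a", "a", "/a"]))        # ['a']
print(unmatched_tags(["a", "b", "/a", "/b"]))  # ['b', '/b']
```

Tracking indices rather than tag strings is what keeps the result in the original order, and the == comparison ensures that a one-letter tag such as 'a' only matches '/a'.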
