Extract tags from XML using Python

I'm trying to extract tags from an XML file using regular expressions (the re module) in Python. I need to extract the nodes that start with the tag "<PE" and their corresponding Unit IDs, which are the nodes above each "<PE" tag. The file can be seen here.

When I use the code below, I don't get the correct "<unit ID" tags, that is, the ones that correspond to each "<PE" tag. For example, the "<PE" content that my output pairs with "<Unit ID=250" actually belongs to "<Unit ID=149" in the original file. Besides, the code skips some "<Unit ID" tags. Does anyone see where the error is in my code?

import re

t=open('ALICE.per1_replaced.txt','r')
t=t.read()

unitid=re.findall('<unit.*?"pe">', t, re.DOTALL)
PE=re.findall("<PE.*?</PE>", t, re.DOTALL)

a=zip(unitid,PE)
tp=tuple(a)

w=open('Tags.txt','w')
for x, j in tp:
    a=x + '\n'+j + '\n'
    w.write(a)

w.close()

I've tried this version as well, but I had the same problems:

with open('ALICE.per1_replaced.txt', 'r') as t:
    contents = t.read()

unitid = re.findall('<unit.*?"pe">', contents, re.DOTALL)
PE = re.findall('<PE.*?</PE>', contents, re.DOTALL)
with open('PEtagsper1.txt', 'w') as fi:
    for i, p in zip(unitid, PE):
        fi.write("{}\n{}\n".format(i, p))

My desired output is a file with each "<Unit ID=" tag followed by the content of the tag that starts with "<PE" and ends with "</PE>", as below:

<unit id="16" status="FINISHED" type="pe">
<PE producer="A1.ALICE_GG"><html>
  <head>

  </head>
  <body>
    Eu vou me atrasar!' (quando ela voltou a pensar sobre isso mais trade, 
    ocorreu-lhe que deveria ter achado isso curioso, mas na hora tudo pareceu 
    bastante natural); mas quando o Coelho de fato tirou um relógio do bolso 
    do colete e olhou-o, e então se apressou, Alice pôs-se de pé, pois lhe 
    ocorreu que nunca antes vira um coelho com um colete, ou com um relógio de 
    bolso pra tirar, e queimando de curiosidade, ela atravessou o campo atrás 
    dele correndo e, felizmente, chegou justo a tempo de vê-lo entrar dentro 
    de uma grande toca de coelho sob a cerca.
  </body>
</html></PE>

You seem to have multiple <PE> tags under each <unit> tag (e.g., for unit 3), so the zip doesn't pair them correctly. As @Error_2646 noted in the comments, an XML package or Beautiful Soup would work better for this job.
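
As a rough illustration of the XML-package route, here is a minimal sketch using the standard library's xml.etree.ElementTree. It assumes the file is well-formed XML, which I have not verified for the full input. The sample text, the <job> root element and the helper name pes_by_unit are made up for the example.

```python
import xml.etree.ElementTree as ET

# Made-up sample: unit 3 has two <PE> children, unit 4 has one.
# The <job> root element is an assumption, not taken from the real file.
SAMPLE = """<job>
  <unit id="3" status="FINISHED" type="pe">
    <PE producer="A1">first</PE>
    <PE producer="A2">second</PE>
  </unit>
  <unit id="4" status="FINISHED" type="pe">
    <PE producer="A1">third</PE>
  </unit>
</job>"""


def pes_by_unit(xml_text):
    """Return a list of (unit id, [PE text, ...]) pairs in document order."""
    root = ET.fromstring(xml_text)
    result = []
    for unit in root.iter("unit"):
        # itertext() also collects text from nested elements such as <html><body>
        texts = ["".join(pe.itertext()).strip() for pe in unit.iter("PE")]
        result.append((unit.get("id"), texts))
    return result


for unit_id, texts in pes_by_unit(SAMPLE):
    print(unit_id, texts)
```

Because each <PE> is looked up inside its own <unit> element, the pairing cannot drift even when a unit contains several <PE> tags or none at all. For the real file you would use ET.parse with the file name instead of ET.fromstring.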

But if for whatever reason you want to stick to regex, you can fix this by running a regex on the list of strings returned by the first regex. Example code that worked on the small part of the input I downloaded:

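import re

# Read the whole file into one string (same file name as in the question)
with open('ALICE.per1_replaced.txt', 'r') as f:
    t = f.read()

# First cut the text into complete <unit>...</unit> blocks
units = re.findall('<unit.*?</unit>', t, re.DOTALL)
unitList = []
for unit in units:
    # first get your unit regex
    unitid = re.findall('<unit.*?"pe">', unit, re.DOTALL)  # same as the one you use
    # there should only be one within each
    assert len(unitid) == 1
    # now find all PEs for this unit
    PE = re.findall("<PE.*?</PE>", unit, re.DOTALL)  # same as the one you use
    # combine results: the unit tag first, then each of its PE blocks
    output = unitid[0] + "\n"
    for pe in PE:
        output += pe + "\n"
    unitList.append(output)

for x in unitList:
    print(x)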
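
To see the difference between the two approaches without the real file, here is a small sketch on a made-up sample in which unit 3 holds two <PE> elements. The sample text and the names SAMPLE, UNIT_OPEN, PE_BLOCK, zipped and grouped are only for illustration; the regex patterns are the ones from the question.

```python
import re

# Made-up sample: unit 3 holds two <PE> elements, unit 4 holds one.
SAMPLE = (
    '<unit id="3" status="FINISHED" type="pe">\n'
    '<PE producer="A1">first</PE>\n'
    '<PE producer="A2">second</PE>\n'
    '</unit>\n'
    '<unit id="4" status="FINISHED" type="pe">\n'
    '<PE producer="A1">third</PE>\n'
    '</unit>\n'
)

UNIT_OPEN = re.compile(r'<unit.*?"pe">', re.DOTALL)
PE_BLOCK = re.compile(r'<PE.*?</PE>', re.DOTALL)

# Original approach: two independent lists paired purely by position.
zipped = list(zip(UNIT_OPEN.findall(SAMPLE), PE_BLOCK.findall(SAMPLE)))

# Per-unit approach: search for <PE> only inside each <unit> block.
grouped = []
for block in re.findall(r'<unit.*?</unit>', SAMPLE, re.DOTALL):
    grouped.append((UNIT_OPEN.search(block).group(0), PE_BLOCK.findall(block)))

print(zipped[1])   # unit 4 ends up next to the second PE of unit 3
print(grouped[1])  # unit 4 is paired with its own PE
```

With zip, every extra <PE> in an earlier unit shifts all later pairs by one, and zip silently drops whatever is left over in the longer list, which matches the kind of mismatch described in the question.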
