How do I extract tags from an HTML file and write them to a new file?

Question

My HTML file has the following format:

<unit id="2" status="FINISHED" type="pe">

    <S producer="Alice_EN">CHAPTER I Down the Rabbit-Hole</S>

    <MT producer="ALICE_GG">CAPÍTULO I Abaixo do buraco de coelho</MT>

    <annotations revisions="1">

     <annotation r="1">
    

<PE producer="A1.ALICE_GG"><html>
 <head>

 </head>
 <body>
   CAPÍTULO I Descendo pela toca do coelho
  </body>
</html></PE>

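I need to extract everything from two tags throughout the HTML file. The content of one tag, which starts with <unit id...>, is on a single line, but the content of the other, which starts with "<PE producer..." and ends with '</PE>', is spread over several lines. I need to extract the content of both tags and write them to a new file, one after the other. My output should be: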

<unit id="2" status="FINISHED" type="pe">

<PE producer="A1.ALICE_GG"><html>
<head>

</head>
<body>
  CAPÍTULO I Descendo pela toca do coelho
</body>
</html></PE>

My code does not extract the content from all the tags in the file. Does anyone know what is happening and how to make this code work?

import codecs
import re

t=codecs.open('ALICE.per1_replaced.html','r')

t=t.read()


unitid=re.findall('<unit.*?"pe">', t)
PE=re.findall('<PE.*?</PE>', t, re.DOTALL)



for i in unitid:
    for j in PE:
        a=i + '\n' + j + '\n'
    with open('PEtags.txt','w') as fi:
        fi.write(a)

Answer 1

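The problem is in the code that loops over the matches and writes them to the file: the inner loop overwrites `a` on every pass, and the file is reopened in 'w' mode for each `unitid` match, so only the last pair ends up in PEtags.txt.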
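
The effect of reopening a file in 'w' mode inside a loop can be reproduced in isolation. This is only a minimal sketch: the helper names (`write_in_loop`, `write_once`, `read_back`) and the temporary file are made up for illustration and are not part of the question's code.

```python
import os
import tempfile

def write_in_loop(path, items):
    # Re-opening with mode 'w' truncates the file on every iteration.
    for item in items:
        with open(path, 'w') as fh:
            fh.write(item + '\n')

def write_once(path, items):
    # Opening once and writing inside the loop keeps every item.
    with open(path, 'w') as fh:
        for item in items:
            fh.write(item + '\n')

def read_back(path):
    with open(path) as fh:
        return fh.read()

# Hypothetical scratch file, used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), 'demo.txt')

write_in_loop(path, ['a', 'b', 'c'])
overwritten = read_back(path)   # only the last item survives

write_once(path, ['a', 'b', 'c'])
complete = read_back(path)      # all three items are present
```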

If the number of `unitid` matches equals the number of `PE` matches, you can adjust the code to:

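import re

# Read the whole file so the multi-line <PE> blocks can be matched.
with open('ALICE.per1_replaced.html','r') as t:
  contents = t.read()
  # Opening <unit ...> tags that end with "pe"> on the same line.
  unitid=re.findall(r'<unit.*?"pe">', contents)
  # re.DOTALL lets '.' cross newlines inside each <PE>...</PE> block.
  PE=re.findall(r'<PE.*?</PE>', contents, re.DOTALL)
  # Open the output once; zip pairs the n-th unit tag with the n-th PE block.
  with open('PEtags.txt','w') as fi:
    for i, p in zip(unitid, PE):
      fi.write( "{}\n{}\n".format(i, p) )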
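
If the two match counts can differ, `zip` stops at the shorter list and silently drops the rest. One possible alternative, sketched here under assumptions, is to collect both kinds of tag with a single pattern so they come out in document order and no pairing is needed. The names `TAG_PATTERN`, `extract_tags` and `write_tags`, and the UTF-8 encoding, are assumptions for illustration and do not come from the question.

```python
import re

# Matches either a <unit ...> opening tag ending in "pe"> or a complete
# <PE ...>...</PE> element. re.DOTALL lets '.' span the multi-line PE block;
# [^>]* keeps the unit match inside a single tag.
TAG_PATTERN = re.compile(r'<unit\b[^>]*"pe">|<PE\b.*?</PE>', re.DOTALL)

def extract_tags(text):
    """Return the matched unit tags and PE blocks in document order."""
    return [m.group(0) for m in TAG_PATTERN.finditer(text)]

def write_tags(src_path, dst_path, encoding='utf-8'):
    """Write every match from src_path to dst_path, one after the other."""
    # The encoding is an assumption; adjust it to match the real file.
    with open(src_path, 'r', encoding=encoding) as src:
        matches = extract_tags(src.read())
    with open(dst_path, 'w', encoding=encoding) as dst:
        for match in matches:
            dst.write(match + '\n')
```

With the file names from the question, this would be called as `write_tags('ALICE.per1_replaced.html', 'PEtags.txt')`. A unit without a `PE` block then still appears in the output instead of shifting the pairing. For anything more irregular than this sample, an HTML/XML parser is usually a safer choice than regular expressions.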
