
Extract text inside XML tags in Python (while avoiding <p> tags)

I'm working with the NYT corpus in Python and trying to extract only the content of the block with class "full_text" from each .xml article file. For example:

<body.content>
      <block class="lead_paragraph">
        <p>LEAD: Two police officers responding to a reported robbery at a Brooklyn tavern early yesterday were themselves held up by the robbers, who took their revolvers and herded them into a back room with patrons, the police said.</p>
      </block>
      <block class="full_text">
        <p>LEAD: Two police officers responding to a reported robbery at a Brooklyn tavern early yesterday were themselves held up by the robbers, who took their revolvers and herded them into a back room with patrons, the police said.</p>
      </block>

Ideally, I'd like to extract only the string, yielding "LEAD: Two police officers responding to a reported robbery..." but I'm not sure what the best approach is. Is this something that can be easily parsed with a regex? If so, nothing I've tried seems to work.

Any advice would be appreciated!

Is this something that can be easily parsed by regex?

Don't!

Use an XML parser like lxml.

ex = """
<body.content>
      <block class="lead_paragraph">
        <p>LEAD: Two police officers responding to a reported robbery at a Brooklyn tavern early yesterday were themselves held up by the robbers, who took their revolvers and herded them into a back room with patrons, the police said.</p>
      </block>
      <block class="full_text">
        <p>LEAD: Two police officers responding to a reported robbery at a Brooklyn tavern early yesterday were themselves held up by the robbers, who took their revolvers and herded them into a back room with patrons, the police said.</p>
      </block>
</body.content>"""

from lxml import etree

root = etree.fromstring(ex)
# Select the <p> inside the block whose class attribute is "full_text"
print(root.findtext("./block[@class='full_text']/p"))

Output:

LEAD: Two police officers responding to a reported robbery at a 
Brooklyn tavern early yesterday were themselves held up by the robbers, who
took their revolvers and herded them into a back room with patrons, the 
police said.
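If lxml is not installed, the same selection can probably be done with the standard library. The following is a minimal sketch, not part of the original answer: the helper name extract_full_text and the shortened SAMPLE document are made up for illustration, and it assumes a "full_text" block may hold several <p> elements that should be joined with newlines.

```python
import xml.etree.ElementTree as ET

# Shortened, made-up sample in the same shape as the question's snippet.
SAMPLE = """<body.content>
  <block class="lead_paragraph">
    <p>LEAD: first paragraph.</p>
  </block>
  <block class="full_text">
    <p>LEAD: first paragraph.</p>
    <p>Second paragraph of the article.</p>
  </block>
</body.content>"""


def extract_full_text(xml_string):
    """Return the text of every <p> inside <block class="full_text">."""
    root = ET.fromstring(xml_string)
    # The [@class='full_text'] predicate skips the lead_paragraph block.
    paragraphs = root.iterfind(".//block[@class='full_text']/p")
    # itertext() also collects text from any inline child elements.
    return "\n".join("".join(p.itertext()).strip() for p in paragraphs)


print(extract_full_text(SAMPLE))
```

Only the paragraphs under the "full_text" block are returned; the <p> tags themselves never appear in the result because only text nodes are collected.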

You could also use the BeautifulSoup parser.

>>> from bs4 import BeautifulSoup
>>> s = '''<body.content>
      <block class="lead_paragraph">
        <p>LEAD: Two police officers responding to a reported robbery at a Brooklyn tavern early yesterday were themselves held up by the robbers, who took their revolvers and herded them into a back room with patrons, the police said.</p>
      </block>
      <block class="full_text">
        <p>LEAD: Two police officers responding to a reported robbery at a Brooklyn tavern early yesterday were themselves held up by the robbers, who took their revolvers and herded them into a back room with patrons, the police said.</p>
      </block>'''
>>> soup = BeautifulSoup(s, 'html.parser')  # name the parser explicitly to avoid a warning
>>> for block in soup.find_all('block', class_='full_text'):
...     print(block.text)



LEAD: Two police officers responding to a reported robbery at a Brooklyn tavern early yesterday were themselves held up by the robbers, who took their revolvers and herded them into a back room with patrons, the police said.
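Since the question asks about every .xml article file, here is a hedged sketch of how the same lookup might be applied across a directory. The function names full_text_from_file and iter_corpus are hypothetical, the directory layout is an assumption, and the standard library's ElementTree is used in place of either parser shown above.

```python
from pathlib import Path
import xml.etree.ElementTree as ET


def full_text_from_file(path):
    """Return the text of all <p> elements inside <block class="full_text">."""
    root = ET.parse(str(path)).getroot()
    # ".//" searches at any depth, so the block may be nested inside other tags.
    paragraphs = root.iterfind(".//block[@class='full_text']/p")
    return "\n".join("".join(p.itertext()).strip() for p in paragraphs)


def iter_corpus(directory):
    """Yield (file name, full text) for every .xml file under `directory`."""
    for path in sorted(Path(directory).rglob("*.xml")):
        yield path.name, full_text_from_file(path)
```

A call such as `for name, text in iter_corpus("path/to/articles"):` (the path is a placeholder) would then give one string per article. Files without a "full_text" block yield an empty string rather than raising an error.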
