Extracting text inside XML tags in Python (while avoiding the <p> tags)

Question

I'm working with the NYT corpus in Python and attempting to extract only what's located inside the "full_text" class of every .xml article file. For example:

<body.content>
      <block class="lead_paragraph">
        <p>LEAD: Two police officers responding to a reported robbery at a Brooklyn tavern early yesterday were themselves held up by the robbers, who took their revolvers and herded them into a back room with patrons, the police said.</p>
      </block>
      <block class="full_text">
        <p>LEAD: Two police officers responding to a reported robbery at a Brooklyn tavern early yesterday were themselves held up by the robbers, who took their revolvers and herded them into a back room with patrons, the police said.</p>
      </block>

Ideally, I'd like to parse out only the string, yielding "LEAD: Two police officers responding to a reported robbery...", but I'm unsure what the best approach would be. Is this something that can be easily parsed by regex? If so, nothing I've attempted seems to work.

Any advice would be appreciated!

Answer 1

Is this something that can be easily parsed by regex?

Don't!

Use an XML parser like lxml.

ex = """
<body.content>
      <block class="lead_paragraph">
        <p>LEAD: Two police officers responding to a reported robbery at a Brooklyn tavern early yesterday were themselves held up by the robbers, who took their revolvers and herded them into a back room with patrons, the police said.</p>
      </block>
      <block class="full_text">
        <p>LEAD: Two police officers responding to a reported robbery at a Brooklyn tavern early yesterday were themselves held up by the robbers, who took their revolvers and herded them into a back room with patrons, the police said.</p>
      </block>
</body.content>"""

from lxml import etree

root = etree.fromstring(ex)
# Select the <p> inside the block whose class attribute is "full_text"
print(root.findtext('./block[@class="full_text"]/p'))

Output:

LEAD: Two police officers responding to a reported robbery at a 
Brooklyn tavern early yesterday were themselves held up by the robbers, who
took their revolvers and herded them into a back room with patrons, the 
police said.
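
A call to findtext returns only the first matching <p>, so an article whose full_text block holds several paragraphs needs a loop. The following is a minimal sketch of that idea, not part of the original answer: it swaps lxml for the standard library's xml.etree.ElementTree (which supports the same path syntax used here), and both the shortened SAMPLE document and the helper name full_text_paragraphs are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened, made-up sample in the same shape as the question's XML.
SAMPLE = """
<body.content>
  <block class="lead_paragraph">
    <p>LEAD: first paragraph.</p>
  </block>
  <block class="full_text">
    <p>LEAD: first paragraph.</p>
    <p>Second paragraph with <i>inline</i> markup.</p>
  </block>
</body.content>"""


def full_text_paragraphs(xml_string):
    """Return the text of each <p> inside <block class="full_text">."""
    root = ET.fromstring(xml_string)
    paragraphs = []
    # .// searches at any depth; the predicate filters on the class attribute.
    for p in root.findall(".//block[@class='full_text']/p"):
        # itertext() also collects text inside nested inline tags.
        paragraphs.append("".join(p.itertext()).strip())
    return paragraphs


if __name__ == "__main__":
    for paragraph in full_text_paragraphs(SAMPLE):
        print(paragraph)
```

Filtering on the class attribute keeps the lead_paragraph block out of the result, so the lead is not returned twice.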

Answer 2

You could also use the BeautifulSoup parser.

>>> from bs4 import BeautifulSoup
>>> s = '''<body.content>
      <block class="lead_paragraph">
        <p>LEAD: Two police officers responding to a reported robbery at a Brooklyn tavern early yesterday were themselves held up by the robbers, who took their revolvers and herded them into a back room with patrons, the police said.</p>
      </block>
      <block class="full_text">
        <p>LEAD: Two police officers responding to a reported robbery at a Brooklyn tavern early yesterday were themselves held up by the robbers, who took their revolvers and herded them into a back room with patrons, the police said.</p>
      </block>'''
>>> soup = BeautifulSoup(s, 'html.parser')  # name the parser explicitly to avoid a warning
>>> for i in soup.find_all('block', class_="full_text"):
...     print(i.get_text(strip=True))
...



LEAD: Two police officers responding to a reported robbery at a Brooklyn tavern early yesterday were themselves held up by the robbers, who took their revolvers and herded them into a back room with patrons, the police said.
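
Since the question asks about every .xml article file, the same extraction can be wrapped in a loop over a directory. This is a rough sketch under stated assumptions rather than either answerer's code: it uses the standard library's xml.etree.ElementTree instead of BeautifulSoup, the directory name nyt_corpus is a placeholder, the helper names extract_full_text and iter_corpus are hypothetical, and it assumes the full_text block may sit at any depth below the root element.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def extract_full_text(path):
    """Return the full_text paragraphs of one article file, joined by newlines."""
    root = ET.parse(path).getroot()
    paragraphs = root.findall(".//block[@class='full_text']/p")
    return "\n".join("".join(p.itertext()).strip() for p in paragraphs)


def iter_corpus(directory):
    """Yield (file name, full text) for every .xml file under directory."""
    for path in sorted(Path(directory).rglob("*.xml")):
        try:
            yield path.name, extract_full_text(path)
        except ET.ParseError as err:
            # Skip malformed files rather than abort the whole run.
            print(f"skipping {path}: {err}")


if __name__ == "__main__":
    # "nyt_corpus" is a placeholder directory name.
    for name, text in iter_corpus("nyt_corpus"):
        print(name, text[:80])
```

Yielding results one file at a time avoids holding the whole corpus in memory.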

Answer 1
0 2015-03-23 02:52:36

Answer 2
0 Accepted 2015-03-23 03:31:40
