Extracting text from HTML with interspersed bold tags, preserving order

I am trying to extract the text from an HTML file with the following structure:

<td class='srctext'>
<pre>
    <b> Heading 1 </b>
    text
    more text
    <b> Heading 2 </b>
    even more text, 
    <b> also some bold text </b>
    and the last text
</pre>

To do this I'm using XPath, like

//td[@class='srctext']/pre/b

Doing this, I get the inner text of all bold tags, and I can also get the entire inner text of pre by using the string() wrapper.

However, what I am struggling to do is get a result like:

[
  'Heading 1',
  'text \n more text',
  'Heading 2',
  'even more text',
  ...
]

Please don't hesitate to ask if anything is unclear.

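Try //td[@class='srctext']/pre//text()[normalize-space()] as the XPath (assuming you have full XPath 1.0 support, e.g. with lxml, rather than the restricted XPath support in ElementTree). The [normalize-space()] predicate filters out text nodes that contain only whitespace.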

A full example:

from lxml import etree as ET
html = '''<html><body><table><tr><td class=srctext>
<pre>
    <b> Heading 1 </b>
    text
    more text
    <b> Heading 2 </b>
    even more text, 
    <b> also some bold text </b>
    and the last text
</pre>
</body>
</html>'''

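# ET.HTML uses lxml's lenient HTML parser, so the unclosed td/tr/table tags are tolerated
htmlEl = ET.HTML(html)
# Select every text node under pre that is not whitespace-only, in document order
textValues = htmlEl.xpath("//td[@class='srctext']/pre//text()[normalize-space()]")
print(textValues)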

and outputs

[' Heading 1 ', '\n    text\n    more text\n    ', ' Heading 2 ', '\n    even more text, \n    ', ' also some bold text ', '\n    and the last text\n']
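
The items above still carry the indentation and newlines around each text node, while the question asks for trimmed strings. A minimal sketch of that post-processing step, using the printed list as literal input so it runs without lxml (the names text_values and cleaned are only illustrative):

```python
# Text nodes exactly as printed by the lxml example above.
text_values = [
    ' Heading 1 ',
    '\n    text\n    more text\n    ',
    ' Heading 2 ',
    '\n    even more text, \n    ',
    ' also some bold text ',
    '\n    and the last text\n',
]

# Strip leading/trailing whitespace from each node; inner newlines are kept.
cleaned = [value.strip() for value in text_values]
print(cleaned)
```

In the full example, the same comprehension could presumably be applied directly to the result of htmlEl.xpath(...).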

If I understand your question correctly, you want to ignore the HTML structure and extract the pieces of text into a list, each list element being a string that contains no tags.

Normally, using regexes to parse XML or HTML is a terrible idea, but this question is one of the rare use cases for it. Assuming you have read the whole file into a single string t:

import re
[i.strip() for i in re.findall(r'(.*?)<.*?>', t, re.DOTALL) if len(i.strip()) > 0]

gives as expected:

['Heading 1', 'text\n    more text', 'Heading 2', 'even more text,', 'also some bold text', 'and the last text']
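
One caveat with the findall pattern is that it only captures text that is followed by a tag, so any text after the last tag would be missed. That does not matter for the sample, which ends with </pre>. As a hedged sketch, splitting on the tags instead keeps such trailing text. The helper name text_chunks and the sample string are made up for illustration:

```python
import re

def text_chunks(markup):
    """Split on tags and return the non-empty, stripped text pieces in order."""
    pieces = re.split(r'<.*?>', markup, flags=re.DOTALL)
    return [piece.strip() for piece in pieces if piece.strip()]

# Hypothetical input with text after the final closing tag.
sample = "<pre><b> Heading 1 </b>\n    text\n</pre> trailing text"

# The findall variant needs a tag after each chunk, so the tail is dropped.
with_findall = [i.strip() for i in re.findall(r'(.*?)<.*?>', sample, re.DOTALL) if i.strip()]
# The split variant keeps it.
with_split = text_chunks(sample)

print(with_findall)
print(with_split)
```

Both variants share the usual regex limitations: a literal < or > inside text or attribute values would confuse them.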
