I am trying to extract the text from an HTML file with the following structure:
<td class='srctext'>
<pre>
<b> Heading 1 </b>
text
more text
<b> Heading 2 </b>
even more text,
<b> also some bold text </b>
and the last text
</pre>
To do this I'm using XPath, like
//td[@class='srctext']/pre/b
Doing this I get the inner text of all the bold tags, and I can also get the entire inner text of the pre by wrapping the expression in string().
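For example, with lxml (page.html here is just a placeholder for my file), the two queries look like:

from lxml import etree as ET

tree = ET.parse('page.html', ET.HTMLParser())  # page.html stands in for the real file
bold_texts = tree.xpath("//td[@class='srctext']/pre/b/text()")  # inner text of each <b>
full_text = tree.xpath("string(//td[@class='srctext']/pre)")  # entire inner text of the <pre>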
However, what I am struggling with is getting a result like:
[
'Heading 1',
'text \n more text',
'Heading 2',
'even more text',
...
]
Please don't hesitate to ask if anything is unclear.
Try //td[@class='srctext']/pre//text()[normalize-space()]
as the XPath (assuming you have full XPath 1.0 support, e.g. via lxml, and not the restricted ElementTree XPath subset).
A full example:
from lxml import etree as ET
html = '''<html><body><table><tr><td class='srctext'>
<pre>
<b> Heading 1 </b>
text
more text
<b> Heading 2 </b>
even more text,
<b> also some bold text </b>
and the last text
</pre>
</td></tr></table>
</body>
</html>'''
htmlEl = ET.HTML(html)
textValues = htmlEl.xpath("//td[@class='srctext']/pre//text()[normalize-space()]")
print(textValues)
which outputs:
[' Heading 1 ', '\n text\n more text\n ', ' Heading 2 ', '\n even more text, \n ', ' also some bold text ', '\n and the last text\n']
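If you also want the surrounding whitespace removed, as in the list shown in the question, stripping each entry afterwards is enough; a minimal follow-up sketch:

stripped = [s.strip() for s in textValues]
print(stripped)  # leading/trailing whitespace is gone; newlines inside a piece remain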
If I understand your question correctly, you want to ignore the HTML structure and extract the pieces of text into a list, with each element being a string that contains no tags.
Normally, using regexes to parse XML or HTML is a terrible idea, but this question is one of the rare use cases for it. Assuming you have read the whole file into a single string t:
[i.strip() for i in re.findall(r'(.*?)<.*?>', t, re.DOTALL) if i.strip()]
gives as expected:
['Heading 1', 'text\n more text', 'Heading 2', 'even more text,', 'also some bold text', 'and the last text']
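For completeness, a self-contained version of the same idea (the filename page.html is my own placeholder):

import re

with open('page.html', encoding='utf-8') as f:
    t = f.read()  # the whole file as one string

# capture the text that precedes each tag, then drop whitespace-only pieces
pieces = [i.strip() for i in re.findall(r'(.*?)<.*?>', t, re.DOTALL) if i.strip()]
print(pieces)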