
Search for special HTML characters in text of lxml.html elements

Given an ordered or unordered list, I have to check whether special HTML arrows are being used (and replace them with LaTeX arrows). Using lxml.html is a requirement.

I was tinkering around but then I couldn't get past the following:

import lxml.html

my_string = "<li>I have a dream &#8594; Hello!</li>"
elem = lxml.html.fromstring(my_string)

if "&#8594;" in my_string:    # True
    print("foo")

if "&#8594;" in elem.text:    # False
    print("bar")

I am unable to understand why the second if-condition evaluates to False. How can I check whether "→" (&#8594;) exists in elem.text?

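You need to search for the Unicode character itself. lxml decodes the &#8594; entity while parsing, so elem.text contains the character → (U+2192), not the entity text: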

>>> s = u"→"
>>> s
u'\u2192'

>>> import lxml.html
>>> 
>>> my_string = "<li>I have a dream &#8594; Hello!</li>"
>>> elem = lxml.html.fromstring(my_string)
>>> 
>>> if u'\u2192' in elem.text:
...     print("bar")
... 
bar
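
As a small sketch of why the check above works (written for Python 3, where the u prefix is optional), the snippet below feeds lxml three spellings of the same arrow: decimal, hexadecimal, and the named entity &rarr;. The markup string and variable names are made up for illustration. It assumes the parser resolves all three forms to U+2192, which is the usual behaviour for HTML character references.

```python
import lxml.html

# Three spellings of the same arrow: decimal, hexadecimal, and named entity.
# The markup is a made-up example, not taken from the question.
markup = "<li>a &#8594; b &#x2192; c &rarr; d</li>"
elem = lxml.html.fromstring(markup)

arrow = "\u2192"  # RIGHTWARDS ARROW

# Count decoded arrow characters in the parsed text.
arrow_count = elem.text.count(arrow)

# The literal entity text should no longer be present after parsing.
has_entity_text = "&#8594;" in elem.text

print(arrow_count, has_entity_text)
```

Searching for the decoded character therefore covers every entity spelling at once, which is simpler than testing for each entity string separately.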

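...and if you're looking to replace the character, import "re" and run the substitution on the parsed text. (my_string itself still contains the entity &#8594; rather than the character, so substituting there would change nothing.)

import re
new_text = re.sub(u'\u2192', '&rarr;', elem.text)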
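
Since the question asks for LaTeX arrows across a whole list, here is a minimal sketch of one way that could look. The ARROWS mapping, the helper names replace_arrows and latexify_arrows, and the sample markup are all hypothetical; the choice of $\rightarrow$ and $\leftarrow$ as replacements is an assumption about the desired LaTeX output. It walks every element and rewrites both .text and .tail, because text that follows a nested tag (such as <b>) lives in that tag's .tail rather than in the list item's .text.

```python
import lxml.html

# Hypothetical mapping from arrow characters to LaTeX; extend as needed.
ARROWS = {
    "\u2192": r"$\rightarrow$",  # right arrow, &#8594; / &rarr;
    "\u2190": r"$\leftarrow$",   # left arrow, &#8592; / &larr;
}


def replace_arrows(text):
    """Return text with known arrow characters replaced; pass None through."""
    if text is None:
        return None
    for char, latex in ARROWS.items():
        text = text.replace(char, latex)
    return text


def latexify_arrows(root):
    """Rewrite .text and .tail of every element under root, in place."""
    for node in root.iter("*"):  # "*" selects elements only, skipping comments
        node.text = replace_arrows(node.text)
        node.tail = replace_arrows(node.tail)
    return root


html = (
    "<ul>"
    "<li>I have a dream &#8594; Hello!</li>"
    "<li>Back &#8592; <b>again</b> &rarr; done</li>"
    "</ul>"
)
root = lxml.html.fromstring(html)
latexify_arrows(root)

# Collect the visible text of each list item after replacement.
items = [li.text_content() for li in root.iter("li")]
print(items)
```

Plain str.replace is used here instead of re.sub because the search terms are fixed single characters; re.sub would work as well but needs the backslashes in the replacement string escaped.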


 