Hi guys, I'm looking for a way to convert all the Unicode code points contained in a string to the corresponding HTML entities.
For instance:
"This is \u+0024. a string with \u+0024. random \u+0024. unicode"
"This is &#36; a string with &#36; random &#36; unicode"
My current solution to this problem looks like:
if "\\u+" in my_string:
    unicode_code = (label_content.split("\\u+"))[1].split('.')[0]
    unicode_to_replace = f"\\u+{unicode_code}."
    unicode_string = f"U+{unicode_code}"
    html_code = unicode_string.encode('ascii', 'xmlcharrefreplace')
    my_string = label_content.replace(unicode_to_replace, html_code)
But the Unicode string is not converted correctly. Any suggestions?
Thanks in advance!
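A note on why the attempt above may not behave as expected: as far as I can tell, the 'xmlcharrefreplace' error handler only replaces characters that cannot be encoded in the target encoding, and the text "U+0024" is itself plain ASCII, so it passes through unchanged. encode() also returns bytes, which str.replace() will not accept. The sketch below illustrates this; the variable names are only for illustration and are not from the original post.

```python
# The text "U+0024" is plain ASCII, so the error handler has nothing to replace.
as_text = "U+0024".encode("ascii", "xmlcharrefreplace")
print(as_text)  # b'U+0024' (note: bytes, not str)

# A real non-ASCII character is replaced with a decimal character reference.
ya = "\u042f".encode("ascii", "xmlcharrefreplace")
print(ya)  # b'&#1071;'

# Building an entity from a hex code does not need encode() at all.
hex_code = "0024"  # hypothetical value extracted from the input string
entity = f"&#{int(hex_code, 16)};"
print(entity)  # &#36;
```

Since "$" is an ASCII character, encode() would never turn it into an entity on its own; formatting the number directly is one way around that.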
I found a solution myself, for anybody who's interested. It differs a bit from what I asked: the output does not contain HTML entities; instead, each code is converted to the corresponding character, because that works better in my case.
So the final portion of code looks like this:
# Example of an input string containing some Unicode codes.
# This is how they are formatted in my input file.
my_string = "This is \u+0024. a string with \u+0024. random \u+0024. unicodes"
if "\\u+" in my_string:
    unicode_code = (my_string.split("\\u+"))[1].split('.')[0]
    unicode_to_replace = f"\\u+{unicode_code}."
    unicode = f"\\u{unicode_code}"
    # Where the \uXXXX escape is decoded into the actual character
    html_entity = unicode.encode('utf-8').decode('raw-unicode-escape')
    my_string = my_string.replace(unicode_to_replace, html_entity)
print(my_string)
# Output: This is $ a string with $ random $ unicodes
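To make the decoding step above easier to follow, here is a minimal sketch that isolates it and compares it with chr(int(code, 16)), which should give the same character. The variable names are mine, not from the post. One caveat to keep in mind: the \u escape expects exactly four hex digits, so the raw-unicode-escape route only suits four-digit codes, while chr() accepts any valid code point.

```python
code = "042F"  # hypothetical four-digit hex code taken from the input

# Build the six-character text \u042F, then let the codec interpret the escape.
escaped = f"\\u{code}"
decoded = escaped.encode("utf-8").decode("raw-unicode-escape")

# The same character obtained directly from the numeric code point.
direct = chr(int(code, 16))

print(decoded, direct, decoded == direct)
```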
I'd prefer applying regular expression operations (the re module). The pattern variable covers all the input variants seen so far: the input variable in the original question was edited three times (with or without a leading backslash and/or a trailing dot). In the test strings, U+042F is used instead of the middle U+0024. Also note that the my_string variable in the asker's self-answer is incorrect: in a regular string literal, '\u+0024' raises the "truncated \uXXXX escape" error. The script:
import re

def UPlusHtml(matchobj):
    # Drop the trailing dot, then swap the U+ prefix for '&#x' and append ';'
    code = re.sub(r'\.$', '', matchobj.group(0))
    return re.sub(r"^\\?[uU]\+", '&#x', code) + ';'

def UPlusRepl(matchobj):
    # Drop the trailing dot and the U+ prefix, then convert the hex code to a character
    code = re.sub(r'\.$', '', matchobj.group(0))
    return chr(int(re.sub(r"^\\?[uU]\+", '', code), 16))

pattern = r"(\\?[uU]\+[0-9a-fA-F]+\.?)"

input = "This is U+0024. a string with U+042f random U+0024. unicode"
print(input)
print(re.sub(pattern, UPlusHtml, input))
print(re.sub(pattern, UPlusRepl, input))
print('--')
my_string = "This is \\u+0024. a string with \\u+042F random \\u+0024. unicodes"
print(my_string)
print(re.sub(pattern, UPlusHtml, my_string))
print(re.sub(pattern, UPlusRepl, my_string))
Output : \\SO\\67105976.py
This is U+0024. a string with U+042f random U+0024. unicode
This is &#x0024; a string with &#x042f; random &#x0024; unicode
This is $ a string with Я random $ unicode
--
This is \u+0024. a string with \u+042F random \u+0024. unicodes
This is &#x0024; a string with &#x042F; random &#x0024; unicodes
This is $ a string with Я random $ unicodes
Please note that I'm a regex beginner myself, so I have no doubt that a more efficient regex-based solution exists…
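One possible simplification, offered as a sketch rather than a definitive answer: capturing only the hex digits in a group lets a single re.sub call do the work, without the nested substitutions. The names CODE_POINT, to_html_entities and to_characters are my own and purely illustrative; the accepted input formats are assumed to be the same as in the script above.

```python
import re

# Optional backslash, 'u' or 'U', a plus sign, the hex digits (captured),
# and an optional trailing dot that is consumed but not kept.
CODE_POINT = re.compile(r"\\?[uU]\+([0-9a-fA-F]+)\.?")

def to_html_entities(text):
    """Replace each U+XXXX marker with a hexadecimal character reference."""
    return CODE_POINT.sub(lambda m: f"&#x{m.group(1)};", text)

def to_characters(text):
    """Replace each U+XXXX marker with the character it names."""
    return CODE_POINT.sub(lambda m: chr(int(m.group(1), 16)), text)

sample = r"This is \u+0024. a string with \u+042F random \u+0024. unicodes"
print(to_html_entities(sample))
print(to_characters(sample))
```

Be aware that chr() raises ValueError for values above 0x10FFFF, and that the greedy hex class will also swallow hex letters that directly follow a code, so real input may need a stricter pattern.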