Python: Convert String with Unicode to HTML numeric code

Hi guys, I'm looking for a way to convert all the Unicode codes contained in a string to the corresponding HTML entities.

For instance:

"This is \u+0024. a string with \u+0024. random \u+0024. unicode"
"This is &#36; a string with &#36; random &#36; unicode"

My current solution to this problem looks like:

if "\\u+" in my_string:
  unicode_code = (label_content.split("\\u+"))[1].split('.')[0]
  unicode_to_replace = f"\\u+{unicode_code}."
  unicode_string = f"U+{unicode_code}"
  html_code = unicode_string.encode('ascii', 'xmlcharrefreplace')
  my_string = label_content.replace(unicode_to_replace,  html_code)

But the Unicode string is not converted correctly. Any suggestions?

Thanks in advance!

I found a solution myself, for anybody who's interested. It differs a bit from what I asked: the output does not turn the Unicode codes into HTML entities but into the corresponding characters, because that works better in my case.

So the final portion of code looks like this:

# e.g. of an input string containing some sort of unicodes.
# This is how they are formatted in my input file.
my_string =  "This is \u+0024. a string with \u+0024. random \u+0024. unicodes" 

if "\\u+" in my_string:
  # Only the first code found is extracted; every occurrence of that same code is replaced
  unicode_code = (my_string.split("\\u+"))[1].split('.')[0]
  unicode_to_replace = f"\\u+{unicode_code}."
  unicode = f"\\u{unicode_code}"
  # Where the \uXXXX escape is decoded to the actual character (not an HTML entity)
  html_entity = unicode.encode('utf-8').decode('raw-unicode-escape')
  my_string = my_string.replace(unicode_to_replace, html_entity)


print(my_string)
my_string >> "This is $ a string with $ random $ unicodes"
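
The decoding step above can be looked at in isolation. The following is only a small sketch, not part of the original answer: it shows what the `raw-unicode-escape` round trip does to a single escape, and why the `xmlcharrefreplace` call in the question appears to do nothing. The variable names are made up for illustration.

```python
# The escape once "+" and the trailing dot are stripped: a literal backslash,
# "u", and four hex digits (six characters in total).
escaped = "\\u0024"

# Encoding gives the bytes b"\\u0024"; the raw-unicode-escape codec then
# interprets the \uXXXX sequence and returns the character itself.
decoded = escaped.encode("utf-8").decode("raw-unicode-escape")
print(decoded)  # $

# The attempt in the question: the text "U+0024" is already plain ASCII, so
# xmlcharrefreplace has nothing to replace and the bytes come back unchanged.
unchanged = "U+0024".encode("ascii", "xmlcharrefreplace")
print(unchanged)  # b'U+0024'

# xmlcharrefreplace only acts on real characters the target codec cannot encode.
entity = "Я".encode("ascii", "xmlcharrefreplace").decode("ascii")
print(entity)  # &#1071;
```

Note also that `encode()` returns `bytes`, so passing its result straight to `str.replace()`, as the question's code does, raises a `TypeError` unless it is decoded first.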

I'd prefer using regular expression operations (the re module). The pattern variable covers

  • all valid Unicode values (see e.g. U+042F in place of the middle U+0024), and
  • all syntax variants of the input string: the input in the original question was edited three times (with/without a leading backslash and/or a trailing dot).

Note also that the my_string variable in the OQ's self-answer is incorrect: '\u+0024' in a normal string literal raises the "truncated \uXXXX escape" error.
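
The escape error can be reproduced without breaking the example file itself by compiling the offending literal from text. This is just an illustrative sketch; `bad_literal`, `ok_raw` and `ok_escaped` are hypothetical names, not taken from the posts above.

```python
# Source text of a normal (non-raw) string literal containing \u+0024.
# It is held in a raw string here so that this file itself stays valid.
bad_literal = r'"This is \u+0024. a string"'

message = ""
try:
    # Python expects exactly four hex digits after \u and finds "+" instead.
    compile(bad_literal, "<example>", "eval")
except SyntaxError as exc:
    message = str(exc)
print(message)

# Two ways to write the same text without triggering the escape handling:
ok_raw = r"This is \u+0024. a string"        # raw string literal
ok_escaped = "This is \\u+0024. a string"    # doubled backslash
print(ok_raw == ok_escaped)  # True
```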

The script:

import re

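def UPlusHtml(matchobj):
    # Drop the optional trailing dot, then swap the "U+" / "\u+" prefix for "&#x"
    # and close the entity with ";", e.g. "\u+0024." -> "&#x0024;".
    code = re.sub(r'\.$', '', matchobj.group(0))
    return re.sub(r"^\\?[uU]\+", '&#x', code) + ';'

def UPlusRepl(matchobj):
    # Same clean-up, then parse the hex digits and return the character itself,
    # e.g. "U+042f" -> "Я".
    code = re.sub(r'\.$', '', matchobj.group(0))
    return chr(int(re.sub(r"^\\?[uU]\+", '', code), 16))

# Optional backslash, "u+" or "U+", one or more hex digits, optional trailing dot.
pattern = r"(\\?[uU]\+[0-9a-fA-F]+\.?)"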

input = "This is U+0024. a string with U+042f random U+0024. unicode"

print( input )
print( re.sub( pattern, UPlusHtml, input ) )
print( re.sub( pattern, UPlusRepl, input ) )

print('--')

my_string =  "This is \\u+0024. a string with \\u+042F random \\u+0024. unicodes"

print( my_string )
print( re.sub( pattern, UPlusHtml, my_string ) )
print( re.sub( pattern, UPlusRepl, my_string ) )

Output : \\SO\\67105976.py

This is U+0024. a string with U+042f random U+0024. unicode
This is &#x0024; a string with &#x042f; random &#x0024; unicode
This is $ a string with Я random $ unicode
--
This is \u+0024. a string with \u+042F random \u+0024. unicodes
This is &#x0024; a string with &#x042F; random &#x0024; unicodes
This is $ a string with Я random $ unicodes
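
The entity form and the character form describe the same text: an HTML renderer displays the former as the latter. As a quick check (a sketch using the standard library's `html.unescape`, with hypothetical variable names), the entity string that `UPlusHtml` produces for the first input decodes back to the character string:

```python
import html

# Hexadecimal numeric character references, as produced for the first input.
entity_text = "This is &#x0024; a string with &#x042f; random &#x0024; unicode"

# html.unescape resolves numeric (and named) character references.
char_text = html.unescape(entity_text)
print(char_text)  # This is $ a string with Я random $ unicode
```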

Please note that I'm a regex beginner myself, so I have no doubt that a more efficient regex-based solution must exist…
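
One possible tightening, offered only as a sketch and not as the definitive answer: capture the hex digits in a group so that a single `re.sub` pass does all the work. The names `TOKEN`, `to_hex_entities`, `to_decimal_entities` and `to_characters` are hypothetical, and the token shape is assumed from the examples in this thread.

```python
import re

# Assumed token shape: optional backslash, "u+" or "U+", hex digits (captured),
# optional trailing dot.
TOKEN = re.compile(r"\\?[uU]\+([0-9a-fA-F]+)\.?")

def to_hex_entities(text):
    # "\u+0024." -> "&#x0024;"
    return TOKEN.sub(lambda m: f"&#x{m.group(1)};", text)

def to_decimal_entities(text):
    # "\u+0024." -> "&#36;"
    return TOKEN.sub(lambda m: f"&#{int(m.group(1), 16)};", text)

def to_characters(text):
    # "\u+0024." -> "$"
    return TOKEN.sub(lambda m: chr(int(m.group(1), 16)), text)

sample = "This is \\u+0024. a string with U+042f random \\u+0024. unicode"
print(to_hex_entities(sample))      # This is &#x0024; a string with &#x042f; random &#x0024; unicode
print(to_decimal_entities(sample))  # This is &#36; a string with &#1071; random &#36; unicode
print(to_characters(sample))        # This is $ a string with Я random $ unicode
```

Two caveats apply to this sketch as much as to the script above: a code that happens to end a sentence loses its full stop, because the trailing dot is treated as part of the token, and `chr()` raises `ValueError` for values above 0x10FFFF.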

