Python: Convert String with Unicode to HTML numeric code

Hi guys, I'm looking for a solution to convert all the Unicode code points contained in a string to the corresponding HTML entities.

For instance:

input: "This is \u+0024. a string with \u+0024. random \u+0024. unicode"
output: "This is &#36; a string with &#36; random &#36; unicode"

My current solution to this problem looks like:

if "\\u+" in my_string:
  unicode_code = (label_content.split("\\u+"))[1].split('.')[0]
  unicode_to_replace = f"\\u+{unicode_code}."
  unicode_string = f"U+{unicode_code}"
  html_code = unicode_string.encode('ascii', 'xmlcharrefreplace')
  my_string = label_content.replace(unicode_to_replace,  html_code)

But the Unicode string is not converted the right way. Any suggestions?

Thanks in advance!
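
A note on why the attempt above leaves the text unchanged: the string "U+0024" consists only of ASCII characters, so `encode('ascii', 'xmlcharrefreplace')` has nothing to replace, and it returns `bytes` rather than `str`. The sketch below shows that behaviour and one possible way to build a decimal reference from the hex digits. It assumes the digits have already been extracted, and the helper name `hex_to_entity` is made up for illustration.

```python
# "U+0024" is plain ASCII text, so xmlcharrefreplace leaves it untouched.
unchanged = "U+0024".encode("ascii", "xmlcharrefreplace")  # b'U+0024'

# The error handler only rewrites characters the codec cannot encode,
# such as the actual character U+042F.
cyrillic = "\u042f".encode("ascii", "xmlcharrefreplace")  # b'&#1071;'


def hex_to_entity(hex_code: str) -> str:
    """Hypothetical helper: '0024' -> '&#36;' (decimal character reference)."""
    return f"&#{int(hex_code, 16)};"


print(unchanged, cyrillic, hex_to_entity("0024"))
```

Formatting the reference by hand is needed for a character like `$`, because it is ASCII and the error handler would never be triggered for it.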

I found a solution myself, for anybody who's interested. It differs a bit from what I asked: the output does not convert the Unicode codes to HTML entities but to the corresponding characters, because that works better in my case.

So the final portion of code looks like this:

# Example of an input string containing some Unicode codes.
# This is how they are formatted in my input file.
# Raw string: in a normal literal, "\u+0024" raises a "truncated \uXXXX escape" SyntaxError.
my_string = r"This is \u+0024. a string with \u+0024. random \u+0024. unicodes"

if "\\u+" in my_string:
    # Only the first code is read; every occurrence of that same code is replaced.
    unicode_code = my_string.split("\\u+")[1].split('.')[0]
    unicode_to_replace = f"\\u+{unicode_code}."
    unicode_escape = f"\\u{unicode_code}"
    # Where the escape sequence is converted to the actual character
    char = unicode_escape.encode('utf-8').decode('raw-unicode-escape')
    my_string = my_string.replace(unicode_to_replace, char)

print(my_string)
# This is $ a string with $ random $ unicodes
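
One limitation worth noting: the snippet above reads only the first code it finds, so a string that mixes different code points is only partly converted. A loop-based sketch without `re` could look like the following. It assumes every code is written as `\u+XXXX.` with a trailing dot, and the function name `replace_all_codes` is hypothetical.

```python
def replace_all_codes(text: str) -> str:
    """Replace every '\\u+XXXX.' marker with the character it names."""
    marker = "\\u+"
    while marker in text:
        # Hex digits sit between the first marker and the next dot.
        hex_code = text.split(marker, 1)[1].split(".", 1)[0]
        target = f"{marker}{hex_code}."
        if target not in text:
            # Assumption violated (no trailing dot); stop instead of looping forever.
            raise ValueError(f"code without trailing dot: {hex_code!r}")
        text = text.replace(target, chr(int(hex_code, 16)))
    return text


mixed = r"This is \u+0024. a string with \u+042F. random \u+0024. unicodes"
print(replace_all_codes(mixed))
```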

I'd prefer applying regular expression operations (the re module). The pattern variable covers

  • all valid Unicode values (see e.g. U+042F instead of the middle U+0024), and
  • all syntax versions of the input string: the input variable in the original question was edited three times (with/without leading backslash and/or trailing dot).

Note that the my_string variable in the OP's self-answer, as originally posted, is incorrect: in a normal (non-raw) string literal, '\u+0024' raises the truncated \uXXXX escape error.
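
To illustrate the syntax variants mentioned above, here is a small check of the same pattern against a few spellings. The sample strings and the variable names `variants`, `sentence`, and `found` are illustrative assumptions, not taken from the question.

```python
import re

# Optional backslash, 'u' or 'U', '+', hex digits, optional trailing dot.
pattern = r"(\\?[uU]\+[0-9a-fA-F]+\.?)"

# With/without leading backslash, with/without trailing dot, either letter case.
variants = ["U+0024", "U+0024.", r"\u+0024", r"\U+042F.", "u+042f"]
all_match = all(re.fullmatch(pattern, v) for v in variants)

sentence = r"price \u+0024. sign, letter U+042f here"
found = re.findall(pattern, sentence)

print(all_match, found)
```

Because the trailing dot is optional and gets consumed when present, a code that happens to end a sentence would lose its full stop. Whether that matters depends on the input format.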

The script:

import re

def UPlusHtml(matchobj):
    # Strip the optional trailing dot, then swap the (\)U+ prefix for '&#x'.
    return re.sub(r"^\\?[uU]\+", '&#x',
                  re.sub(r'\.$', '', matchobj.group(0))) + ';'

def UPlusRepl(matchobj):
    # Strip the optional trailing dot and the (\)U+ prefix, then convert the hex digits.
    return chr(int(re.sub(r"^\\?[uU]\+", '',
                          re.sub(r'\.$', '', matchobj.group(0))), 16))

# Optional backslash, 'u' or 'U', '+', hex digits, optional trailing dot.
pattern = r"(\\?[uU]\+[0-9a-fA-F]+\.?)"

# Named input_string to avoid shadowing the built-in input().
input_string = "This is U+0024. a string with U+042f random U+0024. unicode"

print(input_string)
print(re.sub(pattern, UPlusHtml, input_string))
print(re.sub(pattern, UPlusRepl, input_string))

print('--')

my_string = "This is \\u+0024. a string with \\u+042F random \\u+0024. unicodes"

print(my_string)
print(re.sub(pattern, UPlusHtml, my_string))
print(re.sub(pattern, UPlusRepl, my_string))

Output : \\SO\\67105976.py

This is U+0024. a string with U+042f random U+0024. unicode
This is &#x0024; a string with &#x042f; random &#x0024; unicode
This is $ a string with Я random $ unicode
--
This is \u+0024. a string with \u+042F random \u+0024. unicodes
This is &#x0024; a string with &#x042F; random &#x0024; unicodes
This is $ a string with Я random $ unicodes

Please note that I'm a regex beginner myself, so I have no doubt that a more efficient regex-based solution exists.
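
As one possible simplification, a capture group around the hex digits lets a single `re.sub` call do the work without the nested substitutions. This is only a sketch under the same input assumptions as the script above. The names `CODE`, `to_html`, and `to_char` are hypothetical and not part of the original answer.

```python
import re

# Same syntax as the pattern above, but only the hex digits are captured.
CODE = re.compile(r"\\?[uU]\+([0-9a-fA-F]+)\.?")


def to_html(text: str) -> str:
    """Replace each code with a hexadecimal HTML character reference."""
    return CODE.sub(lambda m: f"&#x{m.group(1)};", text)


def to_char(text: str) -> str:
    """Replace each code with the character it denotes."""
    return CODE.sub(lambda m: chr(int(m.group(1), 16)), text)


sample = "This is \\u+0024. a string with \\u+042F random \\u+0024. unicodes"
print(to_html(sample))
print(to_char(sample))
```

Like the original script, `to_char` would raise a `ValueError` for hex values above 10FFFF, since `chr` rejects them.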
