Python: Convert a string containing Unicode to HTML numeric codes

Question

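Hi all, I'm looking for a way to convert every Unicode code point embedded in a string into the corresponding HTML entity.

For example:

Input: "This is \u+0024. a string with \u+0024. random \u+0024. unicode"
Output: "This is &#36; a string with &#36; random &#36; unicode"

My current attempt at solving this is as follows: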

if "\\u+" in my_string:
  unicode_code = (label_content.split("\\u+"))[1].split('.')[0]
  unicode_to_replace = f"\\u+{unicode_code}."
  unicode_string = f"U+{unicode_code}"
  html_code = unicode_string.encode('ascii', 'xmlcharrefreplace')
  my_string = label_content.replace(unicode_to_replace,  html_code)

But the Unicode string is not converted correctly. Any suggestions?

Thanks in advance!

Answer 1

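For anyone interested, I found a solution myself. It is slightly different from what I asked: the output does not show the Unicode code points as HTML entities but converts them to the corresponding characters, which works better in my case.

So the final part of the code looks like this: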

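# Example of an input string containing code points in the "\u+XXXX." format.
# This is how they are formatted in my input file.
# Raw string: in a plain literal, "\u+0024" is a SyntaxError
# (truncated \uXXXX escape).
my_string = r"This is \u+0024. a string with \u+0024. random \u+0024. unicodes"

if "\\u+" in my_string:
    # Hex digits of the first code point found, e.g. "0024"
    unicode_code = my_string.split("\\u+")[1].split('.')[0]
    unicode_to_replace = f"\\u+{unicode_code}."
    escape_sequence = f"\\u{unicode_code}"
    # Decode the \uXXXX escape into the actual character (not an HTML entity)
    character = escape_sequence.encode('utf-8').decode('raw-unicode-escape')
    # Only the first code point is looked up; this works here because
    # every code point in the string is the same.
    my_string = my_string.replace(unicode_to_replace, character)

print(my_string)
# This is $ a string with $ random $ unicodes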
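Since the snippet above only looks up the first code point and yields characters instead of entities, here is a minimal sketch of the same split-based idea that handles every marker and emits decimal numeric references, as the question originally asked. The function name `to_html_entities` is hypothetical, and the sketch assumes every code point is followed by a dot, as in the sample input. It also shows why `xmlcharrefreplace` alone does not help for `$`: that error handler only rewrites characters that cannot be encoded as ASCII.

```python
def to_html_entities(text, marker="\\u+"):
    """Replace each '\\u+XXXX.' marker with a decimal reference like '&#36;'.

    Assumption: every marker is terminated by a dot.
    """
    head, *rest = text.split(marker)
    parts = [head]
    for chunk in rest:
        # chunk looks like "0024. a string with "
        hex_digits, _, tail = chunk.partition(".")
        parts.append(f"&#{int(hex_digits, 16)};{tail}")
    return "".join(parts)


sample = r"This is \u+0024. a string with \u+042F. random \u+0024. unicodes"
result = to_html_entities(sample)
print(result)
# This is &#36; a string with &#1071; random &#36; unicodes

# xmlcharrefreplace leaves ASCII characters such as "$" untouched:
ascii_only = "$Я".encode("ascii", "xmlcharrefreplace")
print(ascii_only)
# b'$&#1071;'
```

If the trailing dot can be missing, `int()` will raise a `ValueError` here, so a regex-based approach is probably the safer choice for mixed input.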

Answer 2

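I prefer using regular expressions (the re module). The pattern variable covers

any hexadecimal code point value (see e.g. U+042F in place of the middle U+0024),
all syntactic variants of the input string: the input variable in the original question was edited 3 times (with/without a leading backslash and/or a trailing dot), and
the my_string variable from the OP's self-answer, which is incorrect as a plain string literal: '\u+0024' raises a "truncated \uXXXX escape" error.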
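As a quick check of the points above, the following sketch tests a few spellings against the pattern and confirms that a non-raw literal containing `\u+0024` does not compile. The `variants` list and the helper `literal_compiles` are made up for illustration; the pattern is the same one used in the script.

```python
import re

# Optional backslash, "u+" or "U+", hex digits, optional trailing dot.
pattern = r"(\\?[uU]\+[0-9a-fA-F]+\.?)"

variants = ["U+0024", "U+0024.", r"\u+0024", r"\u+0024.", "u+042f"]
matched = [bool(re.fullmatch(pattern, v)) for v in variants]
print(matched)


def literal_compiles(source):
    """Return True if `source` is a valid Python expression."""
    try:
        compile(source, "<demo>", "eval")
        return True
    except SyntaxError:
        return False


plain_ok = literal_compiles('"\\u+0024"')   # source text: "\u+0024"
raw_ok = literal_compiles('r"\\u+0024"')    # source text: r"\u+0024"
print(plain_ok, raw_ok)
```

Every variant should match, the plain literal should be rejected, and the raw literal should be accepted.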

Script:

import re

def UPlusHtml(matchobj):
    # Drop the optional trailing dot, then swap the "\u+" / "U+" prefix for "&#x"
    return re.sub(r"^\\?[uU]\+", '&#x',
                  re.sub(r'\.$', '', matchobj.group(0))) + ';'

def UPlusRepl(matchobj):
    # Drop the trailing dot and the prefix, then turn the hex digits into a character
    return chr(int(re.sub(r"^\\?[uU]\+", '',
                          re.sub(r'\.$', '', matchobj.group(0))), 16))

# Optional backslash, "u+" or "U+", hex digits, optional trailing dot
pattern = r"(\\?[uU]\+[0-9a-fA-F]+\.?)"

input = "This is U+0024. a string with U+042f random U+0024. unicode"

print(input)
print(re.sub(pattern, UPlusHtml, input))
print(re.sub(pattern, UPlusRepl, input))

print('--')

my_string = "This is \\u+0024. a string with \\u+042F random \\u+0024. unicodes"

print(my_string)
print(re.sub(pattern, UPlusHtml, my_string))
print(re.sub(pattern, UPlusRepl, my_string))

Output: \\SO\\67105976.py

This is U+0024. a string with U+042f random U+0024. unicode
This is &#x0024; a string with &#x042f; random &#x0024; unicode
This is $ a string with Я random $ unicode
--
This is \u+0024. a string with \u+042F random \u+0024. unicodes
This is &#x0024; a string with &#x042F; random &#x0024; unicodes
This is $ a string with Я random $ unicodes

Note that I'm a regex beginner myself, so I'm sure a more efficient regex-based solution must exist...
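One possible simplification, offered as a sketch and not as a definitive improvement: let a capture group pick out the hex digits so that the nested `re.sub` calls are no longer needed. The names `CODE_POINT`, `to_entity` and `to_char` are hypothetical.

```python
import re

# Group 1 captures only the hex digits; the prefix and the optional
# trailing dot are matched but dropped from the replacement.
CODE_POINT = re.compile(r"\\?[uU]\+([0-9a-fA-F]+)\.?")


def to_entity(match):
    # Hexadecimal numeric character reference, e.g. "&#x0024;"
    return f"&#x{match.group(1)};"


def to_char(match):
    # The actual character, e.g. "$"
    return chr(int(match.group(1), 16))


text = "This is \\u+0024. a string with U+042f random \\u+0024. unicodes"
as_entities = CODE_POINT.sub(to_entity, text)
as_chars = CODE_POINT.sub(to_char, text)
print(as_entities)
# This is &#x0024; a string with &#x042f; random &#x0024; unicodes
print(as_chars)
# This is $ a string with Я random $ unicodes
```

As with the original pattern, the hex digits are matched greedily, so a code point immediately followed by text starting with a letter a-f or a digit would be misread, and `chr()` raises a `ValueError` for values above 0x10FFFF.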

Answer 1
0 2021-04-19 08:14:07

Answer 2
0 accepted 2021-04-19 18:02:13
