Python: Convert String with Unicode to HTML numeric code
Hi guys, I'm looking for a way to convert all the Unicode code points contained in a string to the corresponding HTML entities.
For instance:
input: "This is \u+0024. a string with \u+0024. random \u+0024. unicode"
output: "This is &#36; a string with &#36; random &#36; unicode"
My current attempt at this problem looks like this:
if "\\u+" in my_string:
    unicode_code = (label_content.split("\\u+"))[1].split('.')[0]
    unicode_to_replace = f"\\u+{unicode_code}."
    unicode_string = f"U+{unicode_code}"
    html_code = unicode_string.encode('ascii', 'xmlcharrefreplace')
    my_string = label_content.replace(unicode_to_replace, html_code)
But the Unicode string is not converted correctly. Any suggestions?
Thanks in advance!
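A likely reason the attempt above does not convert anything, shown as a small sketch: the 'xmlcharrefreplace' error handler only replaces characters that cannot be encoded as ASCII, and the text "U+0024" is itself plain ASCII. In addition, encode() returns bytes, which str.replace() does not accept. The variable names below are illustrative only and are not taken from the question.

```python
# "U+0024" is just six ASCII characters, so nothing is replaced,
# and encode() returns bytes rather than str.
literal = "U+0024".encode("ascii", "xmlcharrefreplace")

# A real non-ASCII character is replaced by a decimal character reference.
cyrillic = "Я".encode("ascii", "xmlcharrefreplace")

# An ASCII character such as "$" is left alone by the error handler,
# so its numeric reference has to be built by hand from the code point.
dollar = f"&#{int('0024', 16)};"

print(literal, cyrillic, dollar)
```

So for a code point given as hex digits, parsing the digits with int(..., 16) and formatting the reference directly is probably simpler than going through encode().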
I found a solution myself, for anybody who's interested. It differs a bit from what I asked: the output does not turn the Unicode code points into HTML entities but converts them to the corresponding characters, because that works better in my case.
So the final portion of code looks like this:
# Example of an input string containing Unicode code points,
# formatted the way they appear in my input file.
# Raw string: in a normal literal, "\u+0024" is a truncated \uXXXX escape error.
my_string = r"This is \u+0024. a string with \u+0024. random \u+0024. unicodes"
if "\\u+" in my_string:
    unicode_code = my_string.split("\\u+")[1].split('.')[0]
    unicode_to_replace = f"\\u+{unicode_code}."
    unicode = f"\\u{unicode_code}"
    # Where the escape sequence is converted to the actual character
    html_entity = unicode.encode('utf-8').decode('raw-unicode-escape')
    my_string = my_string.replace(unicode_to_replace, html_entity)
print(my_string)
# Prints: This is $ a string with $ random $ unicodes
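Note that the code above reads only the first code point and replaces every occurrence of that one value, so it works when all the codes in the string are the same. A minimal sketch of one way to handle several different codes without regular expressions, assuming each code is written as the marker, hex digits, and a trailing dot (the function name replace_code_points and its marker parameter are made up for this illustration):

```python
def replace_code_points(text, marker="\\u+"):
    """Replace every '<marker>XXXX.' with the character for code point XXXX."""
    head, *chunks = text.split(marker)
    out = [head]
    for chunk in chunks:
        # Split "0024. rest of text" into the hex digits and the remainder.
        hex_digits, _dot, rest = chunk.partition(".")
        try:
            out.append(chr(int(hex_digits, 16)) + rest)
        except (ValueError, OverflowError):
            # Not a usable hex code: put the original text back unchanged.
            out.append(marker + chunk)
    return "".join(out)


sample = r"This is \u+0024. a string with \u+042F. random \u+0024. unicodes"
print(replace_code_points(sample))
```

Each chunk is handled on its own, so different code points in the same string are converted independently.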
I'd prefer using regular expression operations (the re module). The pattern variable covers all valid Unicode code points (see e.g. U+042F in place of the middle U+0024) and all syntactic versions of the input string: the input in the original question was edited three times (with/without a leading backslash and/or a trailing dot). It also covers the my_string literal from the OP's self-answer, which fails when written with single backslashes in a normal string: '\u+0024' raises the truncated \uXXXX escape error. The script:
import re

def UPlusHtml(matchobj):
    # Turn a match such as "\u+0024." or "U+042f" into "&#x0024;" or "&#x042f;"
    return re.sub(r"^\\?[uU]\+", '&#x',
                  re.sub(r'\.$', '', matchobj.group(0))) + ';'

def UPlusRepl(matchobj):
    # Turn the same match into the character itself, e.g. "$" or "Я"
    return chr(int(re.sub(r"^\\?[uU]\+", '',
                          re.sub(r'\.$', '', matchobj.group(0))), 16))

# Optional backslash, "u+" or "U+", hex digits, optional trailing dot
pattern = r"(\\?[uU]\+[0-9a-fA-F]+\.?)"

input_string = "This is U+0024. a string with U+042f random U+0024. unicode"
print(input_string)
print(re.sub(pattern, UPlusHtml, input_string))
print(re.sub(pattern, UPlusRepl, input_string))
print('--')
my_string = "This is \\u+0024. a string with \\u+042F random \\u+0024. unicodes"
print(my_string)
print(re.sub(pattern, UPlusHtml, my_string))
print(re.sub(pattern, UPlusRepl, my_string))
Output: \\SO\\67105976.py
This is U+0024. a string with U+042f random U+0024. unicode
This is &#x0024; a string with &#x042f; random &#x0024; unicode
This is $ a string with Я random $ unicode
--
This is \u+0024. a string with \u+042F random \u+0024. unicodes
This is &#x0024; a string with &#x042F; random &#x0024; unicodes
This is $ a string with Я random $ unicodes
Please note that I'm a regex beginner myself, so I have no doubt that a more efficient regex-based solution exists.
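One possible more compact variant, offered only as a sketch: capture the hex digits in a group so that a single substitution pass is enough, with no nested re.sub calls. The names CODE_POINT, to_entities and to_chars are invented for this example, and the limit of 1 to 6 hex digits is an assumption about the input rather than something stated in the question.

```python
import re

# Optional backslash, "u+" or "U+", 1-6 hex digits (captured), optional dot.
CODE_POINT = re.compile(r"\\?[uU]\+([0-9a-fA-F]{1,6})\.?")


def to_entities(text):
    """Replace each code point marker with a decimal character reference."""
    return CODE_POINT.sub(lambda m: f"&#{int(m.group(1), 16)};", text)


def to_chars(text):
    """Replace each code point marker with the character itself."""
    return CODE_POINT.sub(lambda m: chr(int(m.group(1), 16)), text)


sample = "This is \\u+0024. a string with U+042f random \\u+0024. unicode"
print(to_entities(sample))
print(to_chars(sample))
```

This version emits decimal references (for example &#36;) instead of hexadecimal ones, and it does not check that the value is at most 0x10FFFF, so to_chars would raise ValueError for larger numbers.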