How to replace escaped unicode characters with proper unicode characters?

I have a string like this:

'https://www.jobtestprep.co.uk/media/24543/xnumber-series-big-1.png,qanchor\\u003dcenter,amode\\u003dcrop,awidth\\u003d473,aheight\\u003d352,arnd\\u003d131255524960000000.pagespeed.ic.YolXsWmhs0.png'

I need to replace an arbitrary escaped unicode character ('\\uXXXX') with its equivalent unescaped unicode character ('\uXXXX'). I have a regex that extracts all of the necessary parts (the '\\uXXXX' part and the 'XXXX' part for re.sub()), but I can't find a way to replace the right part with '\u{}', as Python raises a Unicode error and wants a pre-filled escape such as '\u003d'. Using raw strings doesn't work, as '\u{}' is just converted back into '\\u{}' and we end up back where we started.

Is there a way to do this? If you want an example of the code, you can have a look at it here:

# Data loaded from a https://www.google.com/search image search.
import re
import urllib.request

# url_request is the prepared search request, built elsewhere.
results_source = urllib.request.urlopen(url_request).read().decode()
searched_results = re.findall(r"(?<=,\"ou\":\")[^\s]+[\w](?=\",\"ow\")", results_source)

# Attempt for the result at index i; this is the part that does not work.
for count, unicode in enumerate(re.findall(r"(?<=\\u)....", searched_results[i])):
    searched_results[i] = re.sub(re.findall(r"\\u....", searched_results[i])[count], r"\u{}".format(unicode), searched_results[i])

searched_results is a list of the results returned. An example of an item in the list would be the string given above.

Your regex extracts JSON strings from a webpage:

searched_results = re.findall(r"(?<=,\"ou\":\")[^\s]+[\w](?=\",\"ow\")", results_source)

Those " characters you removed were actually significant. The \uxxxx escape syntax here is specific to JSON (and JavaScript); it is closely related to Python's own escape syntax, but different (not by much, but it matters when you have non-BMP codepoints).
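To make the non-BMP difference concrete, here is a small sketch. The input is a made-up example (U+1F600 written JSON-style as a surrogate pair), not data from the question, and the variable names are only illustrative.

```python
import json

# Hypothetical input: U+1F600 written JSON-style as a surrogate pair.
escaped = r"\ud83d\ude00"

# The JSON decoder joins the two halves into a single code point.
via_json = json.loads('"' + escaped + '"')

# Python's unicode_escape codec decodes each \uXXXX on its own,
# leaving two separate surrogate code points.
via_escape = escaped.encode("ascii").decode("unicode_escape")

print(len(via_json), len(via_escape))
```

The JSON route yields one character, while the unicode_escape route yields two lone surrogates that cannot be encoded to UTF-8 as they stand.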

You can trivially decode them as JSON if you keep the quotes in there:

searched_results = list(map(json.loads, re.findall(r"(?<=,\"ou\":)\"[^\s]+[\w]\"(?=,\"ow\")", results_source)))
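As a self-contained illustration of that line, the sketch below runs the same pattern against a short stand-in for the page source. The sample text and the example.com URL are assumptions made up for the demonstration; the real page will contain far more data.

```python
import json
import re

# Hypothetical stand-in for the downloaded page source (raw string, so
# \u003d stays as a literal backslash sequence, as it does in the page).
results_source = (
    r'{"oh":352,"ou":"https://example.com/a.png,qanchor\u003dcenter","ow":473}'
)

# Same pattern as above, but the surrounding quotes are part of the match,
# so every match is a complete JSON string literal.
pattern = r'(?<=,"ou":)"[^\s]+[\w]"(?=,"ow")'
searched_results = [json.loads(m) for m in re.findall(pattern, results_source)]

print(searched_results)
```

Because each match is a valid JSON string, json.loads handles every escape that JSON allows, not just \u003d.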

Better still would be to use an HTML library to parse the page. When using BeautifulSoup, you can get the data with:

import json
from bs4 import BeautifulSoup

soup = BeautifulSoup(results_source, 'html.parser')
search_results = [json.loads(t.text)['ou'] for t in soup.select('.rg_meta')]

This loads the text contents of each <div class="rg_meta" ...> element as JSON data, and extracts the ou key from each of the resulting dictionaries. No regular expressions required.
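If BeautifulSoup is not installed, the same idea can be sketched with the standard library's html.parser instead. This is a swapped-in technique, not the code from the answer: the RgMetaCollector class and the sample markup are hypothetical, and the sketch assumes the rg_meta divs contain no nested div elements.

```python
import json
from html.parser import HTMLParser


class RgMetaCollector(HTMLParser):
    """Collect the text of every <div> whose class list includes rg_meta."""

    def __init__(self):
        super().__init__()
        self._inside = False
        self._buffer = []
        self.blobs = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "div" and "rg_meta" in classes:
            self._inside = True
            self._buffer = []

    def handle_data(self, data):
        if self._inside:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "div" and self._inside:
            self.blobs.append("".join(self._buffer))
            self._inside = False


# Hypothetical markup standing in for the real page.
sample_html = (
    '<div class="rg_meta notranslate">'
    r'{"oh":352,"ou":"https://example.com/a.png,qanchor\u003dcenter","ow":473}'
    "</div>"
)

collector = RgMetaCollector()
collector.feed(sample_html)
collector.close()
search_results = [json.loads(blob)["ou"] for blob in collector.blobs]
print(search_results)
```

BeautifulSoup remains the more robust choice for real pages; this only shows that the JSON decoding step is what removes the escapes.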

You can do it this way:

>>> url = (
...    'https://www.jobtestprep.co.uk/media/24543/xnumber-series-'
...    'big-1.png,qanchor\\u003dcenter,amode\\u003dcrop,awidth\\u003d473,'
...    'aheight\\u003d352,arnd\\u003d131255524960000000.pagespeed.ic.YolXsWmhs0.png'
... )
>>> url = url.encode('utf-8').decode('unicode_escape')
>>> print(url)
https://www.jobtestprep.co.uk/media/24543/xnumber-series-big-1.png,qanchor=center,amode=crop,awidth=473,aheight=352,arnd=131255524960000000.pagespeed.ic.YolXsWmhs0.png
>>>
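One caveat with this approach: unicode_escape decodes bytes as Latin-1, so any non-ASCII character already in the string comes out mangled. Below is a minimal sketch of an alternative that touches only the \uXXXX sequences. The helper name unescape_bmp and the sample string are made up for illustration, and the helper does not join surrogate pairs. It also shows why "\u{}".format(...) cannot work: a \uXXXX escape in a literal is resolved when the source is compiled, whereas chr() builds the character at run time.

```python
import re


def unescape_bmp(text):
    """Replace each literal \\uXXXX sequence with the character it names."""
    return re.sub(
        r"\\u([0-9a-fA-F]{4})",
        lambda match: chr(int(match.group(1), 16)),
        text,
    )


# Hypothetical sample: a real 'é' followed by an escaped '='.
sample = "caf\u00e9,qanchor\\u003dcenter"

fixed = unescape_bmp(sample)
mangled = sample.encode("utf-8").decode("unicode_escape")

print(fixed)
print(mangled)
```

For pure-ASCII input such as the URL in the question, both routes give the same result.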

