How do I replace escaped unicode characters with the proper unicode characters?

Question

I have a string like this:

'https://www.jobtestprep.co.uk/media/24543/xnumber-series-big-1.png,qanchor\\u003dcenter,amode\\u003dcrop,awidth\\u003d473,aheight\\u003d352,arnd\\u003d131255524960000000.pagespeed.ic.YolXsWmhs0.png'

I need to replace an arbitrary escaped unicode character ('\\uXXXX') with its equivalent unescaped unicode character ('\uXXXX'). I have a regex that extracts all of the necessary parts (the '\\uXXXX' part and the 'XXXX' part for re.sub()), but I can't find a way to build the replacement from '\u{}', because Python raises a Unicode error and wants a fully specified character such as '\u003d'. Using raw strings doesn't work either, as '\u{}' is just converted back into '\\u{}' and we end up back where we started.

Is there a way to do this? If you want an example of the code, you can have a look at it here:

# data loaded from a https://www.google.com/search image search

results_source = urllib.request.urlopen(url_request).read().decode()
searched_results = re.findall(r"(?<=,\"ou\":\")[^\s]+[\w](?=\",\"ow\")", results_source)

for count, unicode in enumerate(re.findall(r"(?<=\\u)....", searched_results[i])):
    searched_results[i] = re.sub(re.findall(r"\\u....", searched_results[i])[count], r"\u{}".format(unicode), searched_results[i])

searched_results is the list of results returned. An example of an item in the list is the string given above.

Answer 1

Your regex extracts JSON strings from a webpage:

searched_results = re.findall(r"(?<=,\"ou\":\")[^\s]+[\w](?=\",\"ow\")", results_source)

Those " characters you removed were actually significant. The \uxxxx escape syntax here is specific to JSON (and JavaScript); it is closely related to Python's own escape syntax but not identical (the difference is small, but it matters when you have non-BMP codepoints).

You can trivially decode them as JSON if you keep the quotes in there:

searched_results = map(json.loads, re.findall(r"(?<=,\"ou\":)\"[^\s]+[\w]\"(?=,\"ow\")", results_source))
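To see why keeping the quotes matters, here is a minimal sketch on a single string. The sample value is a shortened version of the URL from the question, and the sketch assumes the captured text contains no unescaped double quotes, so wrapping it in quotes yields a valid JSON string literal. The second part illustrates the non-BMP case: JSON writes such characters as a surrogate pair, and json.loads joins the pair into one code point.

```python
import json

# Escaped text as it appears in the page source
# (a literal backslash followed by u003d).
raw = 'big-1.png,qanchor\\u003dcenter,amode\\u003dcrop'

# Re-adding the surrounding double quotes makes it a JSON string literal.
decoded = json.loads('"' + raw + '"')
print(decoded)  # big-1.png,qanchor=center,amode=crop

# A non-BMP character is written in JSON as a surrogate pair;
# json.loads combines the two halves into a single code point.
emoji = json.loads('"\\ud83d\\ude00"')
print(len(emoji), hex(ord(emoji)))  # 1 0x1f600
```

Python's own \uXXXX escape does not combine surrogate pairs this way, which is the difference mentioned above.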

Better still would be to use an HTML library to parse the page. When using BeautifulSoup, you can get the data with:

import json
from bs4 import BeautifulSoup

soup = BeautifulSoup(results_source, 'html.parser')
search_results = [json.loads(t.text)['ou'] for t in soup.select('.rg_meta')]

This loads the text contents of each <div class="rg_meta" ...> element as JSON data, and extracts the ou key from each of the resulting dictionaries. No regular expressions required.
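If BeautifulSoup is not installed, roughly the same extraction can be sketched with the standard library's html.parser instead. This is a swapped-in technique, not the answer's method. The class name RgMetaParser and the sample HTML are made up for illustration, and the only assumption about the real page is the one stated above: each rg_meta div holds a JSON object with an ou key.

```python
import json
from html.parser import HTMLParser


class RgMetaParser(HTMLParser):
    """Collect the text content of every <div class="rg_meta"> element."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # > 0 while inside an rg_meta div
        self.chunks = []  # text pieces of the div currently being read
        self.texts = []   # finished text of each rg_meta div

    def handle_starttag(self, tag, attrs):
        if tag != 'div':
            return
        classes = (dict(attrs).get('class') or '').split()
        # Track nesting so an inner <div> does not end the capture early.
        if self.depth or 'rg_meta' in classes:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == 'div' and self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.texts.append(''.join(self.chunks))
                self.chunks = []

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


# Hypothetical page fragment, for illustration only.
sample_html = (
    '<div class="rg_meta notranslate">'
    '{"ou":"https://example.com/a.png,qanchor\\u003dcenter","ow":473}'
    '</div>'
)

parser = RgMetaParser()
parser.feed(sample_html)
parser.close()

search_results = [json.loads(text)['ou'] for text in parser.texts]
print(search_results)  # ['https://example.com/a.png,qanchor=center']
```

The JSON decoder handles the \u003d escape as part of loading the object, so no separate unescaping step is needed.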

Answer 2

You can do it this way.

>>> url = (
...    'https://www.jobtestprep.co.uk/media/24543/xnumber-series-'
...    'big-1.png,qanchor\\u003dcenter,amode\\u003dcrop,awidth\\u003d473,'
...    'aheight\\u003d352,arnd\\u003d131255524960000000.pagespeed.ic.YolXsWmhs0.png'
... )
>>> url = url.encode('utf-8').decode('unicode_escape')
>>> print(url)
https://www.jobtestprep.co.uk/media/24543/xnumber-series-big-1.png,qanchor=center,amode=crop,awidth=473,aheight=352,arnd=131255524960000000.pagespeed.ic.YolXsWmhs0.png
>>>
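One caveat with the unicode_escape approach: that codec treats every byte that is not part of an escape as Latin-1, so any non-ASCII character already present in the string gets mangled by the UTF-8 round trip. The sketch below shows the effect and a possible alternative based on re.sub with a replacement function, which is closer to what the question was attempting. The helper name unescape_u and the sample string are hypothetical. The helper converts each \uXXXX escape on its own, so it does not join surrogate pairs for non-BMP characters.

```python
import re


def unescape_u(text):
    """Replace each literal \\uXXXX sequence with the character it names."""
    return re.sub(
        r'\\u([0-9a-fA-F]{4})',
        # Build the character from the four hex digits that were captured.
        lambda m: chr(int(m.group(1), 16)),
        text,
    )


s = 'café,qanchor\\u003dcenter'

print(unescape_u(s))                               # café,qanchor=center
print(s.encode('utf-8').decode('unicode_escape'))  # cafÃ©,qanchor=center
```

Passing a function as the replacement avoids having to write an incomplete '\u{}' literal, which is what triggered the Unicode error in the question.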

Answer 1: 1, accepted, 2018-07-08 14:40:03

Answer 2: 0, 2018-07-08 14:12:33
