Weird characters in a string fetched from an API won't decode

Question

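I'm writing a program that fetches data from an API and stores it in my own database. The problem is that some strings contain some form of character code where the quotation marks should be. On closer inspection it looks like the hex code for a quotation mark, but it is double-escaped, which trips up every decoder I have tried. I believe the string arrives as ASCII, and no other characters cause problems.

I know I could simply replace the specific character codes with the actual characters, but I need to catch cases like this in the future. If these are hex codes, I need to comb the string for them and replace them programmatically.

I have tried: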

clean_val = unicodedata.normalize('NFKD', val).encode('latin1').decode('utf8')

I'm pretty confused about the whole thing.

response = session.get(url)
if response.status_code == requests.codes.ok:
    print(response.content)

b'{"Description":"American Assets Trust, Inc. (the \\\u0093company\\\u0094) is a full service, vertically ..."}'

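I think the strings are stored in their database in a form like \\“ to satisfy some SQL escaping convention. By the time I receive them, the escaping backslashes are mixed in with the character codes, which messes up the decoding.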

Answer 1

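It looks like these characters come from text that was encoded as cp1252. U+0093 and U+0094 are C1 control characters in Unicode, but bytes 0x93 and 0x94 are the left and right curly double quotes in cp1252. They can be decoded like this: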

>>> import json
>>> bs = b'{"Description":"American Assets Trust, Inc. (the \\u0093company\\u0094) is a full service, vertically ..."}'
>>> d = json.loads(bs)
>>> s = d['Description']
>>> decoded = s.encode('latin-1').decode('cp1252')
>>> decoded
'American Assets Trust, Inc. (the “company”) is a full service, vertically ...'
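
To catch this programmatically in future responses, one option is to rewrite only the C1 control characters rather than re-encoding the whole string, since `s.encode('latin-1')` raises an error as soon as the text contains anything above U+00FF. The following is a minimal sketch under that assumption; `fix_c1_controls` and the sample payload are hypothetical names and data used for illustration, not part of the API in the question.

```python
import json
import re

# C1 control characters (U+0080..U+009F) rarely appear in real text; they
# usually mean cp1252 bytes were mislabeled as Latin-1 somewhere upstream.
_C1_CONTROLS = re.compile('[\x80-\x9f]')


def _fix_char(match):
    """Reinterpret a single C1 control character as a cp1252 byte."""
    ch = match.group(0)
    try:
        return ch.encode('latin-1').decode('cp1252')
    except UnicodeDecodeError:
        # 0x81, 0x8d, 0x8f, 0x90 and 0x9d are undefined in cp1252: keep as-is.
        return ch


def fix_c1_controls(text):
    """Replace C1 control characters with their cp1252 equivalents."""
    return _C1_CONTROLS.sub(_fix_char, text)


# Hypothetical payload: bad quotes next to characters that are already fine.
raw = b'{"Description":"(the \\u0093company\\u0094) caf\\u00e9 \\u20ac5"}'
description = json.loads(raw)['Description']
print(fix_c1_controls(description))
```

Because only the range U+0080 to U+009F is touched, characters that were already decoded correctly pass through unchanged.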

If you want plain ASCII quotes instead, you have to replace them manually, using str.replace or str.translate:

>>> table = str.maketrans('“”', '""')
>>> decoded = s.encode('latin-1').decode('cp1252')
>>> decoded.translate(table)
'American Assets Trust, Inc. (the "company") is a full service, vertically ...'
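
If other typographic characters turn up later, the same `str.translate` approach can be extended with a dictionary-based table. This is only a sketch: the set of characters below is an assumption about what might appear, and `to_ascii_punctuation` is a hypothetical helper name.

```python
# Map common typographic punctuation to ASCII stand-ins. The selection is
# an assumption; extend it as new characters show up in the data.
ASCII_PUNCTUATION = str.maketrans({
    '\u2018': "'",    # left single quotation mark
    '\u2019': "'",    # right single quotation mark
    '\u201c': '"',    # left double quotation mark
    '\u201d': '"',    # right double quotation mark
    '\u2013': '-',    # en dash
    '\u2014': '-',    # em dash
    '\u2026': '...',  # horizontal ellipsis
})


def to_ascii_punctuation(text):
    """Return text with typographic punctuation replaced by ASCII."""
    return text.translate(ASCII_PUNCTUATION)


print(to_ascii_punctuation('(the \u201ccompany\u201d) is a full service\u2026'))
```

A dictionary passed to `str.maketrans` may map one character to a multi-character string, which the two-argument form used above cannot do.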

Solution 1
0 2019-03-02 18:08:02