PySpark dataframe from a Python dictionary (JSON) obtained from requests is read as a corrupt record, encoding issue

Question

I am using the requests library to make a REST API call.

response = requests.get("https://urltomaketheapicall", headers={'authorization': 'bearer {0}'.format("7777777777777777777777777777")}, timeout=5)

When I call response.json()

I get keys with values like this:

{'devices': '....iPhone\xa05S, iPhone\xa06, iPhone\xa06\xa0Plus, iPhone\xa06S'}

When I do print(response.encoding) I get None.

When I do print(type(data['devices'])) I get <class 'str'>.

If I do print(data['devices']) I get the string without the special characters: '....iPhone 5S, iPhone 6, iPhone 6 Plus, iPhone 6S'.

Now if I do

new_dict={}
new_val = data["devices"]
new_dict["devices"] = new_val
print(new_dict["devices"])

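I get the special characters in the new dictionary as well.

Any ideas?

I want to get rid of the special characters because I need to read this JSON into a PySpark dataframe, and with these characters I get a _corrupt_record.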

rd= spark.sparkContext.parallelize([data])
df = spark.read.json(rd)

I would like to avoid solutions like .replace("\\xa0"," ")

Answer 1

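\xa0 is a non-breaking space. It is simply part of the string. It only shows up as an escape because you are dumping the repr of the whole dict. If you print the individual string, it prints as a proper non-breaking space: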

>>> print({'a': '\xa0'})
{'a': '\xa0'}
>>> print('\xa0')
 
>>>
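
One thing that may be worth checking, as a sketch rather than a confirmed diagnosis: spark.read.json expects JSON text, while parallelize([data]) hands it a Python dict. My understanding is that PySpark falls back to str() for non-string RDD elements, which would give the same repr with \xa0 escapes as above, and \x escapes are not valid JSON. Serializing with json.dumps first avoids that without any replace call. The sample data value and the names as_repr, as_json, and round_tripped below are made up for illustration, and the Spark line is left as a comment because it assumes an existing SparkSession named spark.

```python
import json

# Illustrative stand-in for the dict returned by response.json()
data = {"devices": "iPhone\xa05S, iPhone\xa06"}

# str() of a dict is its repr: single quotes plus a \xa0 escape,
# which a JSON parser does not accept.
as_repr = str(data)

# json.dumps emits valid JSON; by default non-ASCII characters
# are written as \uXXXX escapes.
as_json = json.dumps(data)

# Parsing the JSON text gives back the original string, NBSP included.
round_tripped = json.loads(as_json)

print(as_repr)
print(as_json)
print(round_tripped == data)

# Assumption: with an existing SparkSession named `spark`, pass the JSON
# text (not the dict) to the reader:
# df = spark.read.json(spark.sparkContext.parallelize([as_json]))
```

The non-breaking space is still in the data after this; it is only the serialization that changes, so nothing has to be stripped or replaced.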