How to change the coding for python array?

I use the following code to scrape a table from a Chinese website. It works fine, but the contents I store in the list do not seem to be displayed properly.

import requests
from bs4 import BeautifulSoup
import pandas as pd

x = requests.get('http://www.sohu.com/a/79780904_126549')
bs = BeautifulSoup(x.text,'lxml')

clg_list = []

for tr in bs.find_all('tr'):
    # Collect and print the text of every cell in this table row
    for td in tr.find_all('td'):
        clg_list.append(td.text)
        print(td.text)

When I print the text, it shows Chinese characters. But when I print out the list, it shows escape sequences such as '\u4e00\u671f\uff0834\u6240\uff09'. I am not sure whether I should change the encoding or whether something else is wrong.

There is nothing wrong in this case.

When you print a Python list, Python calls repr on each of the list's elements. In Python 2, the repr of a unicode string shows the Unicode code points of the characters that make up the string, written as \uXXXX escapes.

>>> c = clg_list[0]
>>> c # Ask the interpreter to display the repr of c
u'\u201c985\u201d\u5de5\u7a0b\u5927\u5b66\u540d\u5355\uff08\u622a\u6b62\u52302011\u5e743\u670831\u65e5\uff09'

However, if you print the string itself, Python encodes the unicode string with a text encoding (for example, UTF-8), and your terminal displays the characters that the encoded bytes represent.

>>> print c
“985”工程大学名单(截止到2011年3月31日)

Note that in Python 3, printing the list shows the Chinese characters as you expect, because the repr of a Python 3 string keeps printable non-ASCII characters instead of escaping them.
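The difference between the two views can be reproduced with a small sketch in Python 3. The sample strings and the `show` helper are made up for illustration and are not part of the question's code; `ascii()` is used here because it produces roughly what the Python 2 repr produced (without the `u` prefix). It assumes a terminal that can display UTF-8.

```python
# Hypothetical sample data standing in for two scraped table cells
items = ['\u4e00\u671f\uff0834\u6240\uff09', '\u201c985\u201d']


def show(strings):
    """Join the strings for display, so each one is rendered as text, not as its repr."""
    return ', '.join(strings)


# ascii() escapes every non-ASCII character, similar to the Python 2 repr
escaped = ascii(items[0])
print(escaped)      # '\u4e00\u671f\uff0834\u6240\uff09'

# Joining the elements prints the characters themselves
print(show(items))  # 一期(34所), “985”

# In Python 3, printing the list directly is readable as well
print(items)        # ['一期(34所)', '“985”']
```

In Python 2, the same idea applies: printing the elements one by one, or joining them into a single unicode string before printing, displays the characters rather than the escapes. The list contents are identical either way; only the displayed representation differs.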
