Using BeautifulSoup to extract foreign characters (i.e. Chinese) from HTML?

Question

I have a text file of 1,000+ URLs, each linking to a journal entry. Some of these entries contain Chinese or Japanese characters.

I would like to save each entry using BeautifulSoup. However, I cannot figure out how encoding and decoding work in this situation. I've browsed Stack Overflow for help, but I can only find examples in which the string itself is already known and assigned to a variable.

Since I am scraping from a list of URLs, I do not know what strings I will find until I collect them.

This is what I have so far:

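import requests
from bs4 import BeautifulSoup

# data_src: path to the text file of URLs, one per line
with open(data_src) as f:
    url = f.readlines()[419].strip()  # one sample URL, trailing newline removed
resp = requests.get(url)
raw_text = resp.text  # a str, already decoded by requests
soup = BeautifulSoup(raw_text, 'html.parser')
for s in soup.find_all('script'):
    s.replace_with('')  # strip <script> elements
entry = soup.select('div#body_show_ori')[0]
print(entry.text.encode('utf-8'))  # encodes the str to bytes before printing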

This is the string that prints:

b'\n\xe6\x88\x91\xe7\xbb\x88\xe4\xba\x8e\xe5\x88\xb0\xe4\xba\x86\xe4\xb8\xad\xe5\x9b\xbd\xe5\x8e\xa6\xe9\x97\xa8\xe3\x80\x82\xe6\x88\x91\xe8\xa7\x89\xe5\xbe\x97\xe8\xbf\x99\xe9\x87\x8c\xe5\xbe\x88\xe7\x83\xad\xe5\xbe\x88\xe6\xbd\xae\xe6\xb9\xbf\xe3\x80\x82\xe7\x8e\xb0\xe5\x9c\xa8\xe6\x88\x91\xe6\xb2\xa1\xe6\x9c\x89\xe6\x9c\x8b\xe5\x8f\x8b\xe8\x80\x8c\xe4\xb8\x94\xe8\xbf\x99\xe4\xb8\xaa\xe5\x9c\xb0\xe6\x96\xb9\xe6\x88\x91\xe4\xb8\x8d\xe7\x86\x9f\xe6\x82\x89\xe3\x80\x82\xe4\xb8\x8d\xe6\x95\xa2\xe5\x87\xba\xe5\x8e\xbb\xe5\xa4\x96\xe9\x9d\xa2\xe3\x80\x82\xe3\x80\x82\xe3\x80\x82\xe5\xa5\xbd\xe6\x97\xa0\xe8\x81\x8a\xe3\x80\x82\xe3\x80\x82\xe3\x80\x82\n'

This is where I'm stuck; I'm trying to figure out how to decode the string from here.

Answer 1

Try decoding the data before passing it to BeautifulSoup.
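
A minimal sketch of that idea, reusing the `div#body_show_ori` selector from the question. The helper name `extract_entry_text`, the default `utf-8` encoding, and the inline sample page are assumptions made for illustration; with `requests`, the raw bytes would come from `resp.content`, and the right encoding depends on the site.

```python
from bs4 import BeautifulSoup


def extract_entry_text(raw_bytes, encoding="utf-8"):
    """Decode the raw bytes first, then hand the resulting str to BeautifulSoup.

    `encoding` is an assumption here; use whatever the pages actually declare.
    """
    html = raw_bytes.decode(encoding)           # bytes -> str, done once and explicitly
    soup = BeautifulSoup(html, "html.parser")   # input is already a str
    for script in soup.find_all("script"):
        script.decompose()                      # drop <script> elements entirely
    return soup.select("div#body_show_ori")[0].get_text()


# Hypothetical stand-in for resp.content, so the sketch runs without a network call
sample = (
    '<html><body><script>var x = 1;</script>'
    '<div id="body_show_ori">\n我终于到了中国厦门。\n</div></body></html>'
).encode("utf-8")

entry_text = extract_entry_text(sample)
print(entry_text)  # print the str itself; adding .encode() here is what yields b'\xe6...'
```

In the scraping loop this would be called as `extract_entry_text(resp.content)` for each URL.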

If I recall correctly, if you pass it a unicode object, it will not decode it again.
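
To make the bytes versus str distinction concrete: the output shown in the question is the correct text, displayed as UTF-8 bytes only because of the explicit `.encode('utf-8')` inside `print`. The sketch below decodes the first sentence of that output back to a str and saves it. The temporary directory and the file name `entry_419.txt` are hypothetical.

```python
import os
import tempfile

# First sentence of the bytes printed in the question
printed = (
    b'\n\xe6\x88\x91\xe7\xbb\x88\xe4\xba\x8e\xe5\x88\xb0\xe4\xba\x86'
    b'\xe4\xb8\xad\xe5\x9b\xbd\xe5\x8e\xa6\xe9\x97\xa8\xe3\x80\x82'
)

text = printed.decode("utf-8")  # bytes -> str
print(text.strip())             # 我终于到了中国厦门。

# To save an entry, open the file with an explicit encoding and write the str as-is
path = os.path.join(tempfile.mkdtemp(), "entry_419.txt")  # hypothetical location
with open(path, "w", encoding="utf-8") as out:
    out.write(text)
```

In the question's code, `entry.text` is already a str, so printing or writing it directly, without `.encode('utf-8')`, should be enough.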

(Answer 1 posted 2015-11-24 23:28:07; score: 0)
