foreign characters (e.g. Chinese) from HTML using BeautifulSoup?

I have a text file of 1,000+ URLs, with each URL linking to a journal entry of text. Some of these entries contain Chinese or Japanese characters.

I would like to save each entry using BeautifulSoup. However, I cannot figure out how encoding and decoding work in this situation. I've browsed Stack Overflow for help, but I can only find cases in which the string itself is known and set as a variable.

However, given that I am scraping from a list of URLs, I do not know what strings I will find until I collect them.

This is what I have so far:

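import requests
from bs4 import BeautifulSoup

with open(data_src) as f:
    url = f.readlines()[419].strip()  # drop the trailing newline from the URL

resp = requests.get(url)
raw_text = resp.text  # already a str, decoded by requests
soup = BeautifulSoup(raw_text, 'html.parser')
for s in soup.find_all('script'):  # remove <script> tags
    s.decompose()
entry = soup.select('div#body_show_ori')[0]
print(entry.text.encode('utf-8'))  # encodes the str to bytes, so a bytes literal is printed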

This is the string that prints:

b'\n\xe6\x88\x91\xe7\xbb\x88\xe4\xba\x8e\xe5\x88\xb0\xe4\xba\x86\xe4\xb8\xad\xe5\x9b\xbd\xe5\x8e\xa6\xe9\x97\xa8\xe3\x80\x82\xe6\x88\x91\xe8\xa7\x89\xe5\xbe\x97\xe8\xbf\x99\xe9\x87\x8c\xe5\xbe\x88\xe7\x83\xad\xe5\xbe\x88\xe6\xbd\xae\xe6\xb9\xbf\xe3\x80\x82\xe7\x8e\xb0\xe5\x9c\xa8\xe6\x88\x91\xe6\xb2\xa1\xe6\x9c\x89\xe6\x9c\x8b\xe5\x8f\x8b\xe8\x80\x8c\xe4\xb8\x94\xe8\xbf\x99\xe4\xb8\xaa\xe5\x9c\xb0\xe6\x96\xb9\xe6\x88\x91\xe4\xb8\x8d\xe7\x86\x9f\xe6\x82\x89\xe3\x80\x82\xe4\xb8\x8d\xe6\x95\xa2\xe5\x87\xba\xe5\x8e\xbb\xe5\xa4\x96\xe9\x9d\xa2\xe3\x80\x82\xe3\x80\x82\xe3\x80\x82\xe5\xa5\xbd\xe6\x97\xa0\xe8\x81\x8a\xe3\x80\x82\xe3\x80\x82\xe3\x80\x82\n'

This is where I'm stuck; I'm trying to figure out how to decode the string from here.

Try decoding the data before passing it to BeautifulSoup.
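
To illustrate what decoding means here, the sketch below takes the first few bytes of the output shown in the question and turns them back into text. It assumes the page is UTF-8; the names raw and text are only for this example.

```python
# First bytes of the output printed in the question (assumed to be UTF-8).
raw = b'\n\xe6\x88\x91\xe7\xbb\x88\xe4\xba\x8e\xe5\x88\xb0\xe4\xba\x86\xe4\xb8\xad\xe5\x9b\xbd\xe5\x8e\xa6\xe9\x97\xa8\xe3\x80\x82'

# bytes -> str: this is the decoding step.
text = raw.decode('utf-8')
print(text.strip())  # shows the readable characters instead of \x.. escapes

# str -> bytes: encoding goes the other way and round-trips exactly.
assert text.encode('utf-8') == raw
```

With requests, the raw bytes of a page are available as resp.content, so resp.content.decode('utf-8') would give the str to hand to BeautifulSoup, provided the page really is UTF-8.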

If I recall correctly, if you pass it a Unicode string (a str in Python 3), it will not try to decode it again.
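
A minimal sketch of that approach, assuming the pages are UTF-8 and that each entry lives in div#body_show_ori as in the question. The entry_text helper and the inline sample page are made up for this example; in the real script the bytes would come from the downloaded response.

```python
from bs4 import BeautifulSoup


def entry_text(html_bytes, encoding='utf-8'):
    """Decode the raw bytes first, then hand a str to BeautifulSoup."""
    html = html_bytes.decode(encoding, errors='replace')
    soup = BeautifulSoup(html, 'html.parser')
    # Drop <script> tags so their contents do not end up in the text.
    for script in soup.find_all('script'):
        script.decompose()
    entry = soup.select_one('div#body_show_ori')
    # get_text() returns a str, so no further encode/decode is needed.
    return entry.get_text() if entry is not None else ''


# Hypothetical stand-in for one downloaded page.
sample = ('<html><body><script>var x = 1;</script>'
          '<div id="body_show_ori">\n我终于到了中国厦门。\n</div>'
          '</body></html>').encode('utf-8')

text = entry_text(sample)
print(text)  # printed as text, not as a b'...' literal
```

The returned value is already a str, so it can be printed or written to a file opened with encoding='utf-8' without calling .encode() on it.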
