How to decode foreign characters (e.g. Chinese) from HTML using BeautifulSoup?

I have a text file of 1,000+ URLs, with each URL linking to a journal entry of text. Some of these entries contain Chinese or Japanese characters.

I would like to save each entry using BeautifulSoup. However, I cannot figure out how encoding and decoding work in this situation. I've browsed Stack Overflow for help, but I can only find cases in which the string itself is known in advance and assigned to a variable.

However, given that I am scraping from a list of URLs, I do not know what strings I will find until I collect them.

This is what I have so far:

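import requests
from bs4 import BeautifulSoup

with open(data_src) as f:
  url = f.readlines()[419].strip()  # one URL per line; drop the trailing newline
resp = requests.get(url)
raw_text = resp.text  # already decoded by requests using its own encoding guess
soup = BeautifulSoup(raw_text, 'html.parser')
for s in soup.find_all('script'):
  s.extract()  # loop body was missing in the post; assumed to strip <script> tags
entry = soup.select('div#body_show_ori')[0]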

This is the string that prints:


This is where I'm stuck; I'm trying to figure out how to decode the string from here.

Try decoding the response bytes yourself before passing the data to BeautifulSoup.

IIRC, if you pass a unicode object, BeautifulSoup will not decode it again.
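
A minimal sketch of that idea, assuming the pages are either UTF-8 or GB18030. The helper name decode_html and the candidate list are illustrative and not part of requests or BeautifulSoup. Note that a strict decode that succeeds does not prove the guess was right, because many byte sequences are valid in more than one encoding. Order the candidates by what you expect to see, and for Japanese pages pass a different list such as ("utf-8", "shift_jis").

```python
# Hypothetical helper: the name and the default candidate list are assumptions.
def decode_html(raw: bytes, candidates=("utf-8", "gb18030")) -> str:
    """Decode raw page bytes to str before they are handed to an HTML parser."""
    for encoding in candidates:
        try:
            # Strict decoding raises on byte sequences invalid for this encoding.
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # No candidate matched: decode anyway, marking bad bytes with U+FFFD.
    return raw.decode("utf-8", errors="replace")


# Simulated page body; it stands in for resp.content from requests.
page = "<div id='body_show_ori'>中文日记</div>".encode("gb18030")
text = decode_html(page)  # UTF-8 fails on these bytes, GB18030 succeeds
```

With requests, the undecoded bytes are in resp.content, so the call in the question would become BeautifulSoup(decode_html(resp.content), 'html.parser'). resp.text, by contrast, has already been decoded by requests using its own guess at the encoding.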

