Python 2.7: Trouble Encoding to UTF-8

I have a dataframe that has a column, _text , containing the text of an article. I'm trying to get the length of the article for each row in my dataframe. Here's my attempt:

from bs4 import BeautifulSoup
result_df['_text'] = [BeautifulSoup(text, "lxml").get_text() for text in result_df['_text']]

text_word_length = [len(str(x).split(" ")) for x in result_df['_text']]

Unfortunately, I get this error:

---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
<ipython-input-8-f6c8ab83a46f> in <module>()
----> 1 text_word_length = [len(str(x).split(" ")) for x in result_df['_text']]

UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 231: ordinal not in range(128)

Seems like I should be specifying "utf-8" somewhere, I'm just not sure where...

Thanks!

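I assume you are using Python 2 and that your input text contains non-ASCII characters. The problem arises at str(x): when x is a unicode string, str(x) by default ends up calling x.encode('ascii'), which fails on the first character outside the ASCII range (here u'\u2019', a right single quotation mark).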
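A minimal sketch of that failure, using a made-up string (the name `sample` is only for illustration, it is not from the question). It calls `.encode("ascii")` explicitly, which is what `str(x)` does implicitly on a unicode object in Python 2, so it should raise the same kind of error:

```python
# Hypothetical sample text containing U+2019 (right single quotation mark).
sample = u"It\u2019s"

error = None
try:
    # Explicit version of what str(sample) does implicitly in Python 2.
    sample.encode("ascii")
except UnicodeEncodeError as exc:
    error = exc
    print(exc)
```

The exception object also records which codec failed and where (`error.encoding`, `error.start`), which helps locate the offending character in a long article.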

You have 2 ways to solve this problem:

  1. correctly encode the unicode string in utf-8:

     text_word_length = [len(x.encode('utf-8').split(" ")) for x in result_df['_text']]

  2. split the string as unicode:

     text_word_length = [len(x.split(u" ")) for x in result_df['_text']]
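Both options can be tried on a small stand-in for the `_text` column. The list `texts` below is hypothetical sample data, not the asker's dataframe, and the first option is written with `b" "` so the same lines also run on Python 3, where bytes cannot be split on a text separator:

```python
# Hypothetical stand-in for result_df['_text']: unicode strings with non-ASCII characters.
texts = [u"It\u2019s a short article", u"na\u00efve caf\u00e9 talk"]

# Option 1: encode to UTF-8 bytes first, then split on a byte-string space.
lengths_encoded = [len(t.encode("utf-8").split(b" ")) for t in texts]

# Option 2: stay in unicode and split on a unicode space.
lengths_unicode = [len(t.split(u" ")) for t in texts]

print(lengths_encoded)
print(lengths_unicode)
```

For a word count the two give the same numbers, since a space is a single byte in UTF-8. Option 2 avoids the extra encode step and keeps the data as unicode for any later processing.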

According to the official Python documentation (PEP 263):

To define a source code encoding, a magic comment must be placed into the source files either as first or second line in the file, such as:

# coding=<encoding name>

or (using formats recognized by popular editors):

#!/usr/bin/python
# -*- coding: <encoding name> -*-

or:

#!/usr/bin/python
# vim: set fileencoding=<encoding name> :
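As a rough illustration, here is what such a declaration looks like in a small file that contains a non-ASCII literal (the variable `quote` is made up for this sketch). Keep in mind that the declaration only tells the interpreter how to decode the source file itself; as far as I know it does not change the implicit ASCII codec that str() uses on unicode objects in Python 2, so it would not remove the error in the question by itself:

```python
# -*- coding: utf-8 -*-
# With the declaration above, Python 2 accepts the non-ASCII character in the
# literal below; without it, Python 2 rejects the file with a SyntaxError.
quote = u"It’s"

# The curly apostrophe is decoded to the single code point U+2019.
print(repr(quote))
print(len(quote))
```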
