
Python 2.7: Trouble Encoding to UTF-8

I have a dataframe with a column, _text, containing the text of an article. I'm trying to get the length of the article, in words, for each row in my dataframe. Here's my attempt:

from bs4 import BeautifulSoup
result_df['_text'] = [BeautifulSoup(text, "lxml").get_text() for text in result_df['_text']]

text_word_length = [len(str(x).split(" ")) for x in result_df['_text']]

Unfortunately, I get this error:

    ---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
<ipython-input-8-f6c8ab83a46f> in <module>()
----> 1 text_word_length = [len(str(x).split(" ")) for x in result_df['_text']]

UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 231: ordinal not in range(128)

Seems like I should be specifying "utf-8" somewhere, I'm just not sure where...

Thanks!

I assume that you are using Python 2 and that your input text contains non-ASCII characters. The problem arises at str(x): when x is a unicode string, str(x) encodes it with the default codec, which is ASCII, so it is equivalent to x.encode('ascii'). That call fails on any character outside the ASCII range, such as u'\u2019'.

There are two ways to solve this problem:

  1. Explicitly encode the unicode string as UTF-8 before splitting:

     text_word_length = [len(x.encode('utf-8').split(" ")) for x in result_df['_text']] 
  2. Skip the conversion and split the string as unicode:

     text_word_length = [len(x.split(u" ")) for x in result_df['_text']] 
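
As a rough illustration of the two options above, here is a minimal sketch that does not need pandas. The sample strings and the helper names (word_count_unicode, word_count_utf8) are made up for this example, and the explicit .encode("ascii") call stands in for what str(x) does implicitly on Python 2. The code is written so that it should behave the same on Python 2.7 and Python 3.

```python
# Hypothetical sample data: the first string contains U+2019
# (RIGHT SINGLE QUOTATION MARK), the character from the traceback.
texts = [u"It\u2019s a short article", u"plain ascii text"]


def word_count_unicode(text):
    """Option 2: split the unicode string directly, no encoding step."""
    return len(text.split(u" "))


def word_count_utf8(text):
    """Option 1: encode to UTF-8 bytes first, then split on a byte space."""
    return len(text.encode("utf-8").split(b" "))


# Stand-in for the implicit conversion done by str(x) on Python 2:
# encoding a non-ASCII character with the ASCII codec raises.
try:
    texts[0].encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

counts_unicode = [word_count_unicode(t) for t in texts]
counts_utf8 = [word_count_utf8(t) for t in texts]
```

Both helpers return the same word counts here, because the space character is a single byte in UTF-8. Splitting the unicode string directly avoids the extra encoding step, so it is probably the simpler choice when only the word count is needed.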

According to the official Python documentation (Python Official Site):

To define a source code encoding, a magic comment must be placed into the source files either as first or second line in the file, such as:

# coding=<encoding name>

or (using formats recognized by popular editors):

#!/usr/bin/python
# -*- coding: <encoding name> -*-

or:

#!/usr/bin/python
# vim: set fileencoding=<encoding name> :
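
For illustration, here is a minimal sketch of a source file that uses such a declaration, assuming the file is saved as UTF-8. The variable names are made up. Note that the declaration only tells the interpreter how to decode literals in the source file itself. It does not change the codec that str() uses at runtime, so on its own it would probably not remove the UnicodeEncodeError shown in the question.

```python
# -*- coding: utf-8 -*-
# The declaration above describes the encoding of this source file,
# which makes it possible to type non-ASCII characters in literals.

# RIGHT SINGLE QUOTATION MARK typed directly in the source.
quote = u"’"

# The literal is the same character as the escaped form U+2019.
same_char = (quote == u"\u2019")

# Encoding explicitly to UTF-8 yields the three-byte sequence E2 80 99.
utf8_bytes = quote.encode("utf-8")
```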
