
How to fix a Python pandas encoding issue?

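I imported a CSV table into a Jupyter notebook, and an error occurred when I tried to use iloc on the video views column (К-ть переглядів).

I need to convert this column to an integer type (using .astype()), but it raises an error:

ValueError: invalid literal for int() with base 10: '380\xa0891\xa0555'

Can anyone tell me what is wrong?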


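The character \xa0 is a non-breaking space ( chr(160) ), used here as a thousands separator. int() cannot parse it inside a number, so remove it with str.replace before converting.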

>>> df['A']
0    380 891 555
Name: A, dtype: object

>>> df['A'].dtype.name
'object'

>>> df['A'].astype(int)
ValueError: invalid literal for int() with base 10: '380\xa0891\xa0555'

>>> df['A'].str.replace(chr(160), '').astype(int)
0    380891555
Name: A, dtype: int64
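
If the column may contain other kinds of whitespace as well (ordinary spaces, narrow no-break spaces, and so on), a regular expression can strip all of them in one pass. The following is a minimal sketch, not part of the original answer; the helper name `to_int_column` and the sample data are made up for illustration.

```python
import pandas as pd


def to_int_column(series: pd.Series) -> pd.Series:
    """Remove every whitespace character, then convert the values to int64.

    Hypothetical helper for illustration. In Python's `re` module, `\\s`
    applied to a str pattern also matches the non-breaking space (U+00A0).
    """
    cleaned = series.astype(str).str.replace(r"\s+", "", regex=True)
    return cleaned.astype("int64")


# Sample data (assumed): numbers that use U+00A0 as a thousands separator.
df = pd.DataFrame({"A": ["380\xa0891\xa0555", "1\xa0234", "42"]})
df["A"] = to_int_column(df["A"])
print(df["A"])
```

This assumes every value in the column is a whole number once the whitespace is removed; empty cells or decimal values would still make astype fail and need separate handling.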

Please see the updated code, based on another correct answer.

import pandas as pd

# 1. Create a list of candidate encodings.

encoding_list = ['ascii', 'big5', 'big5hkscs', 'cp037', 'cp273', 'cp424', 'cp437', 'cp500', 'cp720', 'cp737', 'cp775', 'cp850', 'cp852', 'cp855', 'cp856', 'cp857', 'cp858', 'cp860', 'cp861', 'cp862', 'cp863', 'cp864', 'cp865', 'cp866', 'cp869', 'cp874', 'cp875', 'cp932', 'cp949', 'cp950', 'cp1006', 'cp1026', 'cp1125', 'cp1140', 'cp1250', 'cp1251', 'cp1252', 'cp1253', 'cp1254', 'cp1255', 'cp1256', 'cp1257', 'cp1258', 'euc_jp', 'euc_jis_2004', 'euc_jisx0213', 'euc_kr', 
                 'gb2312', 'gbk', 'gb18030', 'hz', 'iso2022_jp', 'iso2022_jp_1', 'iso2022_jp_2', 'iso2022_jp_2004', 'iso2022_jp_3', 'iso2022_jp_ext', 'iso2022_kr', 'latin_1', 'iso8859_2', 'iso8859_3', 'iso8859_4', 'iso8859_5', 'iso8859_6', 'iso8859_7', 'iso8859_8', 'iso8859_9', 'iso8859_10', 'iso8859_11', 'iso8859_13', 'iso8859_14', 'iso8859_15', 'iso8859_16', 'johab', 'koi8_r', 'koi8_t', 
                 'koi8_u', 'kz1048', 'mac_cyrillic', 'mac_greek', 'mac_iceland', 'mac_latin2', 'mac_roman', 'mac_turkish', 'ptcp154', 'shift_jis', 'shift_jis_2004', 'shift_jisx0213', 'utf_32', 'utf_32_be', 'utf_32_le', 'utf_16', 'utf_16_be', 'utf_16_le', 'utf_7', 'utf_8', 'utf_8_sig']

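# 2. Loop over the candidates and report only the encodings that work.
# `path` is assumed to hold the location of the CSV file.

for encoding in encoding_list:
    try:
        df = pd.read_csv(path, encoding=encoding, nrows=5, sep=';')
    except Exception:
        # This encoding cannot decode or parse the file; try the next one.
        continue
    # Report the encoding only if the first column contains actual data.
    if df.iloc[:, 0].notna().sum() > 0:
        print(f'Encoding that works: << {encoding} >>')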
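
The same loop can be wrapped in a function that returns the working encodings instead of printing them. This is a sketch under stated assumptions: the function name `find_working_encodings`, the temporary demo file, and its contents are all invented for illustration and are not from the original answer.

```python
import os
import tempfile

import pandas as pd


def find_working_encodings(path, encodings, sep=";", nrows=5):
    """Return the encodings with which pandas can read the first rows of `path`.

    Hypothetical helper: an encoding counts as working if read_csv succeeds
    and the first column holds at least one non-missing value.
    """
    working = []
    for encoding in encodings:
        try:
            df = pd.read_csv(path, encoding=encoding, nrows=nrows, sep=sep)
        except (LookupError, ValueError):
            # LookupError: unknown codec; ValueError covers decoding and
            # parsing failures (UnicodeDecodeError, ParserError, ...).
            continue
        if df.shape[1] > 0 and df.iloc[:, 0].notna().any():
            working.append(encoding)
    return working


# Demo file (assumed content): a small cp1251-encoded CSV with Cyrillic headers.
with tempfile.NamedTemporaryFile("wb", suffix=".csv", delete=False) as tmp:
    tmp.write("Назва;К-ть переглядів\nвідео;380\xa0891\xa0555\n".encode("cp1251"))
    demo_path = tmp.name

candidates = find_working_encodings(demo_path, ["ascii", "utf_8", "cp1251"])
os.remove(demo_path)
print(candidates)
```

Note that a successful read does not prove the encoding is correct: single-byte codecs such as latin_1 accept any byte sequence, so the decoded text should still be checked by eye.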

Display the DataFrame.



 