
Why do NaN values appear incorrectly when creating a new column in a dataframe?

In Python 3 with pandas, I have this dataframe:

eleitos_d_doadores.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 47490 entries, 0 to 47489
Data columns (total 21 columns):
uf_x                          47490 non-null object
partido_eleicao_x             47490 non-null object
cargo_x                       47490 non-null object
nome_completo_x               47490 non-null object
cpf                           47490 non-null object
cpf_cnpj_doador               47490 non-null object
nome_doador                   47490 non-null object
valor                         47490 non-null object
tipo_receita                  47490 non-null object
fonte_recurso                 47490 non-null object
especie_recurso               47490 non-null object
cpf_cnpj_doador_originario    47490 non-null object
nome_doador_originario        47490 non-null object
tipo_doador_originario        47490 non-null object
Unnamed: 0                    47490 non-null int64
uf_y                          47490 non-null object
cargo_y                       47490 non-null object
nome_completo_y               47490 non-null object
nome_urna                     47490 non-null object
partido_eleicao_y             47490 non-null object
situacao                      47490 non-null object
dtypes: int64(1), object(20)
memory usage: 8.0+ MB

I used this command to create a new column containing the first eight characters of the column "cpf_cnpj_doador":

eleitos_d_doadores['cnpj_raiz_doador'] = eleitos_d_doadores.cpf_cnpj_doador.str[:8]

This correctly truncated many of the rows: "01888360712" became "01888360".

But many rows were not truncated correctly. Instead, the expected value was replaced with NaN: "50844182000155" became NaN (the correct value here would be "50844182").

Does anyone know where the NaN values come from?

Here are the commands I wrote to create the columns. I then selected a portion of the data that contains both incorrect and correct results:

eleitos_d_doadores['cnpj_raiz_doador'] = eleitos_d_doadores.cpf_cnpj_doador.str[:8]

eleitos_d_doadores['cnpj_raiz_doador_originario'] = eleitos_d_doadores.cpf_cnpj_doador_originario.str[:8]

eleitos_d_doadores.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 47490 entries, 0 to 47489
Data columns (total 23 columns):
uf_x                           47490 non-null object
partido_eleicao_x              47490 non-null object
cargo_x                        47490 non-null object
nome_completo_x                47490 non-null object
cpf                            47490 non-null object
cpf_cnpj_doador                47490 non-null object
nome_doador                    47490 non-null object
valor                          47490 non-null object
tipo_receita                   47490 non-null object
fonte_recurso                  47490 non-null object
especie_recurso                47490 non-null object
cpf_cnpj_doador_originario     47490 non-null object
nome_doador_originario         47490 non-null object
tipo_doador_originario         47490 non-null object
Unnamed: 0                     47490 non-null int64
uf_y                           47490 non-null object
cargo_y                        47490 non-null object
nome_completo_y                47490 non-null object
nome_urna                      47490 non-null object
partido_eleicao_y              47490 non-null object
situacao                       47490 non-null object
cnpj_raiz_doador               3488 non-null object
cnpj_raiz_doador_originario    47490 non-null object
dtypes: int64(1), object(22)
memory usage: 8.7+ MB

nome = eleitos_d_doadores[(eleitos_d_doadores['nome_completo_x'] == 'JULIO CESAR DELGADO')]

nome.loc[:, ['cpf_cnpj_doador', 'cnpj_raiz_doador']]

    cpf_cnpj_doador     cnpj_raiz_doador
7390    1421697000137   NaN
7391    1421697000137   NaN
7392    1421697000137   NaN
7393    1421697000137   NaN
7394    56993900000131  NaN
7395    26198515000484  NaN
7396    26198515000484  NaN
7397    20574428000155  NaN
7398    12082605000158  NaN
7399    60892403000114  NaN
7400    17469701000177  NaN
7401    66460080000176  NaN
7402    21561725000129  NaN
7403    50844182000155  NaN
7404    3940864000181   NaN
7405    3940864000181   NaN
7406    3940864000181   NaN
7407    3940864000181   NaN
7408    3940864000181   NaN
7409    3940864000181   NaN
7410    3940864000181   NaN
7411    00697656691     00697656
7412    03776208660     03776208
7413    16760808649     NaN
7414    17153081000162  NaN
7415    20573722000142  NaN
7416    20573722000142  NaN
7417    20573722000142  NaN
7418    20573722000142  NaN
7419    20592604000181  NaN
7420    20573722000142  NaN
7421    15102288000182  NaN
7422    33131541000108  NaN
7423    20575279000149  NaN
7424    20575492000150  NaN

nome.loc[:, ['cpf_cnpj_doador_originario', 'cnpj_raiz_doador_originario']]
    cpf_cnpj_doador_originario  cnpj_raiz_doador_originario
7390    17262213000194  17262213
7391    90400888000142  90400888
7392    16639904000100  16639904
7393    00447821000170  00447821
7394    #NULO   #NULO
7395    #NULO   #NULO
7396    #NULO   #NULO
7397    38105195100     38105195
7398    #NULO   #NULO
7399    #NULO   #NULO
7400    #NULO   #NULO
7401    #NULO   #NULO
7402    #NULO   #NULO
7403    #NULO   #NULO
7404    61186888000193  61186888
7405    15102288000182  15102288
7406    92693118000160  92693118
7407    92693118000160  92693118
7408    02125403000192  02125403
7409    33000092000169  33000092
7410    07052569000140  07052569
7411    #NULO   #NULO
7412    #NULO   #NULO
7413    #NULO   #NULO
7414    #NULO   #NULO
7415    03349915000103  03349915
7416    17463456000190  17463456
7417    71077747000196  71077747
7418    03349915000103  03349915
7419    04899037000154  04899037
7420    06142647000134  06142647
7421    #NULO   #NULO
7422    #NULO   #NULO
7423    04641376000136  04641376
7424    08250286634     08250286
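
A possible starting point for diagnosing this, offered as a sketch rather than a confirmed explanation: `info()` reports `cpf_cnpj_doador` as `object`, and an object column can hold a mix of Python strings and integers. The `.str` accessor returns NaN for any element that is not a string. The series below is a toy stand-in built from values in the output above. That some of them are stored as integers is an assumption, suggested by the 13-digit values that look like they lost a leading zero.

```python
import pandas as pd

# Toy stand-in for the real column. Assumption: values with leading
# zeros were read as str, the others as int.
coluna = pd.Series(
    ["00697656691", "03776208660", 50844182000155, 16760808649],
    dtype=object,
)

# Count how many elements of each Python type the column holds.
tipos = coluna.map(lambda v: type(v).__name__).value_counts()
print(tipos)

# Slicing through .str leaves NaN wherever the element is not a str.
raiz = coluna.str[:8]
print(raiz)
```

Running the same type count on `eleitos_d_doadores['cpf_cnpj_doador']` would show whether the real column is mixed in this way.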

You can use the pandas.DataFrame.dropna method to drop the rows whose value in a given column is NaN (see the pandas documentation). Call it on your dataframe instance, here named df, and replace 'ColumnToCheck' with the column to inspect:

df.dropna(subset=['ColumnToCheck'], how='all', inplace=True)
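
Note that `dropna` discards the affected rows instead of recovering the truncated value. If the NaN values come from non-string elements in the column (an assumption, not confirmed by the question), converting the column to `str` before slicing may be closer to what was intended. A minimal sketch on a toy dataframe, with sample values taken from the question:

```python
import pandas as pd

# Toy dataframe. Assumption: the real column mixes str and int values.
df = pd.DataFrame(
    {"cpf_cnpj_doador": ["00697656691", 50844182000155, 1421697000137]},
    dtype=object,
)

# Convert every element to str first, then take the first 8 characters.
como_texto = df["cpf_cnpj_doador"].astype(str)
df["cnpj_raiz_doador"] = como_texto.str[:8]

print(df)
```

Values that had already lost a leading zero, such as the 13-digit `1421697000137`, would still yield a shifted prefix. Padding them with `str.zfill` could help, but the right width (11 for a CPF, 14 for a CNPJ) has to be decided per row.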
