Why do NaN values appear incorrectly when creating a new column in a DataFrame?
In Python 3 with pandas, I have this DataFrame:
eleitos_d_doadores.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 47490 entries, 0 to 47489
Data columns (total 21 columns):
uf_x 47490 non-null object
partido_eleicao_x 47490 non-null object
cargo_x 47490 non-null object
nome_completo_x 47490 non-null object
cpf 47490 non-null object
cpf_cnpj_doador 47490 non-null object
nome_doador 47490 non-null object
valor 47490 non-null object
tipo_receita 47490 non-null object
fonte_recurso 47490 non-null object
especie_recurso 47490 non-null object
cpf_cnpj_doador_originario 47490 non-null object
nome_doador_originario 47490 non-null object
tipo_doador_originario 47490 non-null object
Unnamed: 0 47490 non-null int64
uf_y 47490 non-null object
cargo_y 47490 non-null object
nome_completo_y 47490 non-null object
nome_urna 47490 non-null object
partido_eleicao_y 47490 non-null object
situacao 47490 non-null object
dtypes: int64(1), object(20)
memory usage: 8.0+ MB
I used this command to create a new column containing the first eight characters of the column "cpf_cnpj_doador":
eleitos_d_doadores['cnpj_raiz_doador'] = eleitos_d_doadores.cpf_cnpj_doador.str[:8]
This correctly truncated many of the rows: "01888360712" became "01888360".
However, many rows were not truncated correctly. Instead, the expected value was replaced with NaN: "50844182000155" became NaN (the correct value would be "50844182").
Does anyone know where the NaN values come from?
Here are the commands I wrote to create the columns. I then selected a portion of the data that contains both incorrect and correct results.
eleitos_d_doadores['cnpj_raiz_doador'] = eleitos_d_doadores.cpf_cnpj_doador.str[:8]
eleitos_d_doadores['cnpj_raiz_doador_originario'] = eleitos_d_doadores.cpf_cnpj_doador_originario.str[:8]
eleitos_d_doadores.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 47490 entries, 0 to 47489
Data columns (total 23 columns):
uf_x 47490 non-null object
partido_eleicao_x 47490 non-null object
cargo_x 47490 non-null object
nome_completo_x 47490 non-null object
cpf 47490 non-null object
cpf_cnpj_doador 47490 non-null object
nome_doador 47490 non-null object
valor 47490 non-null object
tipo_receita 47490 non-null object
fonte_recurso 47490 non-null object
especie_recurso 47490 non-null object
cpf_cnpj_doador_originario 47490 non-null object
nome_doador_originario 47490 non-null object
tipo_doador_originario 47490 non-null object
Unnamed: 0 47490 non-null int64
uf_y 47490 non-null object
cargo_y 47490 non-null object
nome_completo_y 47490 non-null object
nome_urna 47490 non-null object
partido_eleicao_y 47490 non-null object
situacao 47490 non-null object
cnpj_raiz_doador 3488 non-null object
cnpj_raiz_doador_originario 47490 non-null object
dtypes: int64(1), object(22)
memory usage: 8.7+ MB
nome = eleitos_d_doadores[(eleitos_d_doadores['nome_completo_x'] == 'JULIO CESAR DELGADO')]
nome.loc[:, ['cpf_cnpj_doador', 'cnpj_raiz_doador']]
cpf_cnpj_doador cnpj_raiz_doador
7390 1421697000137 NaN
7391 1421697000137 NaN
7392 1421697000137 NaN
7393 1421697000137 NaN
7394 56993900000131 NaN
7395 26198515000484 NaN
7396 26198515000484 NaN
7397 20574428000155 NaN
7398 12082605000158 NaN
7399 60892403000114 NaN
7400 17469701000177 NaN
7401 66460080000176 NaN
7402 21561725000129 NaN
7403 50844182000155 NaN
7404 3940864000181 NaN
7405 3940864000181 NaN
7406 3940864000181 NaN
7407 3940864000181 NaN
7408 3940864000181 NaN
7409 3940864000181 NaN
7410 3940864000181 NaN
7411 00697656691 00697656
7412 03776208660 03776208
7413 16760808649 NaN
7414 17153081000162 NaN
7415 20573722000142 NaN
7416 20573722000142 NaN
7417 20573722000142 NaN
7418 20573722000142 NaN
7419 20592604000181 NaN
7420 20573722000142 NaN
7421 15102288000182 NaN
7422 33131541000108 NaN
7423 20575279000149 NaN
7424 20575492000150 NaN
nome.loc[:, ['cpf_cnpj_doador_originario', 'cnpj_raiz_doador_originario']]
cpf_cnpj_doador_originario cnpj_raiz_doador_originario
7390 17262213000194 17262213
7391 90400888000142 90400888
7392 16639904000100 16639904
7393 00447821000170 00447821
7394 #NULO #NULO
7395 #NULO #NULO
7396 #NULO #NULO
7397 38105195100 38105195
7398 #NULO #NULO
7399 #NULO #NULO
7400 #NULO #NULO
7401 #NULO #NULO
7402 #NULO #NULO
7403 #NULO #NULO
7404 61186888000193 61186888
7405 15102288000182 15102288
7406 92693118000160 92693118
7407 92693118000160 92693118
7408 02125403000192 02125403
7409 33000092000169 33000092
7410 07052569000140 07052569
7411 #NULO #NULO
7412 #NULO #NULO
7413 #NULO #NULO
7414 #NULO #NULO
7415 03349915000103 03349915
7416 17463456000190 17463456
7417 71077747000196 71077747
7418 03349915000103 03349915
7419 04899037000154 04899037
7420 06142647000134 06142647
7421 #NULO #NULO
7422 #NULO #NULO
7423 04641376000136 04641376
7424 08250286634 08250286
You can use the pandas.DataFrame.dropna method to remove the rows that contain NaN values in a given column (here df is your DataFrame and 'ColumnToCheck' is a placeholder for the column name):
df.dropna(subset=['ColumnToCheck'], how='all', inplace=True)
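Dropping rows removes the symptom but not the cause. Judging only from the output in the question (the raw data is not shown, so this is an assumption), `cpf_cnpj_doador` may be an object column that mixes strings and integers: the values that kept their leading zeros, such as "00697656691", were sliced correctly, while the values without leading zeros came back as NaN. The `.str` accessor returns NaN for any element that is not a string. The sketch below uses a small made-up DataFrame (`df` and its three values are illustrative, not the asker's data) to show one way to check the stored types and to slice after casting everything to `str`.

```python
import pandas as pd

# Hypothetical sample: an object column holding one string and two integers.
df = pd.DataFrame(
    {"cpf_cnpj_doador": ["01888360712", 50844182000155, 1421697000137]}
)

# .str only handles str elements; the integer rows become NaN.
naive = df["cpf_cnpj_doador"].str[:8]

# Count the Python types actually stored in the column.
type_counts = df["cpf_cnpj_doador"].map(lambda v: type(v).__name__).value_counts()

# Cast every element to str first, then take the first eight characters.
df["cnpj_raiz_doador"] = df["cpf_cnpj_doador"].astype(str).str[:8]

print(naive)
print(type_counts)
print(df)
```

If the type count confirms the mix, note that integers have already lost any leading zeros, so the first eight characters of a value such as 1421697000137 may not match the real CNPJ root. If the data comes from a file, re-reading it with the identifier columns forced to strings (for example `dtype=str` in `pandas.read_csv`) would likely be more robust than casting afterwards.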