Sending Pandas Dataframe with Int64 type to GCP Spanner INT64 column
I am using Pandas DataFrames. I have a column from a CSV that contains integers mixed with nulls.
I am trying to convert this column and insert it into Spanner in as generalizable a way as possible (so I can reuse the same code for future jobs), which limits my ability to use sentinel values. However, DataFrames cannot hold NaN in a pure int column, so you have to use the nullable Int64 dtype. When I try to insert this into Spanner, I get an error saying the value is not an int64 type, whereas pure Python ints do work. Is there an automatic way to convert Pandas Int64 values to int values during the insert? Converting the column before inserting doesn't work either, again because of the null values. Is there another way around this?
Trying to convert from a Series goes like so:
>>> import numpy as np
>>> import pandas as pd
>>> s2 = pd.Series([3.0, 5.0])
>>> s2
0    3.0
1    5.0
dtype: float64
>>> s1 = pd.Series([3.0, None])
>>> s1
0    3.0
1    NaN
dtype: float64
>>> df = pd.DataFrame(data=[s1, s2], dtype=np.int64)
>>> df
   0    1
0  3  NaN
1  3  5.0
>>> df = pd.DataFrame(data={"nullable": s1, "nonnullable": s2}, dtype=np.int64)
This last command produces the error:
ValueError: Cannot convert non-finite values (NA or inf) to integer
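For comparison, here is a minimal sketch of the difference between the NumPy `np.int64` dtype and the nullable `"Int64"` extension dtype mentioned in the question. It only uses the `s1` Series from above; the variable names `numpy_cast_failed` and `nullable` are illustrative, not part of the original post.

```python
import numpy as np
import pandas as pd

s1 = pd.Series([3.0, None])  # float64 Series holding a NaN

# Casting to the NumPy integer dtype fails because NaN has no integer form.
try:
    s1.astype(np.int64)
    numpy_cast_failed = False
except ValueError:
    numpy_cast_failed = True

# The nullable extension dtype (capital "I") stores the gap as pd.NA instead.
nullable = s1.astype("Int64")
```

The resulting Series has dtype `Int64` and still carries a missing value, which is what later has to be turned into `None` before the insert.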
I was unable to reproduce your issue; everything seems to work as expected.
Is it possible you have a non-nullable column that you are writing null values to?
# Check the column types and nullability of the target table.
from google.cloud import spanner

client = spanner.Client()
database = client.instance('testinstance').database('testdatabase')
table_name = 'inttable'
query = f'''
SELECT
    t.column_name,
    t.spanner_type,
    t.is_nullable
FROM
    information_schema.columns AS t
WHERE
    t.table_name = '{table_name}'
'''
with database.snapshot() as snapshot:
    print(list(snapshot.execute_sql(query)))
# [['nonnullable', 'INT64', 'NO'], ['nullable', 'INT64', 'YES']]
# Insert DataFrame rows into the table.
from google.cloud import spanner
import numpy as np
import pandas as pd

client = spanner.Client()
instance = client.instance('testinstance')
database = instance.database('testdatabase')


def insert(df):
    # Write every row of df to 'inttable' in a single batch.
    with database.batch() as batch:
        batch.insert(
            table='inttable',
            columns=('nonnullable', 'nullable'),
            values=df.values.tolist()
        )


print("Succeeds in inserting int rows.")
d = {'nonnullable': [1, 2], 'nullable': [3, 4]}
df = pd.DataFrame(data=d, dtype=np.int64)
insert(df)

print("Succeeds in inserting rows with None in nullable columns.")
d = {'nonnullable': [3, 4], 'nullable': [None, 6]}
df = pd.DataFrame(data=d, dtype=np.int64)
insert(df)

print("Fails (as expected) when inserting a row with None in a non-nullable column.")
d = {'nonnullable': [5, None], 'nullable': [6, 0]}
df = pd.DataFrame(data=d, dtype=np.int64)
insert(df)
# Fails with "google.api_core.exceptions.FailedPrecondition: 400 nonnullable must not be NULL in table inttable."
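If the DataFrame uses the nullable `Int64` dtype from the question, one possible way to get rows of plain Python `int` and `None` values is to convert them just before the insert. The sketch below is only an illustration: `to_spanner_rows` is a hypothetical helper name, not part of pandas or the Spanner client, and it does not talk to Spanner itself.

```python
import pandas as pd


def to_spanner_rows(df):
    """Return df as a list of rows holding only plain Python values and None."""
    rows = []
    for record in df.itertuples(index=False, name=None):
        row = []
        for value in record:
            if pd.isna(value):
                row.append(None)          # pd.NA / NaN -> None
            elif hasattr(value, "item"):
                row.append(value.item())  # NumPy scalar -> Python scalar
            else:
                row.append(value)         # already a plain Python value
        rows.append(row)
    return rows


df = pd.DataFrame({
    "nonnullable": pd.array([3, 4], dtype="Int64"),
    "nullable": pd.array([None, 6], dtype="Int64"),
})
rows = to_spanner_rows(df)
# rows == [[3, None], [4, 6]]
```

Under that assumption, `rows` could be passed as `values=` in the `insert` function above in place of `df.values.tolist()`.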
My solution was to leave it as NaN (it turns out NaN is treated as the string 'nan'). Then, at the very end, just before inserting into the Spanner DB, I replaced all NaN with None in the DF. I used code from another SO answer: df.replace({pd.np.nan: None}). Spanner was reading the NaN as a 'nan' string and rejecting it for insertion into an INT64 column. None is treated as NULL and can be inserted into Spanner with no issue.
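A small sketch of that replacement step, for illustration only. It uses `np.nan` with an explicit `import numpy as np` rather than `pd.np.nan`, because the `pd.np` alias was removed in pandas 2.0. The sample data and the names `cleaned` and `rows` are assumptions, not taken from the original answer.

```python
import numpy as np
import pandas as pd

# A column with a missing value is read as float64, so the gap is a NaN.
df = pd.DataFrame({"nonnullable": [3, 4], "nullable": [np.nan, 6]})

# Swap every NaN for None right before building the rows to insert.
cleaned = df.replace({np.nan: None})
rows = cleaned.values.tolist()
```

Note that the non-missing values in a column that held NaN remain floats (6.0 here), since the column was float64 before the replacement.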