Sending an Int64-typed Pandas DataFrame to a GCP Spanner INT64 column

Question

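I am working with Pandas DataFrames. I have a column, loaded from a CSV, that contains integers mixed with nulls.

I am trying to convert it and insert it into Spanner in as generic a way as possible (so that future jobs can reuse the same code), which limits my ability to use sentinel values. However, a DataFrame cannot hold NaN in a plain int column, so you have to use the nullable Int64 dtype. When I try to insert that into Spanner, I get an error saying the value is not of type int64, whereas a plain Python int works. Is there an automatic way to convert Int64 Pandas values to int values during the insert? Again, because of the nulls, converting the column before inserting does not work. Is there another workaround?

Attempting the conversion from Series looks like this: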

>>> import numpy as np
>>> import pandas as pd
>>> s2 = pd.Series([3.0, 5.0])
>>> s2
0    3.0
1    5.0
dtype: float64
>>> s1 = pd.Series([3.0, None])
>>> s1
0    3.0
1    NaN
dtype: float64
>>> df = pd.DataFrame(data=[s1, s2], dtype=np.int64)
>>> df
   0    1
0  3  NaN
1  3  5.0
>>> df = pd.DataFrame(data={"nullable": s1, "nonnullable": s2}, dtype=np.int64)

This last command raises ValueError: Cannot convert non-finite values (NA or inf) to integer
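For reference, the difference between NumPy's np.int64 and the nullable Int64 dtype mentioned in the question can be shown in a few lines. This is only a minimal sketch, assuming a pandas version that ships the nullable integer extension dtype; the variable name `nullable` is illustrative.

```python
import numpy as np
import pandas as pd

# A column as it might come out of a CSV: stored as float64, [3.0, NaN].
s1 = pd.Series([3.0, None])

# NumPy's int64 has no representation for a missing value, so the cast fails.
try:
    s1.astype(np.int64)
except ValueError as exc:
    print("np.int64 cast failed:", exc)

# The capitalized "Int64" is pandas' nullable integer extension dtype,
# which keeps the missing entry as a missing value.
nullable = s1.astype("Int64")
print(nullable.dtype)
print(nullable.isna().tolist())
```

The cast to "Int64" only succeeds when every non-missing float is a whole number, which is the case for integer data read from a CSV.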

Answer 1

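I could not reproduce your problem; everything seems to work as expected for me.

Are you perhaps writing null values to a non-nullable column?

Retrieving the schema of the Spanner table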

from google.cloud import spanner

client = spanner.Client()
database = client.instance('testinstance').database('testdatabase')
table_name='inttable'

query = f'''
SELECT
t.column_name,
t.spanner_type,
t.is_nullable
FROM
information_schema.columns AS t
WHERE
t.table_name = '{table_name}'
'''

with database.snapshot() as snapshot:
    print(list(snapshot.execute_sql(query)))
    # [['nonnullable', 'INT64', 'NO'], ['nullable', 'INT64', 'YES']]
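Once the schema rows are available, the question of whether nulls are headed for a non-nullable column can be checked before any insert is attempted. The following is a rough sketch, not part of the original answer: `null_violations` is a hypothetical helper, and it assumes the schema rows have the `[column_name, spanner_type, is_nullable]` shape shown in the comment above.

```python
import pandas as pd


def null_violations(schema_rows, df):
    """Return the non-nullable column names for which df holds missing values.

    Hypothetical helper: schema_rows is assumed to be a list of
    [column_name, spanner_type, is_nullable] entries.
    """
    non_nullable = [name for name, _type, is_nullable in schema_rows
                    if is_nullable == "NO"]
    return [name for name in non_nullable
            if name in df.columns and df[name].isna().any()]


schema_rows = [['nonnullable', 'INT64', 'NO'], ['nullable', 'INT64', 'YES']]
df = pd.DataFrame({'nonnullable': [5, None], 'nullable': [6, 0]})

print(null_violations(schema_rows, df))  # ['nonnullable']
```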

Inserting from a Pandas DataFrame into Spanner

from google.cloud import spanner

import numpy as np
import pandas as pd

client = spanner.Client()
instance = client.instance('testinstance')
database = instance.database('testdatabase')


def insert(df):
    with database.batch() as batch:
        batch.insert(
            table='inttable',
            columns=(
                'nonnullable', 'nullable'),
            values=df.values.tolist()
        )

print("Succeeds in inserting int rows.")
d = {'nonnullable': [1, 2], 'nullable': [3, 4]}
df = pd.DataFrame(data=d, dtype=np.int64)
insert(df)

print("Succeeds in inserting rows with None in nullable columns.")
d = {'nonnullable': [3, 4], 'nullable': [None, 6]}
df = pd.DataFrame(data=d, dtype=np.int64)
insert(df)

print("Fails (as expected) when attempting to insert a row with None in a non-nullable column.")
d = {'nonnullable': [5, None], 'nullable': [6, 0]}
df = pd.DataFrame(data=d, dtype=np.int64)
insert(df)
# Fails with "google.api_core.exceptions.FailedPrecondition: 400 nonnullable must not be NULL in table inttable."
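Since the question asks for something generic, one possible variation on the `insert` function above derives the column names from the DataFrame and converts each cell to a plain Python value on the way out. This is a sketch under assumptions: `to_native` and `insert_frame` are hypothetical names, every cell is assumed to be a scalar, and `database` is expected to be the Spanner database object created earlier.

```python
import pandas as pd


def to_native(value):
    """Map one DataFrame cell to a plain Python value (hypothetical helper)."""
    if pd.isna(value):
        return None            # NaN, pd.NA and None all become NULL
    if hasattr(value, "item"):
        return value.item()    # NumPy scalar -> Python int/float/bool
    return value


def insert_frame(database, table, df):
    """Insert every row of df into table, using df's column names as-is."""
    rows = [[to_native(v) for v in row]
            for row in df.itertuples(index=False, name=None)]
    with database.batch() as batch:
        batch.insert(table=table, columns=tuple(df.columns), values=rows)
    return rows
```

A call such as `insert_frame(database, 'inttable', df)` would then replace the table-specific `insert(df)`. Note that a float like 3.0 is passed through as a float, so a column that should land in INT64 would need to be cast to the nullable Int64 dtype first.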

Answer 2

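My solution was to leave the values as NaN (it turns out that NaN == 'nan'). Then, at the very end, when inserting into the Spanner DB, I replaced every NaN in the DF with None. I used code from another SO answer: df.replace({pd.np.nan: None}) (pd.np was removed in pandas 2.0; np.nan from a direct NumPy import is the equivalent there). Spanner treats NaN as the string 'nan' and refuses to insert it into an INT64 column. None is treated as NULL and can be inserted into Spanner without any problem.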
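The NaN-to-None step might look roughly like the sketch below. It swaps in a different spelling from the one quoted in the answer: the frame is cast to object dtype and `DataFrame.where` fills the gaps, so that None is not coerced back to NaN. The sample data and the names `cleaned` and `rows` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"nonnullable": [1, 2], "nullable": [3.0, np.nan]})

# Cast the float column to the nullable Int64 dtype so 3.0 becomes 3
# while the missing entry stays missing.
df["nullable"] = df["nullable"].astype("Int64")

# Object dtype can hold None; where() keeps each value that is present
# and puts None wherever the original frame had a missing value.
cleaned = df.astype(object).where(df.notna(), None)

# Row-wise list of plain values, in the shape batch.insert() takes for `values`.
rows = cleaned.values.tolist()
print(rows)
```

The original `df` is left untouched, so the replacement can be applied at the last moment before the insert, as the answer describes.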

Answer 1: 0, 2019-03-26 16:48:49
Answer 2: 0, accepted, 2019-03-27 17:29:15
