to_sql() method of pandas sends the primary key column as NULL even if the column is not present in the DataFrame

I want to insert a DataFrame into a Snowflake database table. The table has an id column, which is the primary key, and an event_id column, which is a nullable integer field.

I have created a declarative_base() class using SQLAlchemy, as shown below:

class AccountUsageLoginHistory(Base):

    __tablename__ = constants.TABLE_ACCOUNT_USAGE_LOGIN_HISTORY
    __table_args__ = {
        'extend_existing':True,
        'schema' : os.environ.get('SCHEMA_NAME_AUDITS')
    }

    id = Column(Integer, Sequence('id_account_usage_login_history'), primary_key=True)
    event_id = Column(Integer, nullable=True)

The class above creates a table in the Snowflake database.

I have a DataFrame that has just one column, event_id.

When I try to insert the data using the pandas to_sql() method, Snowflake returns the error shown below:

snowflake.connector.errors.ProgrammingError: 100072 (22000): 01991f2c-0be5-c903-0000-d5e5000c6cee: NULL result in a non-nullable column

Snowflake generates this error because to_sql() is appending an id column whose value is set to NULL for every row. The call is:

dataframe.to_sql(table_name, self.engine, index=False, method=pd_writer, if_exists="append")

Consider this as case 1:

I tried to run an insert query directly against Snowflake:

insert into "SFOPT_TEST"."AUDITS"."ACCOUNT_USAGE_LOGIN_HISTORY" (ID, EVENT_ID) values(NULL, 33)

The query above returned the same error:

NULL result in a non-nullable column

This is probably what the to_sql() method is doing.

Consider this as case 2:

I also tried to insert a row by executing the query below:

insert into "SFOPT_TEST"."AUDITS"."ACCOUNT_USAGE_LOGIN_HISTORY" (EVENT_ID) values(33)

This query executed successfully: it inserted the row and also auto-generated a value for the id column.

How can I make the pandas to_sql() method behave as in case 2?

Please note that pandas.DataFrame.to_sql() has the parameter index=True by default, which means that it adds an extra column (df.index) when inserting the data.

Some databases, such as PostgreSQL, have a serial data type that fills a column with sequentially increasing numbers.

Snowflake has no serial data type, but there are other ways to handle it:

First Option: You can use the CREATE SEQUENCE statement to create a sequence directly in the database (see the official documentation on this topic). The downside of this approach is that you need to convert your DataFrame into a proper SQL statement.

Database preparation:

CREATE OR REPLACE SEQUENCE schema.my_sequence START = 1 INCREMENT = 1;
CREATE OR REPLACE TABLE schema.my_table (i bigint, b text);

You would then convert the DataFrame into a Snowflake INSERT statement and use schema.my_sequence.nextval to get the next ID value:

INSERT INTO schema.my_table VALUES
(schema.my_sequence.nextval, 'string_1'),
(schema.my_sequence.nextval, 'string_2');

The result will be:

i   b
1   string_1
2   string_2
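
The conversion from DataFrame rows to an INSERT statement of this shape could be sketched as follows. The helper name build_sequence_insert is hypothetical, and the %s placeholders assume a DB-API driver that uses the format/pyformat parameter style; treat it as an illustration under those assumptions rather than the connector's own API.

```python
from typing import List, Sequence, Tuple


def build_sequence_insert(
    table: str, sequence: str, rows: Sequence[Sequence[object]]
) -> Tuple[str, List[object]]:
    """Build one multi-row INSERT whose first column is filled from a sequence.

    `table` and `sequence` are assumed to be trusted identifiers (not user
    input); the row values are returned separately as bind parameters.
    """
    if not rows:
        raise ValueError("rows must not be empty")
    width = len(rows[0])
    if any(len(row) != width for row in rows):
        raise ValueError("all rows must have the same number of values")

    # One "(<sequence>.nextval, %s, ...)" group per row.
    items = [f"{sequence}.nextval"] + ["%s"] * width
    group = "(" + ", ".join(items) + ")"
    sql = f"INSERT INTO {table} VALUES\n" + ",\n".join([group] * len(rows))

    # Flatten the rows so the parameters line up with the placeholders.
    params = [value for row in rows for value in row]
    return sql, params


sql, params = build_sequence_insert(
    "schema.my_table", "schema.my_sequence", [("string_1",), ("string_2",)]
)
print(sql)
print(params)
```

With a DataFrame, the rows could come from df[['b']].itertuples(index=False, name=None), and the pair would then be passed to cursor.execute(sql, params).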

Please note that this approach has some limitations: every insert statement issued this way needs to succeed, because calling schema.my_sequence.nextval without inserting the value leaves gaps in the numbering. To avoid that, you can have a separate script that checks whether the current insert succeeded and, if it did not, recreates the sequence by calling:

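CREATE OR REPLACE SEQUENCE schema.my_sequence START = <max(i) + 1> INCREMENT = 1;

Here <max(i) + 1> stands for the literal value obtained from SELECT max(i) FROM schema.my_table plus one, since START takes an integer constant rather than a subquery.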
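
A repair script along these lines could read max(i) first and then format the statement with a literal start value. The function name reset_sequence_sql is hypothetical, and the sketch assumes that START must be an integer constant and that numbering should continue at max(i) + 1.

```python
from typing import Optional


def reset_sequence_sql(sequence: str, max_i: Optional[int], increment: int = 1) -> str:
    """Return a statement that recreates `sequence` to continue after `max_i`.

    `max_i` is the result of SELECT max(i) on the table; it is None when the
    table is empty, in which case the sequence restarts at 1.
    """
    start = 1 if max_i is None else max_i + 1
    return (
        f"CREATE OR REPLACE SEQUENCE {sequence} "
        f"START = {start} INCREMENT = {increment};"
    )


print(reset_sequence_sql("schema.my_sequence", 2))
```

The returned string would be executed through the same connection that ran the failed insert.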

Alternative Option: You would need to create an extra function that runs the SQL below to get the last i you inserted previously:

SELECT max(i) AS max_i FROM schema.my_table;

and then update the index of your DataFrame before running to_sql():

df.index = range(max_i+1, len(df)+max_i+1)

This ensures that your DataFrame index continues from the last i in your table. Once that is done you can use:

df.to_sql(name='my_table', con=connection_object, index=True, index_label='i', if_exists='append')

It will use your index as one of the inserted columns, which lets you keep the values of i in the table unique.
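
The index arithmetic above can be wrapped in a small helper that also covers an empty table, where max(i) comes back as NULL (None in Python). The name next_index is hypothetical and not part of pandas; this is only a sketch of the calculation.

```python
from typing import Optional


def next_index(max_i: Optional[int], n_rows: int) -> range:
    """Return the ids for `n_rows` new rows, continuing after `max_i`."""
    start = (max_i or 0) + 1  # None (empty table) is treated as 0
    return range(start, start + n_rows)


print(list(next_index(2, 3)))     # [3, 4, 5]
print(list(next_index(None, 2)))  # [1, 2]
```

It would be used as df.index = next_index(max_i, len(df)) before calling to_sql(). Two loaders running at the same time could still read the same max(i), so this only suits a single writer.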
