to_sql pyodbc count field incorrect or syntax error
I am downloading JSON data from an API and using sqlalchemy, pyodbc, and pandas' to_sql function to insert that data into a MSSQL server.
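Presumably the upload looks something like this (a minimal sketch of the setup being described, not the actual code from the question; records, the credentials, and the connection string are placeholders):

import pandas as pd
from sqlalchemy import create_engine

# placeholder connection string; the error below shows SQL Server Native Client 11.0 in use
engine = create_engine(
    "mssql+pyodbc://user:password@server/db_name"
    "?driver=SQL+Server+Native+Client+11.0"
)

# records previously downloaded from the API as a list of dicts
df = pd.DataFrame(records)

# chunksize limited to 10 to avoid the error described below
df.to_sql("TEMP_producing_entity_details", engine,
          index=False, if_exists="append", chunksize=10)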
I can download up to 10000 rows; however, I have to limit the chunksize to 10, otherwise I get the following error:
DBAPIError: (pyodbc.Error) ('07002', '[07002] [Microsoft][SQL Server Native Client 11.0]COUNT field incorrect or syntax error (0) (SQLExecDirectW)') [SQL: 'INSERT INTO [TEMP_producing_entity_details]
There are around 500 million rows to download, and it's just crawling at this speed. Any advice on a workaround?

Thanks,
At the time this question was asked, pandas 0.23.0 had just been released. That version changed the default behaviour of .to_sql() from calling the DBAPI .executemany() method to constructing a table-value constructor (TVC) that would improve upload speed by inserting multiple rows with a single .execute() call of an INSERT statement. Unfortunately, that approach often exceeded T-SQL's limit of 2100 parameter values for a stored procedure, leading to the error cited in the question.
Shortly thereafter, a subsequent release of pandas added a method= argument to .to_sql(). The default, method=None, restored the previous behaviour of using .executemany(), while specifying method="multi" would tell .to_sql() to use the newer TVC approach.
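In other words (a minimal sketch, assuming df, engine, and the target table already exist):

# default (method=None): a single-row INSERT executed once per row via the DBAPI .executemany()
df.to_sql("table_name", engine, index=False, if_exists="append")

# method="multi": multi-row table-value constructor INSERTs, subject to the 2100-parameter limit
df.to_sql("table_name", engine, index=False, if_exists="append",
          method="multi", chunksize=100)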
Around the same time, SQLAlchemy 1.3 was released and it added a fast_executemany=True argument to create_engine() which greatly improved upload speed using Microsoft's ODBC drivers for SQL Server. With that enhancement, method=None proved to be at least as fast as method="multi" while avoiding the 2100-parameter limit.
So with current versions of pandas, SQLAlchemy, and pyodbc, the best approach for using .to_sql() with Microsoft's ODBC drivers for SQL Server is to use fast_executemany=True and the default behaviour of .to_sql(), i.e.,
connection_uri = (
"mssql+pyodbc://scott:tiger^5HHH@192.168.0.199/db_name"
"?driver=ODBC+Driver+17+for+SQL+Server"
)
engine = create_engine(connection_uri, fast_executemany=True)
df.to_sql("table_name", engine, index=False, if_exists="append")
This is the recommended approach for apps running on Windows, macOS, and the Linux variants that Microsoft supports for its ODBC driver. If you need to use FreeTDS ODBC, then .to_sql() can be called with method="multi" and chunksize= as described below.
(Original answer)
Prior to pandas version 0.23.0, to_sql would generate a separate INSERT for each row in the DataFrame:
exec sp_prepexec @p1 output,N'@P1 int,@P2 nvarchar(6)',
N'INSERT INTO df_to_sql_test (id, txt) VALUES (@P1, @P2)',
0,N'row000'
exec sp_prepexec @p1 output,N'@P1 int,@P2 nvarchar(6)',
N'INSERT INTO df_to_sql_test (id, txt) VALUES (@P1, @P2)',
1,N'row001'
exec sp_prepexec @p1 output,N'@P1 int,@P2 nvarchar(6)',
N'INSERT INTO df_to_sql_test (id, txt) VALUES (@P1, @P2)',
2,N'row002'
Presumably to improve performance, pandas 0.23.0 now generates a table-value constructor to insert multiple rows per call:
exec sp_prepexec @p1 output,N'@P1 int,@P2 nvarchar(6),@P3 int,@P4 nvarchar(6),@P5 int,@P6 nvarchar(6)',
N'INSERT INTO df_to_sql_test (id, txt) VALUES (@P1, @P2), (@P3, @P4), (@P5, @P6)',
0,N'row000',1,N'row001',2,N'row002'
The problem is that SQL Server stored procedures (including system stored procedures like sp_prepexec) are limited to 2100 parameters, so if the DataFrame has 100 columns then to_sql can only insert about 20 rows at a time.
We can calculate the required chunksize using:
# df is an existing DataFrame
#
# limit based on sp_prepexec parameter count
tsql_chunksize = 2097 // len(df.columns)
# cap at 1000 (limit for number of rows inserted by table-value constructor)
tsql_chunksize = 1000 if tsql_chunksize > 1000 else tsql_chunksize
#
df.to_sql('tablename', engine, index=False, if_exists='replace',
method='multi', chunksize=tsql_chunksize)
However, the fastest approach is still likely to be:
dump the DataFrame to a CSV file (or similar), and then
have Python call the SQL Server bcp utility to upload that file into the table.
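A minimal sketch of that CSV-plus-bcp workflow, assuming the same server, database, and credentials as the connection URI above (the dbo schema, file name, and bcp switches are illustrative and may need adjusting, e.g. if your data contains commas):

import subprocess

# dump the DataFrame without header or index; bcp -c expects plain character data
df.to_csv("staging.csv", index=False, header=False)

# character mode (-c) with a comma field terminator (-t,)
subprocess.run(
    ["bcp", "db_name.dbo.table_name", "in", "staging.csv",
     "-S", "192.168.0.199", "-U", "scott", "-P", "tiger^5HHH",
     "-c", "-t,"],
    check=True,
)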
Made a few modifications based on Gord Thompson's answer. This will auto-calculate the chunksize and keep it at the largest integer number of rows that fits within the 2100-parameter limit:
import math

df_num_of_cols = len(df.columns)
chunknum = math.floor(2100/df_num_of_cols)
df.to_sql('MY_TABLE', con=engine, schema='myschema', chunksize=chunknum,
          if_exists='append', method='multi', index=False)
I don't have enough reputation to comment on Amit S's answer. I just tried that approach, with chunknum calculated as above and the method set to 'multi', and it still shows me the error:

[Microsoft][SQL Server Native Client 11.0][SQL Server]The incoming request has too many parameters. The server supports a maximum of 2100 parameters. Reduce the number of parameters and resend the request
So I just modified:
chunknum=math.floor(2100/df_num_of_cols)
to
chunknum=math.floor(2100/df_num_of_cols) - 1
It now seems to be working perfectly. I think it must be some edge case, perhaps because floor(2100/df_num_of_cols) can land right at the 2100-parameter limit while sp_prepexec needs a few parameters of its own (hence the 2097 used in the earlier answer), so leaving one row of headroom avoids the error.