
SQLAlchemy Core select query significantly slower than raw SQL

I've been running a Jupyter Notebook with Python 3 code for quite some time. It uses a combination of pyodbc and SQLAlchemy to connect to a few SQL Server databases on my company intranet. The purpose of the notebook is to pull data from an initial SQL Server database and store it in memory as a Pandas dataframe. The file then extracts specific values from one of the columns and sends that list of values through to two different SQL Server databases to pull back a mapping list.

All of this had been working great until I decided to rewrite the raw SQL queries as SQLAlchemy Core statements. I've gone through the process of validating that the SQLAlchemy queries compile to match the raw SQL queries. However, the queries run unimaginably slowly. For instance, the initial raw SQL query runs in 25 seconds, while the same query rewritten in SQLAlchemy Core takes 15 minutes! The remaining queries didn't finish even after letting them run for 2 hours.
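A minimal sketch of that validation step, assuming sql_dr2 is the Core statement built below:

from sqlalchemy.dialects import mssql

# Render the statement as T-SQL and compare it by eye to the raw query.
print(sql_dr2.compile(dialect=mssql.dialect()))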

This could have something to do with how I'm reflecting the existing tables. I even took some time to override the ForeignKey and primary_key definitions on the tables, hoping that would help improve performance. No dice.
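Roughly what that override attempt looked like; the column types here are placeholders, since reflection fills in the rest:

from sqlalchemy import Column, ForeignKey, Integer, MetaData, Table

metadata = MetaData()
# Explicitly listed columns replace the reflected ones; autoload still
# fills in every other column from the database.
bv = Table('BarVisits', metadata,
           Column('SourceID', Integer, primary_key=True),
           Column('BillingID', Integer, primary_key=True),
           autoload=True, autoload_with=engine_dr2, schema='livecsdb.dbo')
bb = Table('BarBillsUB92Claims', metadata,
           Column('SourceID', Integer,
                  ForeignKey('livecsdb.dbo.BarVisits.SourceID')),
           Column('BillingID', Integer,
                  ForeignKey('livecsdb.dbo.BarVisits.BillingID')),
           autoload=True, autoload_with=engine_dr2, schema='livecsdb.dbo')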

I also know "if it ain't broke, don't fix it." But SQLAlchemy just looks so much nicer than a nasty block of hard-coded SQL.

Can anyone explain why the SQLAlchemy queries are running so slowly? The SQLAlchemy docs don't give much insight. I'm running SQLAlchemy version 1.2.11.

import sqlalchemy
sqlalchemy.__version__
'1.2.11'

Here are the relevant lines. I'm excluding the exports for brevity, but if anyone needs to see that I'll be happy to supply it.

import pandas as pd
from sqlalchemy import MetaData, Table, and_, create_engine, select

engine_dr2 = create_engine("mssql+pyodbc://{}:{}@SER-DR2".format(usr, pwd))
conn = engine_dr2.connect()

# Reflect the two tables from the live database.
metadata_dr2 = MetaData()
bv = Table('BarVisits', metadata_dr2, autoload=True, autoload_with=engine_dr2,
           schema='livecsdb.dbo')
bb = Table('BarBillsUB92Claims', metadata_dr2, autoload=True,
           autoload_with=engine_dr2, schema='livecsdb.dbo')

# 'data' is the dataframe pulled from the initial database earlier in the
# notebook; keep only the UCRNs destined for this server.
mask = data['UCRN'].str[:2].isin(['DA', 'DB', 'DC'])
dr2 = data.loc[mask, 'UCRN'].unique().tolist()

# Build the same join and IN filter as the raw query, in Core.
sql_dr2 = select([bv.c.AccountNumber.label('Account_Number'),
                  bb.c.UniqueClaimReferenceNumber.label('UCRN')])
sql_dr2 = sql_dr2.select_from(bv.join(bb, and_(bb.c.SourceID == bv.c.SourceID,
                                               bb.c.BillingID == bv.c.BillingID)))
sql_dr2 = sql_dr2.where(bb.c.UniqueClaimReferenceNumber.in_(dr2))

mapping_list = pd.read_sql(sql_dr2, conn)
conn.close()

The raw SQL query that should match sql_dr2, and runs lickety-split, is here:

"""SELECT Account_Number = z.AccountNumber, UCRN = y.UniqueClaimReferenceNumber 
FROM livecsdb.dbo.BarVisits z 
INNER JOIN 
livecsdb.dbo.BarBillsUB92Claims y
ON 
y.SourceID = z.SourceID 
AND 
y.BillingID = z.BillingID
WHERE
y.UniqueClaimReferenceNumber IN ({0})""".format(', '.join(["'" + acct + "'" for acct in dr2]))

The list dr2 typically contains upwards of 70,000 elements. Again, the raw SQL handles this in one minute or less. The SQLAlchemy rewrite has been running for 8+ hours now and still isn't done.
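For context, in_() renders one bound parameter per list element, so a 70,000-element list turns into a statement with 70,000 parameters. A toy sketch (not part of the notebook) that shows the rendering:

from sqlalchemy import column, select
from sqlalchemy.dialects import mssql

stmt = select([column('x')]).where(column('x').in_(['a', 'b', 'c']))
print(stmt.compile(dialect=mssql.dialect()))
# SELECT x
# WHERE x IN (:x_1, :x_2, :x_3)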

Update: Additional information is provided below. I don't own the database or the tables, and they contain protected health information, so it's not something I can share directly, but I'll see about making some mock data.

# insp is a SQLAlchemy inspector, e.g. insp = sqlalchemy.inspect(engine_dr2)
tables = ['BarVisits', 'BarBillsUB92Claims']
for t in tables:
    print(insp.get_foreign_keys(t))
# []
# []

for t in tables:
    print(insp.get_indexes(t))
# [{'name': 'BarVisits_SourceVisit', 'unique': False, 'column_names': ['SourceID', 'VisitID']}]
# []

for t in tables:
    print(insp.get_pk_constraint(t))
# {'constrained_columns': ['BillingID', 'SourceID'], 'name': 'mtpk_visits'}
# {'constrained_columns': ['BillingID', 'BillNumberID', 'ClaimID', 'ClaimInsuranceID', 'SourceID'], 'name': 'mtpk_insclaim'}

Thanks in advance for any insight.

I figured out how to make the query run fast, but I have no idea why it's needed with this server and not any others.

Taking

from sqlalchemy.dialects import mssql

sql_dr2 = str(sql_dr2.compile(dialect=mssql.dialect(),
                              compile_kwargs={"literal_binds": True}))

and sending that through

pd.read_sql(sql_dr2, conn)

performs the query in about 2 seconds.

Again, I have no idea why that works, but it does.
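If it helps the next person: my best guess, unverified, is that the speedup comes from literal_binds inlining the values into the SQL text instead of shipping one bound parameter per element of dr2 through pyodbc. A quick sketch to see the difference, where query stands for the Core select object from before the str() reassignment above:

from sqlalchemy.dialects import mssql

default_compiled = query.compile(dialect=mssql.dialect())
print(len(default_compiled.params))    # one bound parameter per element of dr2

literal_compiled = query.compile(dialect=mssql.dialect(),
                                 compile_kwargs={"literal_binds": True})
print(len(literal_compiled.params))    # expect 0: the values are inlined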
