

Is it worth opening multiple database connections for SQL insertion?

Asked 2022-06-28 02:05:12 · Tags: c#, mysql, database, performance, connection

I am writing a project related to mass data fetching. Currently I am using .NET Framework 4.8 and the MySQL package to open a connection and insert data into the database server.

I am going to insert around 400,000 rows per second. I am concerned that the SQL connection may become the bottleneck of my program. I would like to know: if I open connections to the database from multiple threads and insert the data using a consumer queue, would it be faster, and is it worth it (pros and cons)?

Intuitively it should be faster, but I am not sure how much performance it would gain once the thread overhead is taken into account. I am not a SQL expert, so it would be nice if someone could explain the pros and cons of opening multiple connections to a SQL database on multiple threads.

1 Answer

Rumors, opinions, hearsay, facts, version-dependent benchmarks, some personal experience, etc...

Multiple threads will improve throughput, but with limits:

Throughput is capped at about half the theoretical limit. (Your "certain percentage") (This is based on benchmarks from a multi-threaded package; I forget the name; it was a decade ago.)
Multiple threads will compete with each other over mutexes and other necessary locking mechanisms.
As of about MySQL 5.7, 64 threads was MySQL's limit for multi-threading; above that, throughput stalled or even dropped. (Source: many Oracle benchmarks bragging about how much better one version was than the previous.) (Meanwhile, latency for each thread went through the roof.)
Each thread should batch the data if possible.
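
The consumer-queue idea from the question, combined with per-thread batching, can be sketched roughly as follows. This is a minimal illustration in Python using only the standard library, not a tuned implementation: `BATCH_SIZE`, `NUM_WORKERS`, the queue size, and the `make_flush` callback are all assumptions made for the example. In a real program `make_flush` would open one database connection per worker and return a function that sends a batch over it.

```python
import queue
import threading

BATCH_SIZE = 500   # assumed; the answer only says batches of "hundreds" of rows
NUM_WORKERS = 4    # assumed; stay well below the ~64-thread ceiling mentioned above
STOP = object()    # sentinel that tells a worker to finish


def worker(rows, make_flush):
    """Drain `rows`, sending one batch at a time through this worker's own flush."""
    flush = make_flush()  # hypothetical hook: one connection per worker thread
    batch = []
    while True:
        item = rows.get()
        if item is STOP:
            break
        batch.append(item)
        if len(batch) >= BATCH_SIZE:
            flush(batch)
            batch = []
    if batch:  # send the final partial batch
        flush(batch)


def run(source, make_flush, num_workers=NUM_WORKERS):
    """Feed every row from `source` to the workers through a bounded queue."""
    # A bounded queue makes the producer wait when the database falls behind.
    rows = queue.Queue(maxsize=10_000)
    threads = [
        threading.Thread(target=worker, args=(rows, make_flush))
        for _ in range(num_workers)
    ]
    for t in threads:
        t.start()
    for row in source:
        rows.put(row)
    for _ in threads:  # one sentinel per worker
        rows.put(STOP)
    for t in threads:
        t.join()
```

The number of workers is the knob to benchmark: given the locking contention described above, adding threads is likely to help only up to a point.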

Batching:

LOAD DATA is the fastest way to insert lots of rows at once from a single thread. But if you include the cost of first writing the file that LOAD DATA reads, that may make it effectively slower than batched inserting.
Batched INSERT is a close second. But it caps out at "hundreds" of rows per statement, where it hits either some limit or "diminishing returns".
Batched inserting is 10 times as fast as inserting one row per INSERT query. So it (or LOAD DATA) is worth using for high-speed ingestion. (Source: many different timed tests.)
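
To make "batched INSERT" concrete: one statement carries many rows in its VALUES list. At 400,000 rows per second, batches of 500 rows mean about 800 statements per second instead of 400,000. The helper below is a sketch in Python; the table name `readings`, its columns, and the batch size are hypothetical, and the `%s` placeholder style is the one used by the common Python MySQL drivers.

```python
def build_batched_insert(table, columns, rows):
    """Return (sql, params) for one multi-row INSERT with %s placeholders.

    `table` and `columns` are interpolated directly into the SQL text, so
    they must come from trusted code, never from user input.
    """
    if not rows:
        raise ValueError("rows must not be empty")
    row_placeholder = "(" + ", ".join(["%s"] * len(columns)) + ")"
    sql = "INSERT INTO {} ({}) VALUES {}".format(
        table,
        ", ".join(columns),
        ", ".join([row_placeholder] * len(rows)),
    )
    # Flatten the rows so the values line up with the placeholders in order.
    params = [value for row in rows for value in row]
    return sql, params


def chunked(rows, size):
    """Yield consecutive slices of `rows`, each with at most `size` items."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


# Hypothetical usage: two rows become a single statement.
sql, params = build_batched_insert(
    "readings", ["sensor_id", "value"], [(1, 2.5), (2, 3.5)]
)
```

Each chunk produced by `chunked` would be passed to `build_batched_insert` and executed as one statement; the best chunk size is something to measure rather than assume.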

Source of data:

Some data sources necessarily deliver only one row at a time (e.g., sensor data from vehicles every N seconds). This begs for some intermediate layer to batch the data.
A discussion of gathering data: http://mysql.rjweb.org/doc.php/staging_table
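
One possible shape for such an intermediate layer is a small buffer that collects single rows and hands them on as a batch once it is either full or old enough. The sketch below is an illustration in Python, not the staging-table technique from the linked page; the class name `RowBuffer`, its thresholds, and the `flush` callback are all assumptions.

```python
import time


class RowBuffer:
    """Collect single rows and pass them on in batches.

    A batch is flushed when it reaches `max_rows`, or when its oldest row
    has waited `max_age_seconds`. The age is only checked when a new row
    arrives, so a source that can go quiet would also need a periodic
    call to flush().
    """

    def __init__(self, flush, max_rows=500, max_age_seconds=1.0,
                 clock=time.monotonic):
        self._flush = flush          # callable that receives a list of rows
        self._max_rows = max_rows
        self._max_age = max_age_seconds
        self._clock = clock          # injectable so the logic can be tested
        self._rows = []
        self._first_at = None        # arrival time of the oldest buffered row

    def add(self, row):
        if not self._rows:
            self._first_at = self._clock()
        self._rows.append(row)
        too_many = len(self._rows) >= self._max_rows
        too_old = self._clock() - self._first_at >= self._max_age
        if too_many or too_old:
            self.flush()

    def flush(self):
        """Hand over whatever is buffered; do nothing if the buffer is empty."""
        if self._rows:
            rows, self._rows = self._rows, []
            self._flush(rows)
```

The age limit bounds how stale a row can get before it reaches the database, at the cost of smaller batches when the source is slow.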

What happens after loading the data? Surely this is not a write-only-never-read table.

Normalization is useful for shrinking the disk footprint; this is best done in batches. See Normalization
PARTITIONing is rarely useful, except for eventual purging of old data. See Partition
A huge 'Fact' table is hard to search; consider building Summary data as you ingest the data: Summary Tables
It may even be practical to do the above processing, then toss the raw data. It sounds like you might be acquiring a terabyte of data per day.
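
As a rough illustration of building summary data while ingesting, each batch can be rolled up in memory before it is written, and the rollup merged into a summary table. The sketch below is in Python; the row layout `(sensor_id, unix_ts, value)`, the hourly granularity, and the table `readings_hourly` are hypothetical. Storing a count and a sum, rather than an average, keeps the summary rows mergeable.

```python
from collections import defaultdict


def summarize(batch):
    """Roll up (sensor_id, unix_ts, value) rows into per-sensor hourly totals.

    Returns a sorted list of (sensor_id, hour_start, count, total) tuples.
    """
    totals = defaultdict(lambda: [0, 0.0])  # (sensor_id, hour_start) -> [count, sum]
    for sensor_id, unix_ts, value in batch:
        hour_start = unix_ts - unix_ts % 3600  # truncate to the start of the hour
        entry = totals[(sensor_id, hour_start)]
        entry[0] += 1
        entry[1] += value
    return [
        (sensor_id, hour_start, count, total)
        for (sensor_id, hour_start), (count, total) in sorted(totals.items())
    ]


# Assumed summary table with a unique key on (sensor_id, hour_start), so that
# repeated batches for the same hour add to the existing row.
UPSERT_SQL = (
    "INSERT INTO readings_hourly (sensor_id, hour_start, cnt, total) "
    "VALUES (%s, %s, %s, %s) "
    "ON DUPLICATE KEY UPDATE cnt = cnt + VALUES(cnt), total = total + VALUES(total)"
)
```

Queries for hourly averages would then read `total / cnt` from the small summary table instead of scanning the raw rows.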