
Is it worth opening multiple database connections for SQL insertion?

I am writing a project that fetches data in bulk. Currently I am using .NET Framework 4.8 and the MySQL package to open a connection and insert data into the database server.

I am going to insert around 400,000 rows/second. I am concerned that the SQL connection may become the bottleneck of my program. I would like to know: if I open connections to the database from multiple threads and insert the data using a consumer queue, would it be faster, and is it worth it (pros and cons)?

Intuitively I think it would be faster, but I am not sure how much performance it would gain once the overhead of the threads is taken into account. I am not a SQL expert, so it would be nice if someone could explain the pros and cons of opening multiple connections to a SQL database on multiple threads.

Rumors, opinions, hearsay, facts, version-dependent benchmarks, some personal experience, etc...

Multiple threads will improve throughput, but with limits:

  • Throughput is capped at about half the theoretical limit. (Your "certain percentage".) (This is based on benchmarks from a multi-threaded package; I forget the name; it was a decade ago.)
  • Multiple threads will compete with each other over mutexes and other necessary locking mechanisms.
  • As of about MySQL 5.7, 64 threads was the practical limit for multi-threading; above that, throughput stalled or even dropped. (Source: many Oracle benchmarks bragging about how much better one version was than the previous.) Meanwhile, latency for each thread went through the roof.
  • Each thread should batch the data if possible.
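
A minimal sketch of the consumer-queue layout with per-thread batching, written in Python rather than .NET so that it runs without a database. The names `run_pipeline`, `worker`, and `make_flush` are hypothetical: `make_flush` stands in for "open one connection for this thread and return a function that inserts a batch". The batch size and worker count are assumptions, to be tuned by measurement.

```python
import queue
import threading

BATCH_SIZE = 500   # assumed value; "hundreds" of rows per batch
NUM_WORKERS = 4    # assumed value; measure before raising it
_STOP = object()   # sentinel that tells one worker to exit


def worker(rows, make_flush):
    """Consume rows from the queue and hand them to flush() in batches."""
    flush = make_flush()  # hypothetical: one connection per worker thread
    batch = []
    while True:
        item = rows.get()
        if item is _STOP:
            break
        batch.append(item)
        if len(batch) >= BATCH_SIZE:
            flush(batch)
            batch = []
    if batch:
        flush(batch)  # final partial batch


def run_pipeline(source, make_flush, num_workers=NUM_WORKERS):
    """Feed every row from `source` to a pool of batching workers."""
    # A bounded queue makes the producer wait when the workers fall behind.
    rows = queue.Queue(maxsize=10 * BATCH_SIZE)
    threads = [
        threading.Thread(target=worker, args=(rows, make_flush))
        for _ in range(num_workers)
    ]
    for t in threads:
        t.start()
    for row in source:
        rows.put(row)
    for _ in threads:
        rows.put(_STOP)  # each worker consumes exactly one sentinel
    for t in threads:
        t.join()
```

Keeping the worker count as a parameter makes it easy to benchmark 1, 2, 4, 8, ... connections and stop where throughput flattens.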

Batching:

  • LOAD DATA is the fastest way to INSERT lots of rows from a single thread at a single time. But if you include the cost of writing the file to LOAD, that may make it effectively slower than batched inserting.
  • Batched INSERT is a close second. But it caps out at "hundreds" of rows, when it hits either some limit or "diminishing returns".
  • Batched inserting is about 10 times as fast as inserting one row per INSERT query. So it (or LOAD DATA) is worth using for high-speed ingestion. (Source: many different timed tests.)
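
To illustrate what a batched INSERT looks like, here is a small sketch that builds one multi-row statement plus a flat parameter list. The helper name `build_batched_insert` and the table and column names in the usage note are made up for illustration. The `%s` placeholder style is the one used by the common Python MySQL drivers; other client libraries use different markers.

```python
def build_batched_insert(table, columns, rows):
    """Return (sql, params) for a single multi-row INSERT statement.

    `table` and `columns` are interpolated directly into the SQL text,
    so they must come from trusted code, never from user input.
    """
    if not rows:
        raise ValueError("need at least one row")
    # One "(%s, %s, ...)" group per row.
    row_group = "(" + ", ".join(["%s"] * len(columns)) + ")"
    sql = "INSERT INTO {} ({}) VALUES {}".format(
        table,
        ", ".join(columns),
        ", ".join([row_group] * len(rows)),
    )
    # Flatten the row values in the same order as the placeholders.
    params = [value for row in rows for value in row]
    return sql, params
```

For example, `build_batched_insert("readings", ["sensor_id", "value"], [(1, 2.5), (2, 3.5)])` produces a statement with two value groups and the parameters `[1, 2.5, 2, 3.5]`, which can then be passed to a driver's `execute` call.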

Source of data:

  • Some data sources necessarily deliver only one row at a time (e.g., sensor data from vehicles every N seconds). This begs for some intermediate layer to batch the data.
  • A discussion of gathering data: http://mysql.rjweb.org/doc.php/staging_table
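
One possible shape for such an intermediate layer is a buffer that collects single rows and flushes when it is either full or too old. This is only a sketch: the class name `RowBuffer` and its thresholds are assumptions, and the age check runs only when a new row arrives, so a real service would also need a timer or a final `flush()` on shutdown.

```python
import time


class RowBuffer:
    """Collect single rows; hand them to `flush` as a batch by size or age."""

    def __init__(self, flush, max_rows=500, max_age_s=1.0, clock=time.monotonic):
        self._flush = flush          # callable that receives a list of rows
        self._max_rows = max_rows    # assumed threshold
        self._max_age_s = max_age_s  # assumed threshold
        self._clock = clock          # injectable so tests can fake time
        self._rows = []
        self._oldest = None          # arrival time of the oldest buffered row

    def add(self, row):
        if not self._rows:
            self._oldest = self._clock()
        self._rows.append(row)
        too_many = len(self._rows) >= self._max_rows
        too_old = self._clock() - self._oldest >= self._max_age_s
        if too_many or too_old:
            self.flush()

    def flush(self):
        """Emit whatever is buffered; does nothing when the buffer is empty."""
        if self._rows:
            batch, self._rows = self._rows, []
            self._flush(batch)
```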

What happens after loading the data? Surely this is not a write-only-never-read table.

  • Normalization is useful for shrinking the disk footprint; this is best done in batches. See Normalization.
  • PARTITIONing is rarely useful, except for eventual purging of old data. See Partition.
  • A huge 'Fact' table is hard to search; consider building summary data as you ingest the data. See Summary Tables.
  • It may even be practical to do the above processing, then toss the raw data. It sounds like you might be acquiring a terabyte of data per day. (400,000 rows/second is about 35 billion rows per day, so even roughly 30 bytes per row adds up to a terabyte.)
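
As a rough illustration of building summary data during ingestion, the sketch below rolls a batch of raw rows up into per-sensor, per-hour counts and sums before anything is written. The row layout `(sensor_id, epoch_seconds, value)` and the function name `summarize` are assumptions made for the example and do not come from the question.

```python
from collections import defaultdict


def summarize(rows):
    """Aggregate (sensor_id, epoch_seconds, value) rows per sensor and hour.

    Returns {(sensor_id, hour_start): (row_count, value_sum)}.
    """
    totals = defaultdict(lambda: [0, 0.0])
    for sensor_id, epoch_s, value in rows:
        hour_start = epoch_s - epoch_s % 3600  # truncate to the hour
        entry = totals[(sensor_id, hour_start)]
        entry[0] += 1
        entry[1] += value
    return {key: (count, total) for key, (count, total) in totals.items()}
```

Each resulting entry could then be added to a summary table row (for example with INSERT ... ON DUPLICATE KEY UPDATE), so reports read the much smaller summary rather than the raw Fact table.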


 