
Java multi-threading insert millions of records in the database

I am new to Java and new to multithreading. Interviewers keep asking me the same question: "Given a CSV file with millions of records, how would you read the file in Java and insert the records into a database in the least time?" The interviewer then asks: how would you make use of concepts like 'multithreading, batching and spring' to solve the above problem?

I found the following code on the internet, but it does not look good. Are there any choices other than 'PreparedStatement'? I also cannot see any use of multithreading in the code below.

  BufferedReader in = new BufferedReader(new FileReader(file));
  java.util.List<String[]> allLines = new ArrayList<String[]>(); // used for something else
  String sql = "insert into test (a, b, c, d)" + " values (?,?,?,?)";
  PreparedStatement pstmt = conn.prepareStatement(sql);
  int i = 0;
  String line;
  while ((line = in.readLine()) != null) {
      line = line.trim().replaceAll(" +", " ");
      String[] sp = line.split(" ");
      String msg = line.substring(line.indexOf(sp[5]));
      allLines.add(new String[]{sp[0] + " " + sp[1], sp[4], sp[5], msg});
      pstmt.setString(1, sp[0] + " " + sp[1]);
      pstmt.setString(2, sp[4]);
      pstmt.setString(3, sp[5]);
      pstmt.setString(4, msg);
      pstmt.addBatch();
      i++;
      if (i % 1000 == 0) {
          pstmt.executeBatch();
          conn.commit();
      }
  }
  pstmt.executeBatch();
  conn.commit(); // flush the final partial batch
  in.close();

Not a real answer, but to give you some pointers:

Note that there is a configurable limit on a SQL server for the maximum packet size it can receive.

  • Ask what the properties of the CSV file are
    Whether you can assume that each entry represents something unique, rather than multiple lines representing the same database entry

  • Check what the primary key of that table is

If uniqueness is given, you can do the import in parallel (split the file). It is probably necessary to disable the primary key, so the database does not lock the insert commands.

If uniqueness is not given, you probably want to preprocess the files to make the entries unique.

  • Considering the batch size: well, I am no database expert, but I have learned it should be neither too large nor too small.

  • Not sure what you refer to with 'spring': the Spring framework, maybe?
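The preprocessing step mentioned above (making entries unique before import) can be sketched as follows. This is a minimal illustration, not code from the question: it assumes the key is the first space-separated field of each line, and keeps the last occurrence of each key.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class Dedupe {
    // Collapse lines that share the same key, preserving the order in
    // which keys first appear; the last line seen for a key wins.
    static List<String> uniqueByKey(List<String> lines) {
        Map<String, String> byKey = new LinkedHashMap<>();
        for (String line : lines) {
            String key = line.split(" ", 2)[0]; // assumed key column
            byKey.put(key, line);
        }
        return new ArrayList<>(byKey.values());
    }
}
```

With duplicates removed up front, each file chunk can then be imported independently without insert conflicts.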

SQL inserts lock the table from further operations until a commit is issued, so all inserts happen in FIFO order. Remember the ACID properties from school? Read up on them again.
Inserts cannot usefully be done from multiple threads. Those threads will, in turn, keep waiting to acquire a lock on the table, and you end up burning more time than you would in a plain for loop.

Bulk insert is a facility provided in Java to send multiple entries in one go; from the database side, however, it is still n inserts with one commit. It is provided to simplify programming.

Now, the solution.

To insert millions of records into a database table, you can do the following: create a dozen temp tables and a dozen threads, split your millions of records among those threads, each of which inserts its share into its own table, and at the end merge the data from all of these tables into your final table. You will be up to 12x faster than inserting in a single loop. The performance of this method also depends on your machine configuration: you need sufficient cores and sufficient memory to do this.

For better performance, none of these temp tables should have indexes, which slightly improves insert performance.

If you have a good server, go with 100 threads and 100 tables; you could be up to 100x faster than a single loop.
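The scheme above (N threads, N temp tables, final merge) can be sketched with an ExecutorService. This is an illustration of the thread/partition structure only: each per-worker List stands in for a temp table, and in a real import each worker would open its own JDBC Connection, batch-insert its slice, and commit once. The class and method names are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelImport {
    // Split `records` into `workers` chunks, load each chunk on its own
    // thread, then merge the per-worker results (the "merge temp tables
    // into the final table" step).
    static List<String> importInParallel(List<String> records, int workers)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        int chunk = (records.size() + workers - 1) / workers;
        List<Future<List<String>>> parts = new ArrayList<>();
        for (int w = 0; w < workers; w++) {
            int from = w * chunk;
            int to = Math.min(records.size(), from + chunk);
            if (from >= to) break;
            List<String> slice = records.subList(from, to);
            parts.add(pool.submit(() -> {
                // Worker: in a real import, batch-insert `slice` into
                // temp table number w here and commit once.
                return new ArrayList<>(slice);
            }));
        }
        List<String> finalTable = new ArrayList<>();
        for (Future<List<String>> f : parts) {
            finalTable.addAll(f.get()); // the merge step
        }
        pool.shutdown();
        return finalTable;
    }
}
```

Iterating the futures in submission order keeps the merged output in the original record order, which a real merge (e.g. INSERT ... SELECT per temp table) would not guarantee unless you order it explicitly.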

If you do this sort of thing on the live DB of a bank or retail company, you will be fired before EOD. Such high-impact operations must be planned in advance, must be communicated to the database administrators, and should proceed only after receiving an approval email.
