
Java multi-threading to insert millions of records into the database

I am new to Java and new to multithreading. Interviewers keep asking me the same question: "Given a CSV file with millions of records, how would you read the file in Java and insert the records into a database in as little time as possible?" The interviewer then asks: how would you use concepts like multithreading, batching, and Spring to solve this problem?

I found the following code on the internet, but it does not look good. Are there any alternatives to PreparedStatement? I also cannot see any use of multithreading in the code below.

  String line;  // the original snippet never declared this
  BufferedReader in = new BufferedReader(new FileReader(file));
  java.util.List<String[]> allLines = new ArrayList<String[]>(); // used for something else
  String sql = "insert into test (a, b, c, d)" + " values (?,?,?,?)";
  PreparedStatement pstmt = conn.prepareStatement(sql);
  int i = 0;
  while ((line = in.readLine()) != null) {
      line = line.trim().replaceAll(" +", " ");
      String[] sp = line.split(" ");
      String msg = line.substring(line.indexOf(sp[5]));
      allLines.add(new String[]{sp[0] + " " + sp[1], sp[4], sp[5], msg});
      pstmt.setString(1, sp[0] + " " + sp[1]);
      pstmt.setString(2, sp[4]);
      pstmt.setString(3, sp[5]);
      pstmt.setString(4, msg);
      pstmt.addBatch();            // queue the row instead of executing it
      i++;
      if (i % 1000 == 0) {         // flush every 1000 rows
          pstmt.executeBatch();
          conn.commit();
      }
  }
  pstmt.executeBatch();            // flush the remaining partial batch
  conn.commit();
  in.close();

Not a real answer, but to give you some pointers:

Note that most database servers have a configurable limit on the maximum packet size they can receive (MySQL's max_allowed_packet, for example), which caps how large a single batch can be

  • Ask what the properties of the CSV file are:
    whether you can assume that each line represents a unique entry, rather than multiple lines that represent the same database record

  • Check what the primary key of that table is

If uniqueness is given, you can do the import in parallel by splitting the file (see the sketch after this list). It is probably a must to disable the primary key during the load, so that the database does not lock the insert commands against each other.

If uniqueness is not given, you probably want to preprocess the file to make the entries unique.

  • Considering the batch size: I am no database expert, but what I have learned is that it should be neither too large nor too small.

  • Not sure what you are referring to with 'spring': the Spring Framework, maybe? Its JdbcTemplate and the Spring Batch project are the usual tools for this kind of job.
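
To make the "split the file and import in parallel" pointer concrete, here is a minimal sketch: each thread gets its own connection and its own batched PreparedStatement, so the threads never share JDBC objects. The table test(a, b, c, d), the JDBC URL, the credentials, the comma-separated layout, and the thread/batch counts are all illustrative assumptions, not recommendations.

  import java.nio.file.*;
  import java.sql.*;
  import java.util.*;
  import java.util.concurrent.*;

  public class ParallelCsvImport {

      static final String URL = "jdbc:mysql://localhost:3306/testdb"; // assumption
      static final String SQL = "insert into test (a, b, c, d) values (?,?,?,?)";
      static final int THREADS = 4;       // pick based on cores and DB capacity
      static final int BATCH_SIZE = 1000;

      public static void main(String[] args) throws Exception {
          // Loads the whole file for brevity; a real import of millions of
          // lines would stream fixed-size chunks to the workers instead.
          List<String> lines = Files.readAllLines(Paths.get("data.csv"));
          int chunk = (lines.size() + THREADS - 1) / THREADS;

          ExecutorService pool = Executors.newFixedThreadPool(THREADS);
          for (int t = 0; t < THREADS; t++) {
              int from = Math.min(t * chunk, lines.size());
              int to = Math.min(from + chunk, lines.size());
              List<String> slice = lines.subList(from, to);
              pool.submit(() -> insertSlice(slice)); // each worker owns its slice
          }
          pool.shutdown();
          pool.awaitTermination(1, TimeUnit.HOURS);
      }

      // One connection and one batched statement per thread: no shared JDBC state.
      static void insertSlice(List<String> slice) {
          try (Connection conn = DriverManager.getConnection(URL, "user", "pass");
               PreparedStatement ps = conn.prepareStatement(SQL)) {
              conn.setAutoCommit(false);
              int i = 0;
              for (String line : slice) {
                  String[] sp = line.split(",");   // assumes 4 comma-separated columns
                  for (int c = 0; c < 4; c++) {
                      ps.setString(c + 1, sp[c]);
                  }
                  ps.addBatch();
                  if (++i % BATCH_SIZE == 0) {     // flush a full batch, then commit
                      ps.executeBatch();
                      conn.commit();
                  }
              }
              ps.executeBatch();                   // flush the remaining rows
              conn.commit();
          } catch (SQLException e) {
              throw new RuntimeException(e);
          }
      }
  }

Whether this actually speeds anything up depends on the locking behavior of the target table, which is exactly why the questions about uniqueness and the primary key come first.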

SQL inserts lock the table (or at least the affected rows) against further operations until a commit is issued, so the inserts are effectively applied in FIFO order. Remember the ACID properties from school? Read them again.
Inserts cannot usefully be done using multiple threads: those threads will, in turn, keep waiting to gain a lock on the table, and you end up burning more time than you would in a plain for loop.

Bulk (batch) insert is a provision in JDBC to send multiple entries to the database in one go; from the database's side, however, it is still n inserts with one commit. It is provided mainly to cut down network round trips and simplify programming. A minimal example follows.
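
For reference, this is what such a batch insert looks like in plain JDBC; the table, URL, and credentials are assumptions. With MySQL's Connector/J driver you can additionally append rewriteBatchedStatements=true to the URL, which makes the driver rewrite the batch into a single multi-row INSERT on the wire; other drivers have their own equivalents, or none.

  import java.sql.*;
  import java.util.List;

  public class BatchInsertDemo {
      // Hypothetical URL; rewriteBatchedStatements=true is a MySQL Connector/J
      // option that collapses the batch into one multi-row INSERT on the wire.
      static final String URL =
              "jdbc:mysql://localhost:3306/testdb?rewriteBatchedStatements=true";

      static void batchInsert(List<String[]> rows) throws SQLException {
          try (Connection conn = DriverManager.getConnection(URL, "user", "pass");
               PreparedStatement ps = conn.prepareStatement(
                       "insert into test (a, b, c, d) values (?,?,?,?)")) {
              conn.setAutoCommit(false);
              for (String[] row : rows) {
                  for (int c = 0; c < 4; c++) ps.setString(c + 1, row[c]);
                  ps.addBatch();         // queue the row; nothing is sent yet
              }
              ps.executeBatch();         // one round trip, n inserts on the server
              conn.commit();             // a single commit covers all of them
          }
      }
  }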

Now, the Solution.

To insert millions of records into a database table, you could do the following: create a dozen temp (staging) tables and a dozen threads, split your millions of records among those threads, and let each thread insert its share into its own table. At the end, merge the data from all the staging tables into your final table (a sketch of the merge step appears below). You can be roughly 12x faster than inserting in a single loop. The performance of this method also depends on your machine configuration: you need enough cores and enough memory.

For better performance, none of these staging tables should have indexes, since every index has to be updated on each insert; add the indexes to the final table only after the merge.
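
A minimal sketch of the merge step under this scheme, assuming the staging tables are named temp_0 through temp_11 (the table names and final_table are invented for illustration):

  import java.sql.*;

  public class MergeStagingTables {
      // After the dozen worker threads have finished loading temp_0 .. temp_11,
      // copy everything into the final table in a single transaction.
      static void merge(Connection conn, int tableCount) throws SQLException {
          conn.setAutoCommit(false);
          try (Statement st = conn.createStatement()) {
              for (int t = 0; t < tableCount; t++) {
                  st.executeUpdate("insert into final_table select * from temp_" + t);
                  st.executeUpdate("drop table temp_" + t); // discard the staging table
              }
          }
          conn.commit(); // one commit after all staging tables are merged
      }
  }

The merge itself runs inside the database, so it is one INSERT ... SELECT per staging table rather than millions of single-row round trips.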

If you have a good server, go with 100 threads and 100 tables; by the same logic you can be up to 100x faster than a single loop.

If you do such stuff on the live DB of any bank or retail company, you will be fired before EOD. High-load operations like this must be planned in advance, must be communicated to the database administrators, and should proceed only after receiving an approval email.
