
MySQL Inserting large data sets from file with Java

I need to insert about 1.8 million rows from a CSV file into a MySQL database. (only one table)

Currently using Java to parse through the file and insert each line.

As you can imagine this takes quite a few hours to run (roughly 10).

The reason I'm not piping it straight from the file into the db is that the data has to be manipulated before it's added to the database.

This process needs to be run by an IT manager there. So I've set it up as a nice batch file for them to run after they drop the new CSV file into the right location. So, I need to make this work nicely by dropping the file into a certain location and running a batch file. (Windows environment)

My question is: what would be the fastest way to insert this much data? Large inserts from a temp parsed file, or one insert at a time? Some other idea, possibly?

The second question is: how can I optimize my MySQL installation to allow very quick inserts? (There will be a point where a large SELECT of all the data is required as well.)

Note: the table will eventually be dropped and the whole process run again at a later date.

Some clarification: currently using ...opencsv.CSVReader to parse the file, then doing an insert on each line. I'm concatenating some columns, though, and ignoring others.

More clarification: local DB, MyISAM table.

Tips for fast insertion:

  • Use the LOAD DATA INFILE syntax to let MySQL parse it and insert it, even if you have to mangle it and feed it after the manipulation.
  • Use this insert syntax (a sketch of building it from Java follows this list):

    insert into table (col1, col2) values (val1, val2), (val3, val4), ...

  • Remove all keys/indexes prior to insertion.
  • Do it on the fastest machine you've got (IO-wise mainly, but RAM and CPU also matter). This applies to both the DB server and the inserting client; remember you'll be paying the IO price twice (once reading, once inserting).
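
As a rough illustration of the multi-row syntax above, here is a minimal sketch that builds one such statement from Java; the table name mytable and the two-column layout are hypothetical placeholders:

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;
    import java.util.List;

    class MultiRowInsert {
        // Insert a chunk of already-massaged (col1, col2) pairs as a single
        // multi-row INSERT; one statement per chunk keeps round trips low.
        static void insertChunk(Connection con, List<String[]> chunk) throws SQLException {
            StringBuilder sql = new StringBuilder("INSERT INTO mytable (col1, col2) VALUES ");
            for (int i = 0; i < chunk.size(); i++) {
                sql.append(i == 0 ? "(?, ?)" : ", (?, ?)");
            }
            try (PreparedStatement ps = con.prepareStatement(sql.toString())) {
                int p = 1;
                for (String[] row : chunk) {
                    ps.setString(p++, row[0]); // col1
                    ps.setString(p++, row[1]); // col2
                }
                ps.executeUpdate();
            }
        }
    }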

I'd probably pick a large number, like 10k rows, load that many rows from the CSV, massage the data, and do a batch update, then repeat until you've gone through the entire CSV. Depending on the massaging/amount of data, 1.8 million rows shouldn't take 10 hours; more like 1-2 hours depending on your hardware.

edit: whoops, I left out a fairly important part: your connection has to have autocommit set to false. The code I copied this from was doing it as part of the GetConnection() method.

    Connection con = GetConnection();
    con.setAutoCommit(false); // batch the inserts into one transaction
    try {
        PreparedStatement ps = con.prepareStatement("INSERT INTO table(col1, col2) VALUES(?, ?)");
        try {
            for (Data d : massagedData) {
                ps.setString(1, d.whatever());
                ps.setString(2, d.whatever2());
                ps.addBatch();
            }
            ps.executeBatch();
            con.commit(); // autocommit is off, so commit explicitly
        } finally {
            ps.close();
        }
    } finally {
        con.close();
    }

Are you absolutely CERTAIN you have disabled auto-commit in the JDBC driver?

This is the typical performance killer for JDBC clients.

You should really use LOAD DATA on the MySQL console itself for this and not work through the code...

LOAD DATA INFILE 'data.txt' INTO TABLE db2.my_table;

If you need to manipulate the data, I would still recommend manipulating it in memory, rewriting it to a flat file, and pushing it to the database using LOAD DATA; I think it should be more efficient.
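
If you go this route, a minimal sketch of the write-then-load approach might look like the following; the connection URL, credentials, table and column names are placeholders, and allowLoadLocalInfile must be enabled for Connector/J to permit LOCAL loads:

    import java.io.PrintWriter;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    class LoadDataAfterMassage {
        public static void main(String[] args) throws Exception {
            // Write the massaged rows to a temp file in a simple
            // tab-separated layout that LOAD DATA parses by default.
            Path tmp = Files.createTempFile("massaged", ".tsv");
            try (PrintWriter out = new PrintWriter(Files.newBufferedWriter(tmp))) {
                // ... parse the source CSV here, concatenate/drop columns,
                // then write one line per row, e.g.:
                out.println("value1\tvalue2");
            }

            try (Connection con = DriverManager.getConnection(
                    "jdbc:mysql://localhost/mydb?allowLoadLocalInfile=true", "user", "pass");
                 Statement st = con.createStatement()) {
                // Forward slashes keep the path literal valid on Windows too.
                st.execute("LOAD DATA LOCAL INFILE '"
                        + tmp.toAbsolutePath().toString().replace('\\', '/')
                        + "' INTO TABLE my_table (col1, col2)");
            }
        }
    }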

Another thought: are you using PreparedStatement to insert the data via JDBC?

Depending on what exactly you need to do with the data prior to inserting it, your best options in terms of speed are:

  • Parse the file in Java / do what you need with the data / write the "massaged" data out to a new CSV file / use "load data infile" on that.
  • If your data manipulation is conditional (e.g. you need to check for record existence and do different things based on whether it's an insert or an update, etc...) then (1) may be impossible. In that case you're best off doing batch inserts/updates.
    Experiment to find the best batch size for you (starting with about 500-1000 should be OK). Depending on the storage engine you're using for your table, you may need to split this into multiple transactions as well; having a single one span 1.8M rows isn't going to do wonders for performance.
  • Your biggest performance problem is most likely not Java but MySQL, in particular any indexes, constraints, and foreign keys on the table you are inserting into. Before you begin your inserts, make sure you disable them. Re-enabling them at the end will take a considerable amount of time, but it is far more efficient than having the database evaluate them after each statement.

You may also be seeing MySQL performance problems due to the size of your transaction. Your transaction log will grow very large with that many inserts, so performing a commit after X number of inserts (say 10,000-100,000) will help insert speed as well.

From the JDBC layer, make sure you are using the addBatch() and executeBatch() commands on your PreparedStatement rather than the normal executeUpdate().
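
Putting those two points together, a minimal sketch (table and column names are placeholders) that batches through a PreparedStatement and commits every 10,000 rows:

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;

    class ChunkedBatchInsert {
        static final int COMMIT_EVERY = 10_000; // tune to your hardware

        static void insertAll(Connection con, Iterable<String[]> rows) throws SQLException {
            con.setAutoCommit(false);
            try (PreparedStatement ps =
                     con.prepareStatement("INSERT INTO my_table (col1, col2) VALUES (?, ?)")) {
                int n = 0;
                for (String[] row : rows) {
                    ps.setString(1, row[0]);
                    ps.setString(2, row[1]);
                    ps.addBatch();
                    if (++n % COMMIT_EVERY == 0) {
                        ps.executeBatch(); // send the pending inserts
                        con.commit();      // keep each transaction bounded
                    }
                }
                ps.executeBatch(); // flush the final partial batch
                con.commit();
            }
        }
    }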

You can improve bulk INSERT performance from MySQL/Java by using the batching capability in its Connector/J JDBC driver.

MySQL doesn't "properly" handle batches (see my article link, bottom), but it can rewrite INSERTs to make use of quirky MySQL syntax. E.g. you can tell the driver to rewrite two INSERTs:

    INSERT INTO t (col1, col2) VALUES ('val1', 'val2');
    INSERT INTO t (col1, col2) VALUES ('val3', 'val4');
    

as a single statement:

    INSERT INTO t (col1, col2) VALUES ('val1', 'val2'), ('val3', 'val4');
    

(Note that I'm not saying you need to rewrite your SQL in this way; the driver does it when it can.)

We did this for a bulk insert investigation of our own: it made an order of magnitude of difference. Use it with explicit transactions, as mentioned by others, and you'll see a big improvement overall.

The relevant driver property setting is:

    jdbc:mysql:///<dbname>?rewriteBatchedStatements=true
    

See: A 10x Performance Increase for Batch INSERTs With MySQL Connector/J Is On The Way
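
For what it's worth, enabling the property is just a URL flag; a minimal sketch (host, database, and credentials are placeholders):

    import java.sql.Connection;
    import java.sql.DriverManager;

    class RewriteBatchDemo {
        public static void main(String[] args) throws Exception {
            // rewriteBatchedStatements=true lets Connector/J collapse a batch
            // of single-row INSERTs into multi-row INSERTs on the wire.
            try (Connection con = DriverManager.getConnection(
                    "jdbc:mysql://localhost:3306/mydb?rewriteBatchedStatements=true",
                    "user", "pass")) {
                con.setAutoCommit(false);
                // ... addBatch()/executeBatch() as in the earlier snippets ...
                con.commit();
            }
        }
    }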

Wouldn't it be faster to use LOAD DATA INFILE instead of inserting each row?

I would run three threads...

1) Reads the input file and pushes each row onto a transformation queue
2) Pops from the queue, transforms the data, and pushes onto a DB queue
3) Pops from the DB queue and inserts the data

In this manner, you can be reading data from disk while the DB threads are waiting for their IO to complete, and vice versa.
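
A minimal sketch of that pipeline with blocking queues; the massage() transform and the reader/writer bodies are placeholders to fill in:

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    class PipelineSketch {
        static final String[] EOF = new String[0]; // poison pill marking end-of-stream

        public static void main(String[] args) throws Exception {
            BlockingQueue<String[]> toTransform = new ArrayBlockingQueue<>(1000);
            BlockingQueue<String[]> toInsert = new ArrayBlockingQueue<>(1000);

            Thread reader = new Thread(() -> {
                // read CSV rows (e.g. with opencsv) and put() them on
                // toTransform, then put(EOF) to signal completion
            });
            Thread transformer = new Thread(() -> {
                try {
                    for (String[] row; (row = toTransform.take()) != EOF; ) {
                        toInsert.put(massage(row)); // concat/drop columns here
                    }
                    toInsert.put(EOF);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            Thread writer = new Thread(() -> {
                // take() rows until EOF and addBatch()/executeBatch() them
            });

            reader.start(); transformer.start(); writer.start();
            reader.join(); transformer.join(); writer.join();
        }

        static String[] massage(String[] row) { return row; } // placeholder transform
    }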

If you're not already, try using the MyISAM table type; just be sure to read up on its shortcomings before you do. It is generally faster than the other types of tables.

If your table has indexes, it is usually faster to drop them, then add them back after the import.
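
For a MyISAM table you can get much of that benefit without dropping the indexes outright; a minimal sketch (my_table is a placeholder) using DISABLE KEYS, which defers non-unique index maintenance until ENABLE KEYS rebuilds them in one pass:

    import java.sql.Connection;
    import java.sql.Statement;

    class ToggleKeys {
        static void importWithKeysDisabled(Connection con, Runnable doImport) throws Exception {
            try (Statement st = con.createStatement()) {
                st.execute("ALTER TABLE my_table DISABLE KEYS");
                try {
                    doImport.run(); // the bulk insert from the snippets above
                } finally {
                    st.execute("ALTER TABLE my_table ENABLE KEYS");
                }
            }
        }
    }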

If your data is all strings but is better suited to a relational database, you'll be better off inserting integers that reference other values rather than storing long strings.

But in general, yes, adding data to a database takes time.

This is an interesting read: http://dev.mysql.com/doc/refman/5.1/en/insert-speed.html
