
PostgreSQL batch insert performance issue (compared to MySQL)

I have a lot of data and I want to insert it into the DB in the least time possible. I did some tests. I created a table (using the script below, *1) with 21 columns: the first column is an int, the other 20 are strings. There are no indexes. I wrote test code that generates random values and inserts them into the DB (using the INSERT statement in *2). Before running the SQL commands I call conn.setAutoCommit(false), and afterwards conn.commit(). This operation takes about 6-7 seconds. The official documentation (*3) says to use the COPY command for bulk inserts, so I created a similar ASCII file and re-ran the test; it finished in about 5 seconds. Using the same test code on the same machine to insert the same data into MySQL, the test takes less than 1 second. I am really surprised by such a large performance difference compared to the 6-7 seconds. Does this difference really exist, or am I overlooking something?

Thanks for any help.

My test configuration is Solaris 10, PostgreSQL 9.0.2, and MySQL 5.0.85.

(*1) PostgreSQL CREATE TABLE script

CREATE TABLE tablo
(
  id integer,
  column1 character varying(50),
  column2 character varying(50),
  column3 character varying(50),
  ....
  column20 character varying(50)
)
WITH (
  OIDS=FALSE
);
ALTER TABLE tablo OWNER TO pgadmin;

(*2) Insert query

INSERT INTO tablo values (1,'column67062724628797','column26007603757271','column73982294239806','column43213154421324','column97722282440805','column79000889379973','column10680880337755','column14322827996050','column80720842739399','column22777514445036','column77771307997926','column92799724462613','column89992937353110','column61693061355353','column43804223262229','column62209656630047','column52150955786400','column85726157993572','column33358888005133','column77743799989746'),(2,'column77383691774831','column67841193885377','column36149612452454','column51161680852595','column91649734476301','column57283307765550','column14997046117948','column29457857794726','column91157683305554','column44413196495111','column40702778794938','column24744999726868','column38356057278249','column16808618337554','column64362413535503','column19577167594144','column72943639162993','column46830376244427','column01942608599939','column66479131355003'),
....
10K lines

(*3) Official PostgreSQL documentation: http://www.postgresql.org/docs/8.3/interactive/populate.html
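
For reference, a rough sketch of the COPY-based load for this table (the file path below is just a placeholder, and COPY's default tab-separated text format is assumed rather than the exact options used in the test):

-- Server-side COPY: the data file must be readable by the PostgreSQL server process.
COPY tablo FROM '/tmp/tablo_data.txt';

-- Or, from psql, a client-side \copy that reads the file on the client machine instead:
-- \copy tablo FROM '/tmp/tablo_data.txt'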

Seems odd that you're not seeing a speedup from things like COPY. I generated a script to create a similar table and populate it with 10,000 rows, and found that:

  1. Slowest: 1 row per insert, no transaction block, fsync=on
  2. Faster: wrap all statements (create table, insert....) in a transaction
  3. Same speed as 2: same as 1, but with fsync=off
  4. Fastest: restore using a script created by pg_dump to populate the table using COPY

Methods 2 and 3 were about 4 times faster than method 1. Method 4 was about 10 times faster than 2 or 3.
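
To make that concrete, here is roughly what methods 2 and 4 boil down to (the columns are cut down to keep the sketch short, and this is not the exact script I generated):

-- Method 2: wrap the whole load in a single transaction so there is only one commit.
BEGIN;
CREATE TABLE tablo (id integer, column1 varchar(50), column2 varchar(50));
INSERT INTO tablo VALUES (1, 'a', 'b');
-- ... the remaining 9,999 INSERT statements ...
COMMIT;

-- Method 4: a pg_dump-style restore streams every row through a single COPY FROM stdin,
-- with the data inline and terminated by a line containing only a backslash-period.
COPY tablo (id, column1, column2) FROM stdin WITH CSV;
2,c,d
3,e,f
\.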

If I import the same data into MySQL on my machine, it takes about half the time of methods 2 or 3. Dumping and reloading it: the same. Dumping with -e and reloading it: the same. Using InnoDB brought the time up to match methods 2 or 3.

So at least on my hardware/OS combination the speeds of the two are comparable. Of course I look after PostgreSQL's settings more carefully, but for a small table like this I wouldn't expect things like the buffer cache size to matter much.

Now, as to how good the JDBC support for doing batch inserts is, I have no idea. I did all these things using just the command-line clients.

There are two major considerations here:

  1. MySQL is designed for speed over everything else, and PostgreSQL is not; the default MySQL table type does not even handle transactions. These extra features supported by PostgreSQL can add a fair bit of overhead in some situations to make them possible.
  2. This is not a very common use case: inserting 10,000 entries at once is extremely uncommon in most applications. What is more common is inserting small chunks of data from different connections simultaneously; in that case PostgreSQL uses row-level locking, a feature not supported by the default MySQL table type, which (theoretically at least) should cause it to outperform MySQL.

So, unless bulk inserts from a single connection are the norm for your application, this test really doesn't show anything useful. It is more likely you will be using dozens of connections simultaneously to insert, query, and/or update small chunks of data.

  1. Create a parent table without any indexes, only the columns and their types (create table some_data (c_1 int, c_2 varchar, ...)).
  2. Create a sequence to enumerate the new data tables.
  3. Take a new id from the sequence.
  4. Create a new table for the real data with the LIKE keyword (create table some_data_X (like some_data)).
  5. Insert the real data into some_data_X with COPY in binary format.
  6. Create the indexes and constraints (put all your cores to work by using multiple connections to PostgreSQL).
  7. Make it inherit the parent table.
  8. Now it is ready to SELECT! (See the SQL sketch after this list.)
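
Putting those steps together, a minimal SQL sketch of one load cycle (the sequence name, the chunk number 7, and the binary data file path are all illustrative):

-- Steps 1-2: parent table with no indexes, plus a sequence to number the data tables.
CREATE TABLE some_data (c_1 int, c_2 varchar);
CREATE SEQUENCE some_data_seq;

-- Step 3: take a new id from the sequence (suppose it returns 7).
SELECT nextval('some_data_seq');

-- Steps 4-5: create the real data table and bulk-load it with binary COPY.
CREATE TABLE some_data_7 (LIKE some_data);
COPY some_data_7 FROM '/data/chunk_7.bin' WITH BINARY;

-- Step 6: build indexes and constraints (run several of these from separate connections).
CREATE INDEX some_data_7_c_1_idx ON some_data_7 (c_1);

-- Step 7: attach the chunk to the parent; step 8: queries on the parent now see it.
ALTER TABLE some_data_7 INHERIT some_data;
SELECT count(*) FROM some_data;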

Using this approach I have achieved 400,000-500,000 inserts per second, with index creation, on 10 columns (2 Xeons, 24 cores, 24 GB of memory, SSD).

Bonus: in a separate thread, remove the old data (the some_data_X with the smallest X): a huge circular buffer with indexing!
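
A sketch of that cleanup, assuming some_data_3 is currently the oldest chunk (the name is illustrative):

-- Detach the oldest chunk from the parent so queries on the parent stop seeing its rows...
ALTER TABLE some_data_3 NO INHERIT some_data;
-- ...then drop it outright; no row-by-row DELETE and no vacuuming of a big table needed.
DROP TABLE some_data_3;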
