
PostgreSQL/JooQ bulk insertion performance issues when loading from CSV; how do I improve the process?

For this project, I intend to make a web version and am right now working on making a PostgreSQL (9.x) backend from which the webapp will query.

Right now, what happens is that the tracer generates a zip file with two CSVs in it, and loads it into an H2 database at runtime whose schema is this (and yes, I'm aware that the SQL could be written a little better):

create table matchers (
    id integer not null,
    class_name varchar(255) not null,
    matcher_type varchar(30) not null,
    name varchar(1024) not null
);

alter table matchers add primary key(id);

create table nodes (
    id integer not null,
    parent_id integer not null,
    level integer not null,
    success integer not null,
    matcher_id integer not null,
    start_index integer not null,
    end_index integer not null,
    time bigint not null
);

alter table nodes add primary key(id);
alter table nodes add foreign key (matcher_id) references matchers(id);
create index nodes_parent_id on nodes(parent_id);
create index nodes_indices on nodes(start_index, end_index);

Now, since the PostgreSQL database will be able to handle more than one trace, I had to add a further table; the schema on the PostgreSQL backend looks like this (again, the SQL could be written better; also, in the parse_info table, the content column contains the full text of the parsed file, which is stored separately in the zip):

create table parse_info (
    id uuid primary key,
    date timestamp not null,
    content text not null
);

create table matchers (
    parse_info_id uuid references parse_info(id),
    id integer not null,
    class_name varchar(255) not null,
    matcher_type varchar(30) not null,
    name varchar(1024) not null,
    unique (parse_info_id, id)
);

create table nodes (
    parse_info_id uuid references parse_info(id),
    id integer not null,
    parent_id integer not null,
    level integer not null,
    success integer not null,
    matcher_id integer not null,
    start_index integer not null,
    end_index integer not null,
    time bigint not null,
    unique (parse_info_id, id)
);

alter table nodes add foreign key (parse_info_id, matcher_id)
    references matchers(parse_info_id, id);
create index nodes_parent_id on nodes(parent_id);
create index nodes_indices on nodes(start_index, end_index);

Now, what I am currently doing is taking existing zip files and inserting them into a PostgreSQL database; I'm using JooQ and its CSV loading API.

The process is a little complicated... Here are the current steps:

  • a UUID is generated;
  • I read the necessary info from the zip (parse date, input text) and write the record in the parse_info table;
  • I create temporary copies of the CSVs in order for the JooQ loading API to be able to use them (see after the code extract as to why);
  • I insert all matchers, then all nodes.

Here is the code:

public final class Zip2Db2
{
    private static final Pattern SEMICOLON = Pattern.compile(";");
    private static final Function<String, String> CSV_ESCAPE
        = TraceCsvEscaper.ESCAPER::apply;

    // Paths in the zip to the different components
    private static final String INFO_PATH = "/info.csv";
    private static final String INPUT_PATH = "/input.txt";
    private static final String MATCHERS_PATH = "/matchers.csv";
    private static final String NODES_PATH = "/nodes.csv";

    // Fields to use for matchers zip insertion
    private static final List<Field<?>> MATCHERS_FIELDS = Arrays.asList(
        MATCHERS.PARSE_INFO_ID, MATCHERS.ID, MATCHERS.CLASS_NAME,
        MATCHERS.MATCHER_TYPE, MATCHERS.NAME
    );

    // Fields to use for nodes zip insertion
    private static final List<Field<?>> NODES_FIELDS = Arrays.asList(
        NODES.PARSE_INFO_ID, NODES.PARENT_ID, NODES.ID, NODES.LEVEL,
        NODES.SUCCESS, NODES.MATCHER_ID, NODES.START_INDEX, NODES.END_INDEX,
        NODES.TIME
    );

    private final FileSystem fs;
    private final DSLContext jooq;
    private final UUID uuid;

    private final Path tmpdir;

    public Zip2Db2(final FileSystem fs, final DSLContext jooq, final UUID uuid)
        throws IOException
    {
        this.fs = fs;
        this.jooq = jooq;
        this.uuid = uuid;

        tmpdir = Files.createTempDirectory("zip2db");
    }

    public void removeTmpdir()
        throws IOException
    {
        // From java7-fs-more (https://github.com/fge/java7-fs-more)
        MoreFiles.deleteRecursive(tmpdir, RecursionMode.KEEP_GOING);
    }

    public void run()
    {
        time(this::generateMatchersCsv, "Generate matchers CSV");
        time(this::generateNodesCsv, "Generate nodes CSV");
        time(this::writeInfo, "Write info record");
        time(this::writeMatchers, "Write matchers");
        time(this::writeNodes, "Write nodes");
    }

    private void generateMatchersCsv()
        throws IOException
    {
        final Path src = fs.getPath(MATCHERS_PATH);
        final Path dst = tmpdir.resolve("matchers.csv");

        try (
            final Stream<String> lines = Files.lines(src);
            final BufferedWriter writer = Files.newBufferedWriter(dst,
                StandardOpenOption.CREATE_NEW);
        ) {
            // Throwing below is from throwing-lambdas
            // (https://github.com/fge/throwing-lambdas)
            lines.map(this::toMatchersLine)
                .forEach(Throwing.consumer(writer::write));
        }
    }

    private String toMatchersLine(final String input)
    {
        final List<String> parts = new ArrayList<>();
        parts.add('"' + uuid.toString() + '"');
        Arrays.stream(SEMICOLON.split(input, 4))
            .map(s -> '"' + CSV_ESCAPE.apply(s) + '"')
            .forEach(parts::add);
        return String.join(";", parts) + '\n';
    }

    private void generateNodesCsv()
        throws IOException
    {
        final Path src = fs.getPath(NODES_PATH);
        final Path dst = tmpdir.resolve("nodes.csv");

        try (
            final Stream<String> lines = Files.lines(src);
            final BufferedWriter writer = Files.newBufferedWriter(dst,
                StandardOpenOption.CREATE_NEW);
        ) {
            lines.map(this::toNodesLine)
                .forEach(Throwing.consumer(writer::write));
        }
    }

    private String toNodesLine(final String input)
    {
        final List<String> parts = new ArrayList<>();
        parts.add('"' + uuid.toString() + '"');
        SEMICOLON.splitAsStream(input)
            .map(s -> '"' + CSV_ESCAPE.apply(s) + '"')
            .forEach(parts::add);
        return String.join(";", parts) + '\n';
    }

    private void writeInfo()
        throws IOException
    {
        final Path path = fs.getPath(INFO_PATH);

        try (
            final BufferedReader reader = Files.newBufferedReader(path);
        ) {
            final String[] elements = SEMICOLON.split(reader.readLine());

            final long epoch = Long.parseLong(elements[0]);
            final Instant instant = Instant.ofEpochMilli(epoch);
            final ZoneId zone = ZoneId.systemDefault();
            final LocalDateTime time = LocalDateTime.ofInstant(instant, zone);

            final ParseInfoRecord record = jooq.newRecord(PARSE_INFO);

            record.setId(uuid);
            record.setContent(loadText());
            record.setDate(Timestamp.valueOf(time));

            record.insert();
        }
    }

    private String loadText()
        throws IOException
    {
        final Path path = fs.getPath(INPUT_PATH);

        try (
            final BufferedReader reader = Files.newBufferedReader(path);
        ) {
            return CharStreams.toString(reader);
        }
    }

    private void writeMatchers()
        throws IOException
    {
        final Path path = tmpdir.resolve("matchers.csv");

        try (
            final BufferedReader reader = Files.newBufferedReader(path);
        ) {
            jooq.loadInto(MATCHERS)
                .onErrorAbort()
                .loadCSV(reader)
                .fields(MATCHERS_FIELDS)
                .separator(';')
                .execute();
        }
    }

    private void writeNodes()
        throws IOException
    {
        final Path path = tmpdir.resolve("nodes.csv");

        try (
            final BufferedReader reader = Files.newBufferedReader(path);
        ) {
            jooq.loadInto(NODES)
                .onErrorAbort()
                .loadCSV(reader)
                .fields(NODES_FIELDS)
                .separator(';')
                .execute();
        }
    }

    private void time(final ThrowingRunnable runnable, final String description)
    {
        System.out.println(description + ": start");
        final Stopwatch stopwatch = Stopwatch.createStarted();
        runnable.run();
        System.out.println(description + ": done (" + stopwatch.stop() + ')');
    }

    public static void main(final String... args)
        throws IOException
    {
        if (args.length != 1) {
            System.err.println("missing zip argument");
            System.exit(2);
        }

        final Path zip = Paths.get(args[0]).toRealPath();

        final UUID uuid = UUID.randomUUID();
        final DSLContext jooq = PostgresqlTraceDbFactory.defaultFactory()
            .getJooq();

        try (
            final FileSystem fs = MoreFileSystems.openZip(zip, true);
        ) {
            final Zip2Db2 zip2Db = new Zip2Db2(fs, jooq, uuid);
            try {
                zip2Db.run();
            } finally {
                zip2Db.removeTmpdir();
            }
        }
    }
}

Now, here is my first problem... It is much slower than loading into H2. Here is a timing for a CSV containing 620 matchers and 45746 nodes:

Generate matchers CSV: start
Generate matchers CSV: done (45.26 ms)
Generate nodes CSV: start
Generate nodes CSV: done (573.2 ms)
Write info record: start
Write info record: done (311.1 ms)
Write matchers: start
Write matchers: done (4.192 s)
Write nodes: start
Write nodes: done (22.64 s)

Give or take, and forgetting the part about writing specialized CSVs (see below), that is 25 seconds. Loading this into an on-the-fly, disk-based H2 database takes less than 5 seconds!

The other problem I have is that I have to write dedicated CSVs; it appears that the CSV loading API is not really flexible in what it accepts, and I have, for instance, to turn this line:

328;SequenceMatcher;COMPOSITE;token

into this:

"some-randome-uuid-here";"328";"SequenceMatcher";"COMPOSITE";"token"

But my biggest problem is in fact that this zip is pretty small. For instance, I have a zip with not 620 but 1532 matchers, and not 45746 nodes but more than 34 million nodes; even if we dismiss the CSV generation time (the original nodes CSV is 1.2 GiB), since H2 injection takes 20 minutes, multiplying this by 5 gives a time somewhere north of 1h30mn, which is a lot!

All in all, the process is quite inefficient at the moment...


Now, in the defence of PostgreSQL:

  • constraints on the PostgreSQL instance are much higher than those on the H2 instance: I don't need a UUID in generated zip files;
  • H2 is tuned "insecurely" for writes: jdbc:h2:/path/to/db;LOG=0;LOCK_MODE=0;UNDO_LOG=0;CACHE_SIZE=131072.

Still, this difference in insertion times seems a little excessive, and I am quite sure that it can be better. But I don't know where to start.

Also, I am aware that PostgreSQL has a dedicated mechanism to load from CSVs, but here the CSVs are in a zip file to start with, and I'd really like to avoid having to create a dedicated CSV as I am currently doing... Ideally I'd like to read line by line from the zip directly (which is what I do for H2 injection), transform the line and write into the PostgreSQL schema.

Finally, I am also aware that I currently do not disable constraints on the PostgreSQL schema before insertion; I have yet to try this (will it make a difference?).

So, what do you suggest I do to improve the performance?

The fastest way to do a bulk insert from a CSV file into PostgreSQL is with COPY. The COPY command is optimized for inserting large numbers of rows.

With Java, you can use the COPY implementation provided by the PostgreSQL JDBC driver (CopyManager).

There is a nice small example of how to use it here: how to copy a data from file to PostgreSQL using JDBC?
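
As a rough sketch of what that can look like for the nodes table above, here is a minimal example using the driver's CopyManager; the connection URL, credentials and file path are placeholders, and it assumes a nodes.csv whose columns already match the table's column order:

import java.io.Reader;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.sql.Connection;
import java.sql.DriverManager;

import org.postgresql.PGConnection;
import org.postgresql.copy.CopyManager;

public final class CopyCsvIntoNodes
{
    public static void main(final String... args)
        throws Exception
    {
        // Placeholder connection settings; adjust to your environment
        final String url = "jdbc:postgresql://localhost/traces";
        final Path csv = Paths.get("/tmp/nodes.csv");

        try (
            final Connection conn
                = DriverManager.getConnection(url, "user", "password");
            final Reader reader = Files.newBufferedReader(csv);
        ) {
            final CopyManager copyManager
                = conn.unwrap(PGConnection.class).getCopyAPI();

            // COPY ... FROM STDIN streams the rows straight from the reader,
            // bypassing statement-by-statement insertion
            final long rows = copyManager.copyIn(
                "COPY nodes FROM STDIN WITH (FORMAT csv, DELIMITER ';')",
                reader);

            System.out.println(rows + " rows copied");
        }
    }
}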

If you have a CSV with headers you would want to run a command similar to this:

\COPY mytable FROM '/tmp/mydata.csv' DELIMITER ';' CSV HEADER

Another performance boost when you are adding large amounts of data to an existing table is to drop the indexes, insert the data, and then recreate the indexes.
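
For the nodes table from the schema above, that could look roughly like the following sketch; the index definitions are the ones from the question, and running them through jOOQ's plain-SQL execute() is just one convenient way to issue them:

// Sketch: drop the nodes indexes, run the bulk load, then rebuild the indexes
jooq.execute("drop index if exists nodes_parent_id");
jooq.execute("drop index if exists nodes_indices");

// ... run the bulk load here (COPY, or the jOOQ loader from the question) ...

jooq.execute("create index nodes_parent_id on nodes(parent_id)");
jooq.execute("create index nodes_indices on nodes(start_index, end_index)");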

Here are a couple of measures you can take:

Upgrade to jOOQ 3.6

In jOOQ 3.6, there are two new modes in the Loader API:

  • bulk loading, e.g. via bulkAfter(int);
  • batch loading, e.g. via batchAfter(int).

Using these techniques has been observed to speed up loading significantly, by orders of magnitude. See also this article about JDBC batch loading performance.

Keep UNDO / REDO logs small

You currently load everything in one huge transaction (or you use auto-commit, but that's not good, either). This is bad for large loads, because the database needs to keep track of all the insertions in your insert session to be able to roll them back if needed.

This gets even worse when you're doing that on a live system, where such large loads generate lots of contention.

jOOQ's Loader API allows you to specify the "commit" size via LoaderOptionsStep.commitAfter(int).
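
Putting the jOOQ 3.6 options and the commit size together, the loader inside writeNodes() from the question could look roughly like this sketch; the chunk sizes are arbitrary illustrations, not tuned recommendations:

jooq.loadInto(NODES)
    .onErrorAbort()
    .bulkAfter(100)     // combine 100 rows into a single multi-row INSERT
    .batchAfter(50)     // send 50 of those statements per JDBC batch
    .commitAfter(1000)  // commit periodically instead of in one huge transaction
    .loadCSV(reader)
    .fields(NODES_FIELDS)
    .separator(';')
    .execute();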

Turn off logging and constraints entirely

This is only possible if you're loading stuff offline, but it can drastically speed up loading if you turn off logging entirely in your database (for that table), and if you turn off constraints while loading, turning them on again after the load.

Finally, I am also aware that I currently do not disable constraints on the PostgreSQL schema before insertion; I have yet to try this (will it make a difference?).

Oh yes, it will. Specifically, the unique constraint costs a lot on each single insertion, as it has to be maintained all the time.
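
As a sketch, the unique constraint on nodes could be dropped before the load and re-added afterwards. The constraint name below is hypothetical: the schema in the question lets PostgreSQL generate a name, so look up the actual one (e.g. with \d nodes in psql) before doing this:

// Hypothetical constraint name; verify it against your actual schema first
jooq.execute("alter table nodes drop constraint nodes_parse_info_id_id_key");

// ... run the bulk load here, without the unique constraint in place ...

jooq.execute("alter table nodes add constraint nodes_parse_info_id_id_key "
    + "unique (parse_info_id, id)");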

Operate on more basic char[] manipulation API

This code here:

final List<String> parts = new ArrayList<>();
parts.add('"' + uuid.toString() + '"');
Arrays.stream(SEMICOLON.split(input, 4))
      .map(s -> '"' + CSV_ESCAPE.apply(s) + '"')
      .forEach(parts::add);
return String.join(";", parts) + '\n';

Generates a lot of pressure on your garbage collector as you're implicitly creating, and throwing away, a lot of StringBuilder objects (some background on this can be found in this blog post). Normally, that's fine and shouldn't be optimised prematurely, but in a large batch process, you can certainly gain a couple of percent in speed if you transform the above into something more low-level:

StringBuilder result = new StringBuilder();
result.append('"').append(uuid.toString()).append('"');

for (String s : SEMICOLON.split(input, 4))
    result.append(';').append('"').append(CSV_ESCAPE.apply(s)).append('"');

...

Of course, you can still write the same thing in a functional style, but I've found it way easier to optimise these low-level String operations using classic pre-Java 8 idioms.
