
GC problems when inserting large CSV file with Apache Camel

I have a large CSV file containing about 5.2M lines. I want to parse this file and insert the data into a database. I am using Apache Camel for this.

The route is fairly simple (simplified for this example):

from("file:work/customer/").id("customerDataRoute")
.split(body().tokenize("\n")).streaming()
.parallelProcessing()
.unmarshal(bindyCustomer)
.split(body())
.process(new CustomerProcessor())
.to("sql:INSERT INTO CUSTOMER_DATA(`FIELD1`,`FIELD2`) VALUES(#,#)");

bindyCustomer is a BindyCsvDataFormat for the CSV file and CustomerProcessor is a Processor that returns the data of the Bindy Customer object as an array of objects for the SQL insert. The actual object has 39 fields (simplified above).
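
For reference, a minimal sketch of what this simplified setup could look like (the class name, field names and getters below are assumptions for illustration; the real record has 39 fields):

import org.apache.camel.Exchange;
import org.apache.camel.Processor;
import org.apache.camel.dataformat.bindy.annotation.CsvRecord;
import org.apache.camel.dataformat.bindy.annotation.DataField;

// Hypothetical simplified Bindy record (in its own file), two fields instead of the real 39
@CsvRecord(separator = ",")
public class Customer {
    @DataField(pos = 1)
    private String field1;

    @DataField(pos = 2)
    private String field2;

    public String getField1() { return field1; }
    public String getField2() { return field2; }
}

// Turns one unmarshalled Customer into the Object[] that feeds the two '#' placeholders of the SQL endpoint
public class CustomerProcessor implements Processor {
    @Override
    public void process(Exchange exchange) throws Exception {
        Customer customer = exchange.getIn().getBody(Customer.class);
        exchange.getIn().setBody(new Object[] { customer.getField1(), customer.getField2() });
    }
}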

This works fine for the first 800,000 to 1,000,000 lines, but then it comes to a halt.

I have monitored the Camel instance with JVisualVM and the Visual GC plugin, and I can see that the old generation fills up; when it reaches its maximum, the whole system comes to a halt but does not crash. At this point the old generation is full, the Eden space is nearly full and both survivor spaces are empty (because nothing can be promoted to the old generation anymore, I guess).

So what is wrong here? This looks like a memory leak in the Camel SQL component to me. The data is mainly stored in ConcurrentHashMap objects.

When I take out the SQL component the old generation hardly fills at all.

I am using Camel 2.15.1. I will try 2.17.1 to see whether that fixes the problem.

Update: I have tried Camel 2.17.1 (same problem) and I have also tried to do the inserts in plain Java using java.sql.Statement.executeUpdate. With this option I managed to insert about 2.6 million rows, but then it stopped as well. The funny thing is that I don't get a memory error; it just comes to a halt.
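
For illustration, a plain-JDBC insert of this kind could look roughly like the sketch below (the connection URL, credentials and batch size are placeholders, and it uses PreparedStatement batching rather than one Statement.executeUpdate call per row):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class PlainJdbcInsert {

    // Inserts already-parsed rows with plain JDBC; URL, user and password are placeholders
    public static void insert(Iterable<Object[]> rows) throws Exception {
        try (Connection con = DriverManager.getConnection("jdbc:mysql://localhost/customers", "user", "pass");
             PreparedStatement ps = con.prepareStatement(
                     "INSERT INTO CUSTOMER_DATA(`FIELD1`,`FIELD2`) VALUES(?,?)")) {
            con.setAutoCommit(false);
            int count = 0;
            for (Object[] row : rows) {
                ps.setObject(1, row[0]);
                ps.setObject(2, row[1]);
                ps.addBatch();
                if (++count % 1000 == 0) { // flush every 1000 rows to keep memory flat
                    ps.executeBatch();
                    con.commit();
                }
            }
            ps.executeBatch(); // flush the remaining rows
            con.commit();
        }
    }
}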

I didn't test your code; however, I did notice that your second split statement is not streaming. I recommend trying that. If you have too many parallel streams of work, the heap can fill up before you release the resources, which locks you up. The time the SQL statements take is probably what allows so much garbage to accumulate, since you are parallelizing the main processing.

from("file:work/customer/").id("customerDataRoute")
    .split(body().tokenize("\n")).streaming().parallelProcessing()
        .unmarshal(bindyCustomer)
        .split(body()).streaming() //Add a streaming call here and see what happens
            .process(new CustomerProcessor())
            .to("sql:INSERT INTO CUSTOMER_DATA(`FIELD1`,`FIELD2`) VALUES(#,#)");

Okay, I figured out what went wrong here. Basically the reading part was too fast compared to the inserting part. The example was a bit oversimplified: there was a SEDA queue between reading and inserting (because I had to do a choice on the content, which isn't shown in the example). But even without the SEDA queue it never finished. I realized what was wrong when I killed Camel and got a message that there were still several thousand in-flight messages.

So there is no point in doing the reading with parallel processing when the inserting side can't keep up.

from("file:work/customer/").id("customerDataRoute")
        .onCompletion().log("Customer data  processing finished").end()
        .log("Processing customer data ${file:name}")
        .split(body().tokenize("\n")).streaming() //no more parallel processing
        .choice()
            .when(simple("${body} contains 'HEADER TEXT'")) //strip out the header if it exists
            .log("Skipping first line")
            .endChoice()
        .otherwise()
            .to("seda:processCustomer?size=40&concurrentConsumers=20&blockWhenFull=true")
            .endChoice();


from("seda:processCustomer?size=40&concurrentConsumers=20&blockWhenFull=true")
            .unmarshal(bindyCustomer)
            .split(body())
            .process(new CustomerProcessor()).id("CustomProcessor") //converts one Notification into an array of values for the SQL insert
.to("sql:INSERT INTO CUSTOMER_DATA(`FIELD1`,`FIELD2`) VALUES(#,#)");

I defined a size on the SEDA queue (by default it is not limited) and made the calling thread block when the queue is full.

seda:processCustomer?size=40&concurrentConsumers=20&blockWhenFull=true

The parallel processing is done by using 20 concurrent consumers on the SEDA queue. Please note that, for whatever reason, you have to specify the queue size when you call the route as well (not only where you define it).

Now the memory consumption is minimal and it inserts the 5 million records without problem.
