How to read huge CSV file in Mule

I'm using Mule Studio 3.4.0 Community Edition. I have a big problem with parsing a large CSV file arriving through a File endpoint. The scenario is that I have 3 CSV files and I would like to put the files' content into a database. But when I try to load a huge file (about 144MB) I get an "OutOfMemory" exception. As a solution I thought of dividing/splitting the large CSV into smaller CSVs (I don't know if this solution is the best), or of trying to find a way to process the CSV without throwing an exception.

<file:connector name="File" autoDelete="true" streaming="true" validateConnections="true" doc:name="File"/>

<flow name="CsvToFile" doc:name="CsvToFile">
        <file:inbound-endpoint path="src/main/resources/inbox" moveToDirectory="src/main/resources/processed"  responseTimeout="10000" doc:name="CSV" connector-ref="File">
            <file:filename-wildcard-filter pattern="*.csv" caseSensitive="true"/>
        </file:inbound-endpoint>
        <component class="it.aizoon.grpBuyer.AddMessageProperty" doc:name="Add Message Property"/>
        <choice doc:name="Choice">
            <when expression="INVOCATION:nome_file=azienda" evaluator="header">
                <jdbc-ee:csv-to-maps-transformer delimiter="," mappingFile="src/main/resources/companies-csv-format.xml" ignoreFirstRecord="true" doc:name="CSV2Azienda"/>
                <jdbc-ee:outbound-endpoint exchange-pattern="one-way" queryKey="InsertAziende" queryTimeout="-1" connector-ref="jdbcConnector" doc:name="Database Azienda">
                    <jdbc-ee:query key="InsertAziende" value="INSERT INTO aw006_azienda VALUES (#[map-payload:AW006_ID], #[map-payload:AW006_ID_CLIENTE], #[map-payload:AW006_RAGIONE_SOCIALE])"/>
                </jdbc-ee:outbound-endpoint>
            </when>
            <when expression="INVOCATION:nome_file=servizi" evaluator="header">
                <jdbc-ee:csv-to-maps-transformer delimiter="," mappingFile="src/main/resources/services-csv-format.xml" ignoreFirstRecord="true" doc:name="CSV2Servizi"/>
                <jdbc-ee:outbound-endpoint exchange-pattern="one-way" queryKey="InsertServizi" queryTimeout="-1" connector-ref="jdbcConnector" doc:name="Database Servizi">
                    <jdbc-ee:query key="InsertServizi" value="INSERT INTO ctrl_aemd_unb_servizi VALUES (#[map-payload:CTRL_ID_TIPO_OPERAZIONE], #[map-payload:CTRL_DESCRIZIONE], #[map-payload:CTRL_COD_SERVIZIO])"/>
                </jdbc-ee:outbound-endpoint>
            </when>
            <when expression="INVOCATION:nome_file=richiesta" evaluator="header">
                <jdbc-ee:csv-to-maps-transformer delimiter="," mappingFile="src/main/resources/requests-csv-format.xml" ignoreFirstRecord="true" doc:name="CSV2Richiesta"/>
                <jdbc-ee:outbound-endpoint exchange-pattern="one-way" queryKey="InsertRichieste" queryTimeout="-1" connector-ref="jdbcConnector" doc:name="Database Richiesta">
                    <jdbc-ee:query key="InsertRichieste" value="INSERT INTO ctrl_aemd_unb_richiesta VALUES (#[map-payload:CTRL_ID_CONTROLLER], #[map-payload:CTRL_NUM_RICH_VENDITORE], #[map-payload:CTRL_VENDITORE], #[map-payload:CTRL_CANALE_VENDITORE], #[map-payload:CTRL_CODICE_SERVIZIO], #[map-payload:CTRL_STATO_AVANZ_SERVIZIO], #[map-payload:CTRL_DATA_INSERIMENTO])"/>
                </jdbc-ee:outbound-endpoint>
            </when>
        </choice>   
    </flow>

Please, I do not know how to fix this problem. Thanks in advance for any kind of help.

As SteveS said, the csv-to-maps-transformer might try to load the entire file into memory before processing it. What you can try to do is split the CSV file into smaller parts and send those parts to VM to be processed individually. First, create a component to achieve this first step:

import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;

import org.mule.api.MuleEventContext;
import org.mule.api.client.MuleClient;
import org.mule.api.lifecycle.Callable;

public class CSVReader implements Callable {

    @Override
    public Object onCall(MuleEventContext eventContext) throws Exception {

        // With streaming="true" on the file connector the payload arrives as a stream,
        // so the file is never fully loaded into memory.
        InputStream fileStream = (InputStream) eventContext.getMessage().getPayload();
        BufferedReader br = new BufferedReader(new InputStreamReader(fileStream));

        MuleClient muleClient = eventContext.getMuleContext().getClient();

        try {
            // Dispatch every line to the VM queue to be processed individually.
            // The third argument is a map of message properties; it can be used to
            // forward routing information such as the nome_file property.
            String line;
            while ((line = br.readLine()) != null) {
                muleClient.dispatch("vm://in", line, null);
            }
        } finally {
            br.close();
        }
        return null;
    }
}

Then, split your main flow in two

<file:connector name="File" 
    workDirectory="yourWorkDirPath" autoDelete="false" streaming="true"/>

<flow name="CsvToFile" doc:name="Split and dispatch">
    <file:inbound-endpoint path="inboxPath"
        moveToDirectory="processedPath" pollingFrequency="60000"
        doc:name="CSV" connector-ref="File">
        <file:filename-wildcard-filter pattern="*.csv"
            caseSensitive="true" />
    </file:inbound-endpoint>
    <component class="it.aizoon.grpBuyer.AddMessageProperty" doc:name="Add Message Property" />
    <component class="com.dgonza.CSVReader" doc:name="Split the file and dispatch every line to VM" />
</flow>

<flow name="storeInDatabase" doc:name="receive lines and store in database">
    <vm:inbound-endpoint exchange-pattern="one-way"
        path="in" doc:name="VM" />
    <choice>
        .
        .
        Your JDBC Stuff
        .
        .
    </choice>
</flow>
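
For illustration, here is a minimal sketch of how the choice block above might be filled in for the azienda file, reusing the transformer and query from the question. It rests on two assumptions: the nome_file value is forwarded in the properties map of muleClient.dispatch (invocation properties do not cross the VM transport, so it shows up as an inbound property on this side), and ignoreFirstRecord is dropped because each VM message now carries a single line, which means the header row is better skipped inside the CSVReader component:

<choice doc:name="Choice">
    <when expression="INBOUND:nome_file=azienda" evaluator="header">
        <jdbc-ee:csv-to-maps-transformer delimiter="," mappingFile="src/main/resources/companies-csv-format.xml" doc:name="CSV2Azienda"/>
        <jdbc-ee:outbound-endpoint exchange-pattern="one-way" queryKey="InsertAziende" queryTimeout="-1" connector-ref="jdbcConnector" doc:name="Database Azienda">
            <jdbc-ee:query key="InsertAziende" value="INSERT INTO aw006_azienda VALUES (#[map-payload:AW006_ID], #[map-payload:AW006_ID_CLIENTE], #[map-payload:AW006_RAGIONE_SOCIALE])"/>
        </jdbc-ee:outbound-endpoint>
    </when>
    <!-- the servizi and richiesta branches follow the same pattern -->
</choice>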

Maintain your current file-connector configuration to enable streaming. With this solution the CSV data can be processed without the need to load the entire file into memory first. HTH

I believe that the csv-to-maps-transformer is going to force the whole file into memory. Since you are dealing with one large file, personally, I would tend to just write a Java class to handle it. The File endpoint will pass a file stream to your custom transformer. You can then make a JDBC connection and pick off the information a row at a time without having to load the whole file. I have used OpenCSV to parse the CSV for me. So your Java class would contain something like the following:

protected Object doTransform(Object src, String enc) throws TransformerException {

    try {
        // Make a JDBC connection here

        // Now read and parse the CSV. The file endpoint hands the file over as a
        // stream/reader; depending on the connector configuration the payload may be
        // an InputStream that has to be wrapped in an InputStreamReader instead of
        // being cast to a FileReader.
        FileReader csvFileData = (FileReader) src;

        BufferedReader br = new BufferedReader(csvFileData);
        CSVReader reader = new CSVReader(br);

        // Read the CSV file one row at a time and add the row to the appropriate List(s)
        String[] nextLine;
        while ((nextLine = reader.readNext()) != null) {
            // Push your data into the database through your JDBC connection
        }

        // Close the reader and the JDBC connection
        reader.close();
    } catch (Exception e) {
        // Don't swallow errors: let Mule report the failed transformation
        throw new TransformerException(this, e);
    }
    return null;
}
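
To wire this in, a class whose doTransform method looks like the above would typically extend Mule's AbstractTransformer and be referenced from the flow with a custom-transformer element. A minimal sketch, using a hypothetical class name:

<flow name="CsvToDatabase" doc:name="CsvToDatabase">
    <file:inbound-endpoint path="src/main/resources/inbox" moveToDirectory="src/main/resources/processed" connector-ref="File" doc:name="CSV">
        <file:filename-wildcard-filter pattern="*.csv" caseSensitive="true"/>
    </file:inbound-endpoint>
    <custom-transformer class="com.example.CsvToDatabaseTransformer" doc:name="CSV to DB"/>
</flow>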
