
Iterate over a directory and extract only file names without reading the payload

I am using Mule 4.4 Community Edition on premise. Thanks to help on an earlier question, I have been able to read and process a large file without exhausting memory, which is all good.

Building on this, my use case is to read all .csv files in a directory and then process them one by one.


So my plan was to list the files in the directory:

<sftp:list doc:name="List" config-ref="SFTP_Config" directoryPath="/opt/out">
    <non-repeatable-iterable />
    <sftp:matcher filenamePattern="#['*.csv' ]"
                  directories="EXCLUDE" symLinks="EXCLUDE" />
</sftp:list>

Then I wanted to read only the file names from the directory, without reading the payload.

An early access article advises using <non-repeatable-iterable /> for this. However, when I try to extract the attributes after the List operation, as the article describes:

<set-payload doc:name="Set Payload"  value="#[output application/json --- payload map $.attributes]"/>

no attributes are available. (My plan is to extract the file names, run a for loop over them, and use a choice router on each name: if the file name contains student, use the student transformer; if it contains teacher, use the teacher transformer; and so on.)

However, as attributes are not available, I am not able to pass file names to the for loop (yet to be written).

So I changed from <non-repeatable-iterable /> to <repeatable-in-memory-iterable />

Code below:

<sftp:list doc:name="List" config-ref="SFTP_Config" directoryPath="/opt/out">
    <repeatable-in-memory-iterable />
    <sftp:matcher filenamePattern="#['*.csv' ]"
                  directories="EXCLUDE" symLinks="EXCLUDE" />
</sftp:list>

With this change, I can extract the file attributes, including the file names.

I am confused about the following:

  1. The files to be processed in this directory will be large (about 700 MB each). Will iterating over the directory with repeatable-in-memory-iterable cause memory issues? (I do not want to read the file contents at this stage, only get the file names.)

Here is the complete flow so far (note that the for loop does not process the files yet; I will plug that in later):

<flow name="employee-process-flow">
    <http:listener doc:name="Listener" config-ref="HTTP_Listener_config" path="/processFiles"/>
    <set-variable value='#[now() as String { format: "ddMMuu" }]' doc:name="Set todays date as ddmmyy" doc:id="c6a91a41-65b1-46df-a720-9c13fe360b6b" variableName="today"/>

    <sftp:list doc:name="List" config-ref="SFTP_Config" directoryPath="/opt/out">
        <repeatable-in-memory-iterable />
        <sftp:matcher filenamePattern="#['*.csv' ]"
            directories="EXCLUDE" symLinks="EXCLUDE" />
    </sftp:list>

    <set-payload doc:name="Set Payload" value="#[output application/json --- payload map $.attributes]"/>
    <foreach doc:name="For Each" >
        <logger level="INFO" doc:name="Logger" message="we are here"/>
    </foreach>
</flow>


The List operation returns a list of messages, and each has a payload and attributes. The content of the files is returned as the payload, in a lazy way, meaning that the file's content is read only if you try to access that element's payload.

It makes sense that if you use <non-repeatable-iterable /> and don't access the payload of each item in the <foreach>, then you should not have any memory issues, because the contents are never read.
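As a rough sketch of that approach, the flow fragment below iterates over the listing directly and routes on the file name without ever referencing the payload. It assumes that <foreach> exposes each listed message's attributes inside the scope and that the SFTP attributes include a fileName field; the two flow-ref targets (student-transformer-flow, teacher-transformer-flow) are hypothetical names standing in for the transformers mentioned in the question.

```xml
<sftp:list doc:name="List" config-ref="SFTP_Config" directoryPath="/opt/out">
    <non-repeatable-iterable />
    <sftp:matcher filenamePattern="#['*.csv' ]"
                  directories="EXCLUDE" symLinks="EXCLUDE" />
</sftp:list>

<!-- Iterate over the listed messages; only attributes are referenced here,
     so the lazily loaded file content should not be read at this point. -->
<foreach doc:name="For Each">
    <choice doc:name="Route by file name">
        <!-- Assumption: attributes.fileName holds the name of the listed file. -->
        <when expression="#[attributes.fileName contains 'student']">
            <!-- Hypothetical flow name -->
            <flow-ref doc:name="Student" name="student-transformer-flow" />
        </when>
        <when expression="#[attributes.fileName contains 'teacher']">
            <!-- Hypothetical flow name -->
            <flow-ref doc:name="Teacher" name="teacher-transformer-flow" />
        </when>
        <otherwise>
            <logger level="WARN" doc:name="Logger"
                    message="#['No transformer for ' ++ attributes.fileName]" />
        </otherwise>
    </choice>
</foreach>
```

Because a non-repeatable iterable can be consumed only once, this sketch places the <foreach> immediately after the List operation with no intermediate transformation.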

With in-memory repeatable streaming, it is possible that the entire payload is read into memory. To check, try it with a file a few gigabytes in size and see what happens.

I'm not sure what the problem is with the attributes. It should work the same in any streaming mode.

Note that if you plan on doing something with the attributes other than printing them, then you should output to application/java instead of JSON, to avoid unneeded conversions to and from JSON. For example, in your flow the output is used as input for the <foreach>, so it would be better for it to be Java.

Example: output application/java --- payload map $.attributes
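In the flow from the question, that change could look roughly like the fragment below. As a variation, it maps straight to the file names rather than the whole attributes object, assuming the attributes expose a fileName field, so each <foreach> iteration receives a plain string as its payload.

```xml
<!-- Keep the list as Java objects; no JSON serialization round trip. -->
<!-- Assumption: each listed message's attributes expose fileName. -->
<set-payload doc:name="Set Payload"
             value="#[output application/java --- payload map $.attributes.fileName]" />

<foreach doc:name="For Each">
    <!-- Inside the scope, payload is a single file name. -->
    <logger level="INFO" doc:name="Logger"
            message="#['Found file: ' ++ payload]" />
</foreach>
```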
