
Processing Huge CSV File using Producer-Consumer Pattern

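I am trying to process an arbitrary CSV file, which can range from 10 records to millions of records. The CSV file has 4 fixed columns (e.g. a, b, c, d) and 2 additional columns (e, f) which come from an external REST API.

My goal is to read all the records from the CSV, call the external REST API for each record to fetch the 2 additional columns, and write the result out as a merged CSV. The output should be the same CSV file with columns (a, b, c, d, e, f).

I implemented this scenario using the Content Enricher pattern from EIP with Spring Integration and got the expected output. However, because I read the CSV file sequentially, this solution only works well for a small number of records. As the number of records increases, the execution time increases too, in O(n) fashion.

I then started implementing the Producer-Consumer design pattern: each record read from the CSV is put into a queue using put(), and multiple consumers read from the same shared queue using the take() method of BlockingQueue. The main program instantiates an ExecutorService with 1 producer and multiple consumers using Executors.newFixedThreadPool(3), but I am facing a couple of challenges:

  1. The take() method never exits. I tried to implement a Poison Pill by adding a terminator object and checking for that same Poison Pill in the consumer loop to break out, but it still never breaks out (I added a sysout in the loop to see whether it ever reaches the Poison Pill, and it does print my statement). Why does it not exit?

  2. The CSV file only keeps the data written by the last consumer thread to execute and overwrites everything written by the other consumers. I am using OpenCSV to read and write the CSV data.

Here is the code I have come up with so far. Can someone please point out where I am going wrong and what could be improved?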

Main program

    BlockingQueue<Account> queue = new ArrayBlockingQueue<>(100);
    AccountProducer readingThread = new AccountProducer(inputFileName, queue);
    //new Thread(readingThread).start();
    ExecutorService producerExecutor = Executors.newFixedThreadPool(1);
    producerExecutor.submit(readingThread);

    AccountConsumer normalizers = new AccountConsumer(outputFileName, queue, accountService );
    ExecutorService consumerExecutor = Executors.newFixedThreadPool(3);

    for (int i = 1; i <= 3; i++) {
        consumerExecutor.submit(normalizers);
    }
    producerExecutor.shutdown();
    consumerExecutor.shutdown();

AccountProducer

public class AccountProducer implements Runnable {
private String inputFileName;
private BlockingQueue<Account> blockingQueue;
private static final String TERMINATOR = "TERMINATOR";

public AccountProducer (String inputFileName, BlockingQueue<Account> blockingQueue) {

    this.inputFileName = inputFileName;
    this.blockingQueue = blockingQueue;
}


@Override
public void run() {

    try (Reader reader = Files.newBufferedReader(Paths.get(inputFileName));) {

        PropertyEditorManager.registerEditor(java.util.Date.class, DateEditor.class);
        ColumnPositionMappingStrategy<Account> strategy = new ColumnPositionMappingStrategy<>();
        strategy.setType(Account.class);
        String[] memberFieldsToBindTo = { "accountId", "accountName", "firstName", "createdOn" };
        strategy.setColumnMapping(memberFieldsToBindTo);

        CsvToBean<Account> csvToBean = new CsvToBeanBuilder<Account>(reader).withMappingStrategy(strategy)
                .withSkipLines(1).withIgnoreLeadingWhiteSpace(true).build();

        Iterator<Account> csvAccountIterator = csvToBean.iterator();

        while (csvAccountIterator.hasNext()) {
            Account account = csvAccountIterator.next();    
            // Checking if the Account Id in the csv is blank / null - If so, we skip the
            // row for processing and hence avoiding API call..
            if (null == account.getAccountId() || account.getAccountId().isEmpty()) {
                continue;
            } else {
                // This call will send the extracted Account Object down the Enricher to get
                // additional details from API
                blockingQueue.put(account);
            }
        }
    } catch (InterruptedException | IOException ex) {
        System.out.println(ex);
    } finally {
        while (true) {
            try {
                Account account = new Account();
                account.setAccountId(TERMINATOR);
                blockingQueue.put(account);
                break;
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
        }
    }
}
}

AccountConsumer

public class AccountConsumer implements Runnable {

private String outputFileLocation;
private BlockingQueue<Account> blockingQueue;
private AccountService accountService;

public AccountConsumer(String outputFileLocation, BlockingQueue<Account> blockingQueue, AccountService accountService) {
    this.blockingQueue = blockingQueue;
    this.outputFileLocation = outputFileLocation;
    this.accountService = accountService;
}

@Override
public void run() {
    List<Account> accounts = new ArrayList<>();

    try {
        while (true) {
            Account account = blockingQueue.poll();
            account = accountService.populateAccount(account);
            accounts.add(account);
        }

    } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
    } catch (Exception ex) {
        System.out.println(ex);
    }
    processOutput(accounts, outputFileLocation);
}

/**
 * The method processOutput simply takes the list of Accounts and writes them to
 * CSV.
 * 
 * @param outputFileName
 * @param accounts
 * @throws Exception
 */
private void processOutput(List<Account> accounts, String outputFileName) {

    System.out.println("List Size is : " + accounts.size());
    // Using try with Resources block to make sure resources are released properly
    try (Writer writer = new FileWriter(outputFileName, true);) {
        StatefulBeanToCsv<Account> beanToCsv = new StatefulBeanToCsvBuilder(writer).build();
        beanToCsv.write(accounts);
    } catch (CsvDataTypeMismatchException | CsvRequiredFieldEmptyException ex) {
        System.out.println(ex);
        //logger.error("Unable to write the output CSV File : " + ex);
        //throw ex;
    } catch (IOException e) {
        e.printStackTrace();
    }
}

}

Here is the Spring Integration XML I am using:

<?xml version="1.0" encoding="UTF-8"?>
<beans:beans xmlns="http://www.springframework.org/schema/integration"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.springframework.org/schema/beans 
http://www.springframework.org/schema/beans/spring-beans.xsd
    http://www.springframework.org/schema/context 
http://www.springframework.org/schema/context/spring-context.xsd
    http://www.springframework.org/schema/task 
http://www.springframework.org/schema/task/spring-task.xsd
    http://www.springframework.org/schema/integration 
http://www.springframework.org/schema/integration/spring-integration.xsd"
xmlns:context="http://www.springframework.org/schema/context"
xmlns:beans="http://www.springframework.org/schema/beans" 
xmlns:task="http://www.springframework.org/schema/task">

<channel id="accountChannel" /> 
<!-- accountOutputChannel is used for putting the Account object on the 
Channel which will then be consumed by accountAPIChannel as Input to the API 
-->
<channel id="accountOutputChannel" />
<!-- accountAPIChannel will take 1 accountId at a time and invoke REST API 
Service to get additional details required to fill the Content Enricher -->
<channel id="accountAPIChannel" />

<!-- accountGateway is the entry point to the utility -->
<gateway id="accountGateway" default-request-timeout="5000"
    default-reply-timeout="5000"
    service-interface="com.epe.service.AccountService"
    default-request-channel="accountChannel">
</gateway>

<!--Content  enricher is used here for enriching an existing message with 
additional data from External API
     This is based on EIP Pattern - Content Enricher -->
<enricher id="enricher" input-channel="accountOutputChannel"
    request-channel="accountAPIChannel">
    <property name="status" expression="payload.status" />
    <property name="statusSetOn" expression="payload.createdOn" />
</enricher>

<beans:bean id="accountService"
    class="com.epe.service.impl.AccountServiceImpl" />

<!-- Below service-activator is used to actually invoke the external REST 
API which will provide the additional fields for enrichment -->
<service-activator id="fetchAdditionalAccountInfoServiceActivator"
    ref="accountInfoService" method="getAdditionalAccountInfoService" 
input-channel="accountAPIChannel"
    />

<!-- accountInfoService is a bean which will be used for fetching 
additional information from REST API Service -->
<beans:bean id="accountInfoService"
    class="com.epe.service.impl.AccountInfoServiceImpl" />

</beans:beans>

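You are using poll() in the code, not take(). The no-argument poll() returns null immediately when the queue is empty instead of waiting.

You should use poll() with a timeout instead, e.g. poll(10, TimeUnit.SECONDS), so that each thread can terminate gracefully.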
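As a rough illustration of the timed poll() combined with a poison pill, here is a minimal sketch. It is not the asker's code: the class name PoisonPillDemo, the String records, the 2-second timeout and the toUpperCase() call (standing in for the REST enrichment) are all assumptions made for the example. Each consumer that sees the pill puts it back so the remaining consumers also stop.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class PoisonPillDemo {
    // Hypothetical sentinel; compared by identity so it cannot collide with real data.
    static final String POISON_PILL = new String("TERMINATOR");

    // Consumer loop: stops on the pill, or when nothing arrives within the timeout.
    static void consume(BlockingQueue<String> queue, List<String> sink)
            throws InterruptedException {
        while (true) {
            String item = queue.poll(2, TimeUnit.SECONDS);
            if (item == null) {
                return; // timed out: assume the producer is gone
            }
            if (item == POISON_PILL) {
                queue.put(item); // hand the pill on to the other consumers
                return;
            }
            sink.add(item.toUpperCase()); // placeholder for the REST enrichment
        }
    }

    // Starts the consumers, produces the records on the calling thread, then waits.
    static List<String> run(List<String> records, int consumers) {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(100);
        List<String> sink = Collections.synchronizedList(new ArrayList<>());
        ExecutorService pool = Executors.newFixedThreadPool(consumers);
        try {
            for (int i = 0; i < consumers; i++) {
                pool.execute(() -> {
                    try {
                        consume(queue, sink);
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt(); // keep the flag set
                    }
                });
            }
            for (String record : records) {
                queue.put(record);
            }
            queue.put(POISON_PILL); // always the last element produced
            pool.shutdown();
            if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
                pool.shutdownNow();
            }
        } catch (InterruptedException e) {
            pool.shutdownNow();
            Thread.currentThread().interrupt();
        }
        return new ArrayList<>(sink);
    }

    public static void main(String[] args) {
        // The order of the results depends on thread scheduling.
        System.out.println(run(Arrays.asList("a", "b", "c"), 3));
    }
}
```

In this sketch the results are collected in one shared list and returned after the pool has terminated, so a single thread could write the output file once. That is one possible way to avoid several consumers writing to the same file; it is a design choice of the example, not something the answer prescribes.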

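But you don't need all this logic. You can achieve all of this using Spring Integration components: an ExecutorChannel and a file outbound channel adapter in append mode.

EDIT

I don't have time to write your whole application, but essentially you need...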

<file:inbound-channel-adapter />
<file:splitter output-channel="execChannel"/>
<int:channel id="execChannel">
    <int:dispatcher task-executor="exec" />
</int:channel>
<int:transformer /> <!-- OpenCSV -->
<int:enricher ... />
<int:transformer /> <!-- OpenCSV -->
<int:resequencer /> <!-- restore order -->
<file:outbound-channel-adapter />

You can read about all of these components in the reference manual.

You might also want to consider using the Java DSL instead of XML. Something like...

@Bean
public IntegrationFlow flow() {
    return IntegrationFlows.from(Files.inboundAdapter(...))
              .split(Files.splitter())
              .channel(MessageChannels.executor(exec()))
              .transform(...) // OpenCSV
              .enrich(...)
              .transform(...) // OpenCSV
              .resequence()   // restore order
              .handle(Files.outboundAdapter(...))
              .get();
}

In AccountProducer

catch (InterruptedException | IOException ex) {
  System.out.println(ex);
 } 

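This is not the proper way to handle InterruptedException. ExecutorService uses interruption for forced shutdown (shutdownNow()), but since you are swallowing the interrupt, the ExecutorService will not respond to a forced shutdown.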
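To make the point concrete, here is a small sketch of a producer that stops and restores its interrupt flag when it is interrupted while blocked in put(). The class InterruptAwareProducer, its stoppedByInterrupt field and the demo() helper are hypothetical names invented for this example; the asker's CSV reading is replaced by a plain list of strings.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicBoolean;

class InterruptAwareProducer implements Runnable {
    private final List<String> records;
    private final BlockingQueue<String> queue;
    // Only here so the demo can observe what happened inside the thread.
    final AtomicBoolean stoppedByInterrupt = new AtomicBoolean(false);

    InterruptAwareProducer(List<String> records, BlockingQueue<String> queue) {
        this.records = records;
        this.queue = queue;
    }

    @Override
    public void run() {
        try {
            for (String record : records) {
                queue.put(record); // blocks while the queue is full
            }
        } catch (InterruptedException e) {
            // put() cleared the flag when it threw. Set it again so code further
            // up the stack (for example a pool worker) still sees the request.
            Thread.currentThread().interrupt();
            stoppedByInterrupt.set(Thread.currentThread().isInterrupted());
        }
    }

    // Interrupts a producer that cannot finish because nobody drains the queue.
    static boolean demo() {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(1);
        InterruptAwareProducer producer =
                new InterruptAwareProducer(Arrays.asList("a", "b", "c"), queue);
        Thread thread = new Thread(producer);
        thread.start();
        thread.interrupt(); // what shutdownNow() does to running workers
        try {
            thread.join(5000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return producer.stoppedByInterrupt.get() && !thread.isAlive();
    }

    public static void main(String[] args) {
        System.out.println("stopped on interrupt: " + demo());
    }
}
```

The queue in demo() has capacity 1 and no consumer, so the producer is guaranteed to be interrupted before it can finish, whether the interrupt arrives before or during a blocking put().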

In AccountConsumer

catch (InterruptedException e) {
 Thread.currentThread().interrupt();
}

This preserves the thread's interrupted status. The loop can be redesigned to check that status and exit:

try {
        while (true) {
            Account account = blockingQueue.poll();
            account = accountService.populateAccount(account);
            accounts.add(account);
            if(Thread.currentThread().isInterrupted()) {
                System.out.println("Thread interrupted and hence exiting...");
                break;
            }
        }
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
    } catch (Exception ex) {
        System.out.println(ex);
    }

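EDIT: Also, calling shutdown() on an ExecutorService does not shut it down immediately.

A good way to shut down an ExecutorService is to follow shutdown() with the awaitTermination() method.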
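One common shape for this is sketched below: shutdown() first, then awaitTermination() with a timeout, and shutdownNow() only as a fallback. The class GracefulShutdown, the helper shutdownAndAwait() and the timeout values are assumptions made for the example, not part of the original code.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class GracefulShutdown {
    // Returns true if the pool terminated within the allowed time.
    static boolean shutdownAndAwait(ExecutorService pool, long timeout, TimeUnit unit) {
        pool.shutdown(); // no new tasks; already submitted tasks still run
        try {
            if (pool.awaitTermination(timeout, unit)) {
                return true;
            }
            pool.shutdownNow(); // interrupt tasks that are still running
            return pool.awaitTermination(timeout, unit);
        } catch (InterruptedException e) {
            pool.shutdownNow();
            Thread.currentThread().interrupt(); // keep the caller's flag set
            return false;
        }
    }

    // Submits ten trivial tasks and reports how many ran before termination.
    static int demo() {
        ExecutorService pool = Executors.newFixedThreadPool(3);
        AtomicInteger completed = new AtomicInteger();
        for (int i = 0; i < 10; i++) {
            pool.execute(completed::incrementAndGet);
        }
        boolean terminated = shutdownAndAwait(pool, 10, TimeUnit.SECONDS);
        return terminated ? completed.get() : -1;
    }

    public static void main(String[] args) {
        System.out.println("tasks completed: " + demo());
    }
}
```

The shutdownNow() fallback only helps if the tasks respond to interruption, which is why swallowing InterruptedException in the producer matters.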
