
Writing to multiple files in a multithreaded way in java

I have an application in Java in which I need to use multithreading. I have a list of IDs, which are the primary keys for tables stored in DynamoDB.

Say the list is:

| ID_1 | ID_2 | ID_3 | ID_4 | ....... | ID_n |

Now I want multiple threads to read these IDs and do the following for each ID:

  1. Each thread should take an ID and query the DynamoDB tables (there are two DynamoDB tables for which the ID is the primary key).

  2. The result of querying each DynamoDB table should be stored in a separate file.

Essentially, Thread_1 should pick up an ID, say ID_1, and query the DynamoDB tables DDB_1 and DDB_2. The result of querying DDB_1 should go in File_1 and the result of querying DDB_2 should go in File_2. This needs to be done for all the threads. Finally, when all threads have completed execution, I should have two files, File_1 and File_2, containing the query results from all the threads.

I have come up with a solution in which all producer threads (the threads that get the query results from DynamoDB) queue their results to a single consumer thread, which writes to a file, say File_1. Similarly, all producer threads write to a second queue, and a second consumer thread writes to File_2.
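The two-queue design described above can be sketched roughly as follows. This is only an illustration under assumptions: `TwoQueueExport`, `queryTable`, and the `DONE` sentinel are hypothetical names, and `queryTable` is a stub standing in for the real DynamoDB query, which is not shown in the question.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

final class TwoQueueExport {
    // Sentinel ("poison pill") telling a writer thread to stop.
    // Compared by identity, so it cannot collide with a real result.
    private static final String DONE = new String("DONE");

    // Hypothetical stand-in for a DynamoDB lookup; replace with a real query.
    static String queryTable(String table, String id) {
        return table + ":" + id;
    }

    // Single consumer: drains one queue into one file until it sees DONE.
    static Thread startWriter(BlockingQueue<String> queue, Path file) {
        Thread t = new Thread(() -> {
            try (BufferedWriter out = Files.newBufferedWriter(file)) {
                for (String line = queue.take(); line != DONE; line = queue.take()) {
                    out.write(line);
                    out.newLine();
                }
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        t.start();
        return t;
    }

    static void export(List<String> ids, Path file1, Path file2, int workers)
            throws InterruptedException {
        BlockingQueue<String> q1 = new LinkedBlockingQueue<>();
        BlockingQueue<String> q2 = new LinkedBlockingQueue<>();
        Thread w1 = startWriter(q1, file1);
        Thread w2 = startWriter(q2, file2);

        // Producers: each task queries both tables for one ID.
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        for (String id : ids) {
            pool.execute(() -> {
                q1.add(queryTable("DDB_1", id));
                q2.add(queryTable("DDB_2", id));
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);

        // All producers are finished, so it is safe to stop the writers.
        q1.add(DONE);
        q2.add(DONE);
        w1.join();
        w2.join();
    }
}
```

Because only one thread ever touches each file, the file writes themselves need no locking; the queues carry all of the cross-thread coordination. The order of lines in each file is not guaranteed.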

Do you feel any flaw in the approach above? Is there a better way to apply multi-threading in this case?

If I understand right, you want 2 threads that each query a DB table and write the results to a file. See below.

APPLICATION
|
|-->THREAD --> DB_1 --> file1
|
|-->THREAD --> DB_2 --> file2

First off, this should be perfectly fine: the threads are not reading and writing the same data, so this is thread-safe. One way to do this is to make a class for each thread (just an example). Do this by implementing Runnable. Then place all the code for connecting to a DB in the run method. Long example: http://www.tutorialspoint.com/java/java_multithreading.htm

Short example

class Thread1 implements Runnable {

    @Override
    public void run() {
        // Connect to the DB, run the query, and write the result to the file here.
    }
}

Call by using

// Runnable has no start() method, so wrap it in a Thread.
Thread t = new Thread(new Thread1());
t.start();

This should work fine as long as you are not editing the IDs while you are reading them in one of these threads.

Using synchronized

This restricts a method to one thread at a time. It is necessary, for example, when several threads write to the same file, because otherwise the threads will interrupt each other.

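// Parameter types are assumed here; the original left them out.
public synchronized void write(String text, File file1, File file2) {
    // Write text to the files here.
}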

Call this like a normal method in your threads. This does NOT guarantee the order in which the threads access the method; in this example it is first come, first served.
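As a rough illustration of the synchronized approach, here is a minimal sketch of a writer object shared by all threads. `SharedFileWriter` and `writeLine` are hypothetical names, and the class wraps any `java.io.Writer` (a `FileWriter`, for instance) instead of taking file arguments on each call.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.io.Writer;

// Hypothetical helper: create one instance per output file and share it
// between all worker threads.
final class SharedFileWriter {
    private final Writer out;

    SharedFileWriter(Writer out) {
        this.out = out;
    }

    // Only one thread at a time can run this method on a given instance,
    // so the text and its line separator are always written together.
    public synchronized void writeLine(String text) {
        try {
            out.write(text);
            out.write(System.lineSeparator());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The lock belongs to the instance, so the threads must share the same `SharedFileWriter` object for the file. Two separate instances wrapping the same file would not exclude each other.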

This is what you want to achieve:

ID_1 -> Thread1 -> Query DB1 ->  ConsumerSingleton -> Write data to File 1
                -> Query DB2 ->  ConsumerSingleton -> Write data to File 2
ID_2 -> Thread2 -> Query DB1 ->  ConsumerSingleton -> Write data to File 1
                -> Query DB2 ->  ConsumerSingleton -> Write data to File 2

ID_3 -> Thread3 -> Query DB1 ->  ConsumerSingleton -> Write data to File 1
                -> Query DB2 ->  ConsumerSingleton -> Write data to File 2
..
..  
ID_N -> ThreadN -> Query DB1 ->  ConsumerSingleton -> Write data to File 1
                -> Query DB2 ->  ConsumerSingleton -> Write data to File 2

Since you are using a single consumer object, you don't have to synchronize the write operations on file1 and file2. However, you do have to synchronize the operation/method through which your threads dump their results into the consumer's collection. You can use a ConcurrentHashMap, which is thread-safe, to collect the results from the different threads in your consumer class.

Also, since you are going to read rows from DB1 and DB2 by unique ID, row-level lock contention should not occur when multiple threads access the tables. If that is not the case and two threads try to read the row with the same ID, contention can happen.
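A minimal sketch of collecting results in a ConcurrentHashMap might look like the following. `ResultCollector`, `collect`, and `queryTable` are hypothetical names, and `queryTable` is a stub in place of the real database lookup.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical consumer that collects results keyed by ID, one map per table.
final class ResultCollector {
    final Map<String, String> fromDb1 = new ConcurrentHashMap<>();
    final Map<String, String> fromDb2 = new ConcurrentHashMap<>();

    // Placeholder for a real database lookup.
    static String queryTable(String table, String id) {
        return table + ":" + id;
    }

    // Queries both tables for every ID on a pool of worker threads.
    static ResultCollector collect(List<String> ids, int workers)
            throws InterruptedException {
        ResultCollector collector = new ResultCollector();
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        for (String id : ids) {
            pool.execute(() -> {
                // put() can be called from many threads without extra locking.
                collector.fromDb1.put(id, queryTable("DB1", id));
                collector.fromDb2.put(id, queryTable("DB2", id));
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
        return collector;
    }
}
```

Once `collect` returns, a single thread can write each map to its file. Note that this variant holds every result in memory until the end, which may not suit a very large ID list.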

Do you feel any flaw in the approach above?

I can't spot one. But of course, I can only comment based on your high-level description of your algorithm. There will be right and wrong ways to implement it.

Is there a better way to apply multi-threading in this case?

It is hard to say. But I can't think of any alternative that is obviously better. There are (no doubt) alternatives, but the only way you could objectively determine which is best [1] would be to implement various alternatives and benchmark them.

Note that the bottlenecks for this application are likely to be:

  • the effective throughput of your DynamoDB queries
  • the rate at which you can write the results to file

(Probably, the former will dominate.) Since both are going to be limited by "external" factors (e.g. disk I/O, networking, load on the database CPUs), you will most likely need to "tune" the number of worker threads you use.
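One way to approach that tuning is to time the same workload with different pool sizes. The sketch below is only an assumption about how such a measurement could be set up: `PoolSizeTuning` and `timeRun` are hypothetical names, and the `task` argument stands for whatever per-ID work (query plus enqueue) the real application does.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

final class PoolSizeTuning {
    // Runs `task` once per ID on a pool of `workers` threads and returns
    // the elapsed wall-clock time in milliseconds.
    static long timeRun(int workers, List<String> ids, Consumer<String> task)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        long start = System.nanoTime();
        for (String id : ids) {
            pool.execute(() -> task.accept(id));
        }
        pool.shutdown();
        if (!pool.awaitTermination(1, TimeUnit.HOURS)) {
            throw new IllegalStateException("workers did not finish in time");
        }
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    }
}
```

Calling `timeRun` with, say, 1, 2, 4, 8, and 16 workers against a representative sample of IDs gives a first idea of where throughput stops improving. Real measurements should be repeated several times, since network and database load vary between runs.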


1 - I assume you mean the one that has the best throughput.
