DynamoDB Parallel Scan - Java Synchronization

I'm trying to use the DynamoDB Parallel Scan Example:

http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/LowLevelJavaScanning.html

I have 200,000 items. I took the sequential scan code and modified it slightly for my usage:

Map<String, AttributeValue> lastKeyEvaluated = null;
// Declared outside the loop so the count accumulates across result pages
int counter = 0;
do
{
    ScanRequest scanRequest = new ScanRequest()
    .withTableName(tableName)
    .withExclusiveStartKey(lastKeyEvaluated);

    ScanResult result = client.scan(scanRequest);

    for(Map<String, AttributeValue> item : result.getItems())
    {
        itemSerialize.add("Set:"+counter);
        for (Map.Entry<String, AttributeValue> getItem : item.entrySet()) 
        {
            String attributeName = getItem.getKey();
            AttributeValue value = getItem.getValue();

            itemSerialize.add(attributeName
                    + (value.getS() == null ? "" : ":" + value.getS())
                    + (value.getN() == null ? "" : ":" + value.getN())
                    + (value.getB() == null ? "" : ":" + value.getB())
                    + (value.getSS() == null ? "" : ":" + value.getSS())
                    + (value.getNS() == null ? "" : ":" + value.getNS())
                    + (value.getBS() == null ? "" : ":" + value.getBS()));
        }
        counter += 1;
    }

    lastKeyEvaluated = result.getLastEvaluatedKey();
}
while(lastKeyEvaluated != null);

The counter reads exactly 200,000 when this code has finished. However, I also wanted to try the parallel scan.

Function Call:

ScanSegmentTask task = null;
ArrayList<String> list = new ArrayList<String>();
try
{
    ExecutorService executor = Executors.newFixedThreadPool(numberOfThreads);
    int totalSegments = numberOfThreads;

    for (int segment = 0; segment < totalSegments; segment++) 
    {
        // Runnable task that will only scan one segment
        task = new ScanSegmentTask(tableName, itemLimit, totalSegments, segment, list);

        // Execute the task
        executor.execute(task);
    }
    shutDownExecutorService(executor);
}
.......Catches something if error
return list;

Class:

I have a static list that is shared with all the threads. I was able to retrieve the list and output the amount of data.

// Runnable task for scanning a single segment of a DynamoDB table
private static class ScanSegmentTask implements Runnable 
{

    // DynamoDB table to scan
    private String tableName;

    // number of items each scan request should return
    private int itemLimit;

    // Total number of segments
    // Equal to the total number of threads scanning the table in parallel
    private int totalSegments;

    // Segment that will be scanned by this task
    private int segment;

    static ArrayList<String> list_2;

    Object lock = new Object();

    public ScanSegmentTask(String tableName, int itemLimit, int totalSegments, int segment, ArrayList<String> list) 
    {
        this.tableName = tableName;
        this.itemLimit = itemLimit;
        this.totalSegments = totalSegments;
        this.segment = segment;
        list_2 = list;
    }

    public void run() 
    {
        System.out.println("Scanning " + tableName + " segment " + segment + " out of " + totalSegments + " segments " + itemLimit + " items at a time...");
        Map<String, AttributeValue> exclusiveStartKey = null;
        int totalScannedItemCount = 0;
        int totalScanRequestCount = 0;
        int counter = 0;

        try 
        {
            while(true) 
            {
                ScanRequest scanRequest = new ScanRequest()
                    .withTableName(tableName)
                    .withLimit(itemLimit)
                    .withExclusiveStartKey(exclusiveStartKey)
                    .withTotalSegments(totalSegments)
                    .withSegment(segment);

                ScanResult result = client.scan(scanRequest);

                totalScanRequestCount++;
                totalScannedItemCount += result.getScannedCount();

                synchronized(lock)
                {
                    for(Map<String, AttributeValue> item : result.getItems())
                    {
                        list_2.add("Set:"+counter);
                        for (Map.Entry<String, AttributeValue> getItem : item.entrySet()) 
                        {
                            String attributeName = getItem.getKey();
                            AttributeValue value = getItem.getValue();

                            list_2.add(attributeName
                                    + (value.getS() == null ? "" : ":" + value.getS())
                                    + (value.getN() == null ? "" : ":" + value.getN())
                                    + (value.getB() == null ? "" : ":" + value.getB())
                                    + (value.getSS() == null ? "" : ":" + value.getSS())
                                    + (value.getNS() == null ? "" : ":" + value.getNS())
                                    + (value.getBS() == null ? "" : ":" + value.getBS()));
                        }
                        counter += 1;
                    }
                }

                exclusiveStartKey = result.getLastEvaluatedKey();
                if (exclusiveStartKey == null) 
                {
                    break;
                }
            }
        } 
        catch (AmazonServiceException ase) 
        {
            System.err.println(ase.getMessage());
        } 
        finally 
        {
            System.out.println("Scanned " + totalScannedItemCount + " items from segment " + segment + " out of " + totalSegments + " of " + tableName + " with " + totalScanRequestCount + " scan requests");
        }
    }
}

Executor Service Shut Down:

public static void shutDownExecutorService(ExecutorService executor) 
{
    executor.shutdown();
    try 
    {
        if (!executor.awaitTermination(10, TimeUnit.SECONDS)) 
        {
            executor.shutdownNow();
        }
    } 
    catch (InterruptedException e) 
    {
        executor.shutdownNow();
        Thread.currentThread().interrupt();
    }
}

However, the number of items changes every time I run this piece of code (it varies around 60,000 in total, 6,000 per thread, with 10 threads created). Removing the synchronization does not change the result either.

Is there a bug in the synchronization or in the Amazon AWS API?

Thanks, all.

EDIT:

The new function call:

ScanSegmentTask task = null;
ArrayList<String> list = new ArrayList<String>();

try
{
    ExecutorService executor = Executors.newFixedThreadPool(numberOfThreads);
    int totalSegments = numberOfThreads;

    for (int segment = 0; segment < totalSegments; segment++) 
    {
        // Runnable task that will only scan one segment
        task = new ScanSegmentTask(tableName, itemLimit, totalSegments, segment);

        // Execute the task
        Future<ArrayList<String>> future = executor.submit(task);

        list.addAll(future.get());
    }
    shutDownExecutorService(executor);
}

The new class:

// Callable task for scanning a single segment of a DynamoDB table
private static class ScanSegmentTask implements Callable<ArrayList<String>>
{

    // DynamoDB table to scan
    private String tableName;

    // number of items each scan request should return
    private int itemLimit;

    // Total number of segments
    // Equal to the total number of threads scanning the table in parallel
    private int totalSegments;

    // Segment that will be scanned by this task
    private int segment;

    ArrayList<String> list_2 = new ArrayList<String>();

    static int counter = 0;

    public ScanSegmentTask(String tableName, int itemLimit, int totalSegments, int segment)
    {
        this.tableName = tableName;
        this.itemLimit = itemLimit;
        this.totalSegments = totalSegments;
        this.segment = segment;
    }

    @SuppressWarnings("finally")
    public ArrayList<String> call() 
    {
        System.out.println("Scanning " + tableName + " segment " + segment + " out of " + totalSegments + " segments " + itemLimit + " items at a time...");
        Map<String, AttributeValue> exclusiveStartKey = null;

        try 
        {
            while(true) 
            {
                ScanRequest scanRequest = new ScanRequest()
                    .withTableName(tableName)
                    .withLimit(itemLimit)
                    .withExclusiveStartKey(exclusiveStartKey)
                    .withTotalSegments(totalSegments)
                    .withSegment(segment);

                ScanResult result = client.scan(scanRequest);

                for(Map<String, AttributeValue> item : result.getItems())
                {
                    list_2.add("Set:"+counter);
                    for (Map.Entry<String, AttributeValue> getItem : item.entrySet()) 
                    {
                        String attributeName = getItem.getKey();
                        AttributeValue value = getItem.getValue();

                        list_2.add(attributeName
                                + (value.getS() == null ? "" : ":" + value.getS())
                                + (value.getN() == null ? "" : ":" + value.getN())
                                + (value.getB() == null ? "" : ":" + value.getB())
                                + (value.getSS() == null ? "" : ":" + value.getSS())
                                + (value.getNS() == null ? "" : ":" + value.getNS())
                                + (value.getBS() == null ? "" : ":" + value.getBS()));
                    }
                    counter += 1;
                }

                exclusiveStartKey = result.getLastEvaluatedKey();
                if (exclusiveStartKey == null) 
                {
                    break;
                }
            }
        } 
        catch (AmazonServiceException ase) 
        {
            System.err.println(ase.getMessage());
        } 
        finally 
        {
            return list_2;
        }
    }
}

Final EDIT:

Function Call:

ScanSegmentTask task = null;
ArrayList<String> list = new ArrayList<String>();
ArrayList<Future<ArrayList<String>>> holdFuture = new ArrayList<Future<ArrayList<String>>>();

try
{
    ExecutorService executor = Executors.newFixedThreadPool(numberOfThreads);
    int totalSegments = numberOfThreads;

    for (int segment = 0; segment < totalSegments; segment++) 
    {
        // Runnable task that will only scan one segment
        task = new ScanSegmentTask(tableName, itemLimit, totalSegments, segment);

        // Execute the task
        Future<ArrayList<String>> future = executor.submit(task);
        holdFuture.add(future);
    }

    for (int i = 0 ; i < holdFuture.size(); i++)
    {
        boolean flag = false;
        while(flag == false)
        {
            Thread.sleep(1000);
            if(holdFuture.get(i).isDone())
            {
                list.addAll(holdFuture.get(i).get());
                flag = true;
            }
        }
    }
    shutDownExecutorService(executor);
}

Class:

private static class ScanSegmentTask implements Callable<ArrayList<String>>
{

    // DynamoDB table to scan
    private String tableName;

    // number of items each scan request should return
    private int itemLimit;

    // Total number of segments
    // Equal to the total number of threads scanning the table in parallel
    private int totalSegments;

    // Segment that will be scanned by this task
    private int segment;

    ArrayList<String> list_2 = new ArrayList<String>();

    static AtomicInteger counter = new AtomicInteger(0);

    public ScanSegmentTask(String tableName, int itemLimit, int totalSegments, int segment)
    {
        this.tableName = tableName;
        this.itemLimit = itemLimit;
        this.totalSegments = totalSegments;
        this.segment = segment;
    }

    @SuppressWarnings("finally")
    public ArrayList<String> call() 
    {
        System.out.println("Scanning " + tableName + " segment " + segment + " out of " + totalSegments + " segments " + itemLimit + " items at a time...");
        Map<String, AttributeValue> exclusiveStartKey = null;

        try 
        {
            while(true) 
            {
                ScanRequest scanRequest = new ScanRequest()
                    .withTableName(tableName)
                    .withLimit(itemLimit)
                    .withExclusiveStartKey(exclusiveStartKey)
                    .withTotalSegments(totalSegments)
                    .withSegment(segment);

                ScanResult result = client.scan(scanRequest);

                for(Map<String, AttributeValue> item : result.getItems())
                {
                    list_2.add("Set:"+counter);
                    for (Map.Entry<String, AttributeValue> getItem : item.entrySet()) 
                    {
                        String attributeName = getItem.getKey();
                        AttributeValue value = getItem.getValue();

                        list_2.add(attributeName
                                + (value.getS() == null ? "" : ":" + value.getS())
                                + (value.getN() == null ? "" : ":" + value.getN())
                                + (value.getB() == null ? "" : ":" + value.getB())
                                + (value.getSS() == null ? "" : ":" + value.getSS())
                                + (value.getNS() == null ? "" : ":" + value.getNS())
                                + (value.getBS() == null ? "" : ":" + value.getBS()));
                    }
                    counter.addAndGet(1);
                }

                exclusiveStartKey = result.getLastEvaluatedKey();
                if (exclusiveStartKey == null) 
                {
                    break;
                }
            }
        } 
        catch (AmazonServiceException ase) 
        {
            System.err.println(ase.getMessage());
        } 
        finally 
        {
            return list_2;
        }
    }
}

OK, I believe the issue is in the way you synchronized.

In your case, the lock is pretty much pointless: lock is an instance field, so each task has its own lock, and synchronizing on it never blocks another thread from running the same piece of code. I believe this is why removing the synchronization does not change the result: it never had an effect in the first place.

I believe your issue is in fact due to the static ArrayList<String> that's shared by your threads. ArrayList is not thread-safe, so concurrent operations on it are not guaranteed to succeed; as a result, you have to synchronize access to it. Without proper synchronization, two threads could each add something to an empty ArrayList and yet leave the resulting ArrayList with a size of 1! (Or at least that is how I remember it; I believe this is the case for non-thread-safe objects in general.)

As I said before, while you do have a synchronized block, it really isn't doing anything. You could synchronize on list_2 instead, but all that would do is effectively make your threads run in sequence, as the lock on the ArrayList wouldn't be released until one of your threads was done with it.
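To make the locking point concrete, here is a minimal sketch of a lock that really is shared, assuming plain threads and generated strings in place of the DynamoDB scan. The names SharedLockDemo, SHARED_LOCK, and fill are hypothetical and do not appear in the original code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; it stands in for the scan tasks and makes no AWS calls.
class SharedLockDemo {
    // static, so every worker synchronizes on the same object.
    // An instance field would give each task its own lock, which guards nothing.
    private static final Object SHARED_LOCK = new Object();

    // Starts `threads` workers that each add `perThread` strings to one shared list.
    static int fill(int threads, int perThread) {
        List<String> shared = new ArrayList<>();
        List<Thread> workers = new ArrayList<>();
        for (int t = 0; t < threads; t++) {
            final int id = t;
            Thread worker = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    // The lock is held for a single add only, so workers can interleave.
                    synchronized (SHARED_LOCK) {
                        shared.add("t" + id + ":" + i);
                    }
                }
            });
            workers.add(worker);
            worker.start();
        }
        try {
            for (Thread worker : workers) {
                worker.join(); // wait for every worker before reading the list
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return shared.size();
    }

    public static void main(String[] args) {
        System.out.println(fill(10, 1000)); // prints 10000
    }
}
```

Holding the lock around each add, rather than around a whole page of results, keeps the critical section short, although every add still pays for acquiring the lock.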

There are a few solutions to this. You can use Collections.synchronizedList(list_2) to create a synchronized wrapper around your ArrayList. This way, adding to the list is guaranteed to succeed. However, this incurs a synchronization cost on every operation, and so isn't ideal.
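A small sketch of the synchronized-wrapper option, assuming generated strings instead of scan results. SynchronizedListDemo and fill are hypothetical names; Collections.synchronizedList is the standard java.util method.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical demo class; the workers add generated strings, not DynamoDB items.
class SynchronizedListDemo {
    static int fill(int threads, int perThread) {
        // Every add() on the wrapper locks internally, so no explicit synchronized block is needed.
        List<String> shared = Collections.synchronizedList(new ArrayList<String>());
        ExecutorService executor = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            final int id = t;
            executor.execute(() -> {
                for (int i = 0; i < perThread; i++) {
                    shared.add("t" + id + ":" + i);
                }
            });
        }
        executor.shutdown();
        try {
            // Wait for the workers to finish before reading the size.
            if (!executor.awaitTermination(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("workers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return shared.size();
    }

    public static void main(String[] args) {
        System.out.println(fill(10, 1000)); // prints 10000
    }
}
```

Note that only individual calls are atomic here; iterating over the wrapped list while other threads are still writing would need an explicit synchronized block on the list.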

What I would do is actually have ScanSegmentTask implement Callable (technically Callable<ArrayList<String>>). The Callable interface is almost exactly like the Runnable interface, except that its method is call(), which returns a value.

Why is this important? I think that what would produce the best results for you is this:

  1. Make list_2 an instance variable, initialized to an empty list
  2. Have each thread add to this list exactly as you have done
  3. Return list_2 when you are done
  4. Concatenate each resulting ArrayList<String> to the original ArrayList using addAll()

This way, you have no synchronization overhead to deal with!
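The four steps above could be sketched roughly as follows, with the DynamoDB scan replaced by generated strings. SegmentTask and SegmentTaskDemo are hypothetical stand-ins for ScanSegmentTask and its caller, and call() is invoked directly for brevity; in the real code an executor would run it on worker threads.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;

// Hypothetical stand-in for ScanSegmentTask; no AWS calls are made.
class SegmentTask implements Callable<List<String>> {
    private final int segment;
    private final int itemCount;
    // Step 1: an instance field, so no other task can see this list.
    private final List<String> items = new ArrayList<>();

    SegmentTask(int segment, int itemCount) {
        this.segment = segment;
        this.itemCount = itemCount;
    }

    @Override
    public List<String> call() {
        // Step 2: add to the private list; no locking is needed.
        for (int i = 0; i < itemCount; i++) {
            items.add("segment" + segment + ":item" + i);
        }
        // Step 3: hand the finished list back to the caller.
        return items;
    }
}

class SegmentTaskDemo {
    static List<String> collect(int segments, int itemsPerSegment) {
        List<String> all = new ArrayList<>();
        for (int s = 0; s < segments; s++) {
            // Step 4: merge with addAll() on the calling thread only.
            all.addAll(new SegmentTask(s, itemsPerSegment).call());
        }
        return all;
    }

    public static void main(String[] args) {
        System.out.println(collect(3, 4).size()); // prints 12
    }
}
```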

This will require a few changes to your executor code. Instead of calling execute(), you'll need to call submit(). This returns a Future object (Future<ArrayList<String>> in your case) that holds the result of the call() method. You'll need to store these in some collection; an array or an ArrayList, it doesn't matter.

To retrieve the results, simply loop through the collection of Future objects and call get() (I think). This call will block until the task that the Future object corresponds to is complete.

I think that's it. While this is more complicated, I think this is the best performance you're going to get, as with enough threads either CPU contention or your network link will become the bottleneck. Please ask if you have any questions, and I'll update as needed.
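As a rough illustration of the submit-then-get pattern, the sketch below submits every task before it waits on any of them. The scan is again simulated with generated strings, and FutureCollectDemo and scanAll are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical demo class; each task builds strings instead of scanning a table segment.
class FutureCollectDemo {
    static List<String> scanAll(int totalSegments, int itemsPerSegment) {
        ExecutorService executor = Executors.newFixedThreadPool(totalSegments);
        List<Future<List<String>>> futures = new ArrayList<>();
        try {
            // Submit every task first so that they all run concurrently.
            for (int s = 0; s < totalSegments; s++) {
                final int segment = s;
                Callable<List<String>> task = () -> {
                    List<String> items = new ArrayList<>();
                    for (int i = 0; i < itemsPerSegment; i++) {
                        items.add("segment" + segment + ":item" + i);
                    }
                    return items;
                };
                futures.add(executor.submit(task));
            }
            // Only then wait: get() blocks until that particular task has finished.
            List<String> all = new ArrayList<>();
            for (Future<List<String>> future : futures) {
                all.addAll(future.get());
            }
            return all;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            executor.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(scanAll(4, 5).size()); // prints 20
    }
}
```

Calling get() inside the submission loop would wait for each task before the next one is submitted, which makes the scan sequential again. Because get() already blocks, no polling with isDone() or Thread.sleep() is required.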

 