DynamoDB Parallel Scan - Java Synchronization

I'm trying to use the DynamoDB Parallel Scan Example:

http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/LowLevelJavaScanning.html

I have 200,000 items. I took the sequential scan code and modified it slightly for my usage:

Map<String, AttributeValue> lastKeyEvaluated = null;
do
{
    ScanRequest scanRequest = new ScanRequest()
    .withTableName(tableName)
    .withExclusiveStartKey(lastKeyEvaluated);

    ScanResult result = client.scan(scanRequest);


    double counter = 0;
    for(Map<String, AttributeValue> item : result.getItems())
    {
        itemSerialize.add("Set:"+counter);
        for (Map.Entry<String, AttributeValue> getItem : item.entrySet()) 
        {
            String attributeName = getItem.getKey();
            AttributeValue value = getItem.getValue();

            itemSerialize.add(attributeName
                    + (value.getS() == null ? "" : ":" + value.getS())
                    + (value.getN() == null ? "" : ":" + value.getN())
                    + (value.getB() == null ? "" : ":" + value.getB())
                    + (value.getSS() == null ? "" : ":" + value.getSS())
                    + (value.getNS() == null ? "" : ":" + value.getNS())
                    + (value.getBS() == null ? "" : ":" + value.getBS()));
        }
        counter += 1;
    }

    lastKeyEvaluated = result.getLastEvaluatedKey();
}
while(lastKeyEvaluated != null);

The counter gives exactly 200,000 when this code has finished. However, I also wanted to try the parallel scan.

Function Call:

ScanSegmentTask task = null;
ArrayList<String> list = new ArrayList<String>();
try
{
    ExecutorService executor = Executors.newFixedThreadPool(numberOfThreads);
    int totalSegments = numberOfThreads;

    for (int segment = 0; segment < totalSegments; segment++) 
    {
        // Runnable task that will only scan one segment
        task = new ScanSegmentTask(tableName, itemLimit, totalSegments, segment, list);

        // Execute the task
        executor.execute(task);
    }
    shutDownExecutorService(executor);
}
.......Catches something if error
return list;

Class:

I have a static list that is shared by all the threads. I was able to retrieve the list and output the number of items in it.

// Runnable task for scanning a single segment of a DynamoDB table
private static class ScanSegmentTask implements Runnable 
{

    // DynamoDB table to scan
    private String tableName;

    // number of items each scan request should return
    private int itemLimit;

    // Total number of segments
    // Equals to total number of threads scanning the table in parallel
    private int totalSegments;

    // Segment that will be scanned with by this task
    private int segment;

    static ArrayList<String> list_2;

    Object lock = new Object();

    public ScanSegmentTask(String tableName, int itemLimit, int totalSegments, int segment, ArrayList<String> list) 
    {
        this.tableName = tableName;
        this.itemLimit = itemLimit;
        this.totalSegments = totalSegments;
        this.segment = segment;
        list_2 = list;
    }

    public void run() 
    {
        System.out.println("Scanning " + tableName + " segment " + segment + " out of " + totalSegments + " segments " + itemLimit + " items at a time...");
        Map<String, AttributeValue> exclusiveStartKey = null;
        int totalScannedItemCount = 0;
        int totalScanRequestCount = 0;
        int counter = 0;

        try 
        {
            while(true) 
            {
                ScanRequest scanRequest = new ScanRequest()
                    .withTableName(tableName)
                    .withLimit(itemLimit)
                    .withExclusiveStartKey(exclusiveStartKey)
                    .withTotalSegments(totalSegments)
                    .withSegment(segment);

                ScanResult result = client.scan(scanRequest);

                totalScanRequestCount++;
                totalScannedItemCount += result.getScannedCount();

                synchronized(lock)
                {
                    for(Map<String, AttributeValue> item : result.getItems())
                    {
                        list_2.add("Set:"+counter);
                        for (Map.Entry<String, AttributeValue> getItem : item.entrySet()) 
                        {
                            String attributeName = getItem.getKey();
                            AttributeValue value = getItem.getValue();

                            list_2.add(attributeName
                                    + (value.getS() == null ? "" : ":" + value.getS())
                                    + (value.getN() == null ? "" : ":" + value.getN())
                                    + (value.getB() == null ? "" : ":" + value.getB())
                                    + (value.getSS() == null ? "" : ":" + value.getSS())
                                    + (value.getNS() == null ? "" : ":" + value.getNS())
                                    + (value.getBS() == null ? "" : ":" + value.getBS()));
                        }
                        counter += 1;
                    }
                }

                exclusiveStartKey = result.getLastEvaluatedKey();
                if (exclusiveStartKey == null) 
                {
                    break;
                }
            }
        } 
        catch (AmazonServiceException ase) 
        {
            System.err.println(ase.getMessage());
        } 
        finally 
        {
            System.out.println("Scanned " + totalScannedItemCount + " items from segment " + segment + " out of " + totalSegments + " of " + tableName + " with " + totalScanRequestCount + " scan requests");
        }
    }
}

Executor Service Shut Down:

public static void shutDownExecutorService(ExecutorService executor) 
{
    executor.shutdown();
    try 
    {
        if (!executor.awaitTermination(10, TimeUnit.SECONDS)) 
        {
            executor.shutdownNow();
        }
    } 
    catch (InterruptedException e) 
    {
        executor.shutdownNow();
        Thread.currentThread().interrupt();
    }
}

However, the number of items changes every time I run this code (it varies around 60,000 in total, about 6,000 per thread, with 10 threads created). Removing the synchronization does not change the result either.

Is there a bug with the synchronization or with the Amazon AWS API?

Thanks All

EDIT:

The new function call:

ScanSegmentTask task = null;
ArrayList<String> list = new ArrayList<String>();

try
{
    ExecutorService executor = Executors.newFixedThreadPool(numberOfThreads);
    int totalSegments = numberOfThreads;

    for (int segment = 0; segment < totalSegments; segment++) 
    {
        // Runnable task that will only scan one segment
        task = new ScanSegmentTask(tableName, itemLimit, totalSegments, segment);

        // Execute the task
        Future<ArrayList<String>> future = executor.submit(task);

        list.addAll(future.get());
    }
    shutDownExecutorService(executor);
}

The new class:

// Runnable task for scanning a single segment of a DynamoDB table
private static class ScanSegmentTask implements Callable<ArrayList<String>>
{

    // DynamoDB table to scan
    private String tableName;

    // number of items each scan request should return
    private int itemLimit;

    // Total number of segments
    // Equals to total number of threads scanning the table in parallel
    private int totalSegments;

    // Segment that will be scanned with by this task
    private int segment;

    ArrayList<String> list_2 = new ArrayList<String>();

    static int counter = 0;

    public ScanSegmentTask(String tableName, int itemLimit, int totalSegments, int segment)
    {
        this.tableName = tableName;
        this.itemLimit = itemLimit;
        this.totalSegments = totalSegments;
        this.segment = segment;
    }

    @SuppressWarnings("finally")
    public ArrayList<String> call() 
    {
        System.out.println("Scanning " + tableName + " segment " + segment + " out of " + totalSegments + " segments " + itemLimit + " items at a time...");
        Map<String, AttributeValue> exclusiveStartKey = null;

        try 
        {
            while(true) 
            {
                ScanRequest scanRequest = new ScanRequest()
                    .withTableName(tableName)
                    .withLimit(itemLimit)
                    .withExclusiveStartKey(exclusiveStartKey)
                    .withTotalSegments(totalSegments)
                    .withSegment(segment);

                ScanResult result = client.scan(scanRequest);

                for(Map<String, AttributeValue> item : result.getItems())
                {
                    list_2.add("Set:"+counter);
                    for (Map.Entry<String, AttributeValue> getItem : item.entrySet()) 
                    {
                        String attributeName = getItem.getKey();
                        AttributeValue value = getItem.getValue();

                        list_2.add(attributeName
                                + (value.getS() == null ? "" : ":" + value.getS())
                                + (value.getN() == null ? "" : ":" + value.getN())
                                + (value.getB() == null ? "" : ":" + value.getB())
                                + (value.getSS() == null ? "" : ":" + value.getSS())
                                + (value.getNS() == null ? "" : ":" + value.getNS())
                                + (value.getBS() == null ? "" : ":" + value.getBS()));
                    }
                    counter += 1;
                }

                exclusiveStartKey = result.getLastEvaluatedKey();
                if (exclusiveStartKey == null) 
                {
                    break;
                }
            }
        } 
        catch (AmazonServiceException ase) 
        {
            System.err.println(ase.getMessage());
        } 
        finally 
        {
            return list_2;
        }
    }
}

Final EDIT:

Function Call:

ScanSegmentTask task = null;
ArrayList<String> list = new ArrayList<String>();
ArrayList<Future<ArrayList<String>>> holdFuture = new ArrayList<Future<ArrayList<String>>>();

try
{
    ExecutorService executor = Executors.newFixedThreadPool(numberOfThreads);
    int totalSegments = numberOfThreads;

    for (int segment = 0; segment < totalSegments; segment++) 
    {
        // Runnable task that will only scan one segment
        task = new ScanSegmentTask(tableName, itemLimit, totalSegments, segment);

        // Execute the task
        Future<ArrayList<String>> future = executor.submit(task);
        holdFuture.add(future);
    }

    for (int i = 0 ; i < holdFuture.size(); i++)
    {
        boolean flag = false;
        while(flag == false)
        {
            Thread.sleep(1000);
            if(holdFuture.get(i).isDone())
            {
                list.addAll(holdFuture.get(i).get());
                flag = true;
            }
        }
    }
    shutDownExecutorService(executor);
}

Class:

private static class ScanSegmentTask implements Callable<ArrayList<String>>
{

    // DynamoDB table to scan
    private String tableName;

    // number of items each scan request should return
    private int itemLimit;

    // Total number of segments
    // Equals to total number of threads scanning the table in parallel
    private int totalSegments;

    // Segment that will be scanned with by this task
    private int segment;

    ArrayList<String> list_2 = new ArrayList<String>();

    static AtomicInteger counter = new AtomicInteger(0);

    public ScanSegmentTask(String tableName, int itemLimit, int totalSegments, int segment)
    {
        this.tableName = tableName;
        this.itemLimit = itemLimit;
        this.totalSegments = totalSegments;
        this.segment = segment;
    }

    @SuppressWarnings("finally")
    public ArrayList<String> call() 
    {
        System.out.println("Scanning " + tableName + " segment " + segment + " out of " + totalSegments + " segments " + itemLimit + " items at a time...");
        Map<String, AttributeValue> exclusiveStartKey = null;

        try 
        {
            while(true) 
            {
                ScanRequest scanRequest = new ScanRequest()
                    .withTableName(tableName)
                    .withLimit(itemLimit)
                    .withExclusiveStartKey(exclusiveStartKey)
                    .withTotalSegments(totalSegments)
                    .withSegment(segment);

                ScanResult result = client.scan(scanRequest);

                for(Map<String, AttributeValue> item : result.getItems())
                {
                    list_2.add("Set:"+counter);
                    for (Map.Entry<String, AttributeValue> getItem : item.entrySet()) 
                    {
                        String attributeName = getItem.getKey();
                        AttributeValue value = getItem.getValue();

                        list_2.add(attributeName
                                + (value.getS() == null ? "" : ":" + value.getS())
                                + (value.getN() == null ? "" : ":" + value.getN())
                                + (value.getB() == null ? "" : ":" + value.getB())
                                + (value.getSS() == null ? "" : ":" + value.getSS())
                                + (value.getNS() == null ? "" : ":" + value.getNS())
                                + (value.getBS() == null ? "" : ":" + value.getBS()));
                    }
                    counter.addAndGet(1);
                }

                exclusiveStartKey = result.getLastEvaluatedKey();
                if (exclusiveStartKey == null) 
                {
                    break;
                }
            }
        } 
        catch (AmazonServiceException ase) 
        {
            System.err.println(ase.getMessage());
        } 
        finally 
        {
            return list_2;
        }
    }
}

OK, I believe the issue is in the way you synchronized.

In your case the lock is pretty much pointless: lock is an instance field, so each ScanSegmentTask has its own lock object, and synchronizing on it never blocks another thread from running the same piece of code. I believe this is why removing the synchronization does not change the result: it never had an effect in the first place.
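To make the difference concrete, here is a minimal sketch that is not taken from the question: SharedLockDemo, Worker, and both helper methods are hypothetical names invented for illustration. It shows that two task instances each hold a distinct instance lock, and that a single static lock shared by all tasks does give mutual exclusion, so a shared counter ends up with the exact total.

```java
import java.util.ArrayList;
import java.util.List;

final class SharedLockDemo {
    // One lock object for the whole class: every worker contends on the same monitor.
    private static final Object SHARED_LOCK = new Object();
    private static int total = 0;

    static final class Worker implements Runnable {
        // Per-instance lock, as in the question: each Worker gets its own object.
        final Object instanceLock = new Object();
        private final int increments;

        Worker(int increments) {
            this.increments = increments;
        }

        @Override
        public void run() {
            for (int i = 0; i < increments; i++) {
                // Guarded by the shared lock, so increments are never lost.
                synchronized (SHARED_LOCK) {
                    total++;
                }
            }
        }
    }

    // Runs the given number of threads and returns the final counter value.
    static int countWithSharedLock(int threads, int increments) {
        synchronized (SHARED_LOCK) {
            total = 0;
        }
        List<Thread> started = new ArrayList<>();
        for (int t = 0; t < threads; t++) {
            Thread thread = new Thread(new Worker(increments));
            thread.start();
            started.add(thread);
        }
        for (Thread thread : started) {
            try {
                thread.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("Interrupted while waiting for workers", e);
            }
        }
        synchronized (SHARED_LOCK) {
            return total;
        }
    }

    // Two tasks never share an instance lock, so it cannot exclude other tasks.
    static boolean instanceLocksAreDistinct() {
        return new Worker(1).instanceLock != new Worker(1).instanceLock;
    }

    public static void main(String[] args) {
        System.out.println(instanceLocksAreDistinct());
        System.out.println(countWithSharedLock(8, 10_000));
    }
}
```

Synchronizing on instanceLock instead of SHARED_LOCK inside run() would compile just as well, but nothing would stop two workers from updating total at the same time.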

I believe your issue is in fact due to the static ArrayList<String> that's shared by your threads. ArrayList is not thread-safe, so concurrent operations on it are not guaranteed to succeed; as a result, you have to synchronize every access to it. Without proper synchronization, two threads can each add something to an empty ArrayList, yet the resulting ArrayList can have a size of 1, because one update overwrites the other.

As I said before, while you do have a synchronized block, it really isn't doing anything. You could synchronize on list_2 instead, but that would make your threads append their results one at a time, since the lock on the ArrayList is held while a whole page of items is copied into the list.

There are a few solutions to this. You can use Collections.synchronizedList(list_2) to create a synchronized wrapper around your ArrayList. This way, adding to the list is guaranteed to succeed. However, this incurs a synchronization cost per operation, and so isn't ideal.
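A minimal sketch of the synchronized-wrapper option follows. It assumes nothing about DynamoDB: SynchronizedListDemo and fill are hypothetical names, and the strings added stand in for the serialized items. Every thread appends to one shared list created with Collections.synchronizedList, so no add is lost.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

final class SynchronizedListDemo {

    // Fills one shared, synchronized list from several threads and returns it.
    static List<String> fill(int threads, int itemsPerThread) {
        List<String> shared = Collections.synchronizedList(new ArrayList<String>());
        ExecutorService executor = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            final int segment = t;
            executor.execute(() -> {
                for (int i = 0; i < itemsPerThread; i++) {
                    // Each add() takes the wrapper's internal lock.
                    shared.add("segment " + segment + " item " + i);
                }
            });
        }
        executor.shutdown();
        try {
            // Wait for the workers before reading the list.
            if (!executor.awaitTermination(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("Workers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for workers", e);
        }
        return shared;
    }

    public static void main(String[] args) {
        System.out.println(fill(10, 1_000).size());
    }
}
```

Note that the wrapper only makes individual calls atomic. Items from different threads still interleave in the list, and iterating over it still requires synchronizing on the list manually.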

What I would do is have ScanSegmentTask implement Callable (technically Callable<ArrayList<String>>). The Callable interface is almost exactly like the Runnable interface, except that its method is call(), which returns a value.

Why is this important? I think that what would produce the best results for you is this:

  1. Make list_2 an instance variable, initialized to a blank list
  2. Have each thread add to this list exactly as you have done
  3. Return list_2 when you are done
  4. Concatenate each resulting ArrayList<String> to the original ArrayList using addAll()

This way, you have no synchronization overhead to deal with!
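The four steps above can be sketched as follows. This is an illustration under stated assumptions, not the actual scan code: SegmentTask is a hypothetical stand-in for ScanSegmentTask that makes no DynamoDB call and simply pretends that item i belongs to segment i % totalSegments. It uses ExecutorService.invokeAll, which is one convenient way to run all tasks and wait for them.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class CallableSegmentDemo {

    // Hypothetical stand-in for ScanSegmentTask; no real scan is performed.
    static final class SegmentTask implements Callable<ArrayList<String>> {
        private final int totalItems;
        private final int totalSegments;
        private final int segment;
        // Step 1: an instance list that only this task touches.
        private final ArrayList<String> results = new ArrayList<>();

        SegmentTask(int totalItems, int totalSegments, int segment) {
            this.totalItems = totalItems;
            this.totalSegments = totalSegments;
            this.segment = segment;
        }

        @Override
        public ArrayList<String> call() {
            // Step 2: add this segment's items (simulated data, an assumption).
            for (int i = segment; i < totalItems; i += totalSegments) {
                results.add("item-" + i);
            }
            // Step 3: hand the private list back to the caller.
            return results;
        }
    }

    static ArrayList<String> scanAll(int totalItems, int totalSegments) {
        ExecutorService executor = Executors.newFixedThreadPool(totalSegments);
        try {
            List<SegmentTask> tasks = new ArrayList<>();
            for (int s = 0; s < totalSegments; s++) {
                tasks.add(new SegmentTask(totalItems, totalSegments, s));
            }
            ArrayList<String> merged = new ArrayList<>();
            // invokeAll returns only after every task has completed.
            for (Future<ArrayList<String>> future : executor.invokeAll(tasks)) {
                // Step 4: merge on the calling thread, so no locking is needed.
                merged.addAll(future.get());
            }
            return merged;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while scanning", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("A segment task failed", e.getCause());
        } finally {
            executor.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(scanAll(200_000, 10).size());
    }
}
```

Because each list is written by exactly one thread and read only after its task has finished, the merge needs no synchronized block at all.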

This will require a few changes to your executor code. Instead of calling execute(), you'll need to call submit(). This returns a Future object (Future<ArrayList<String>> in your case) that holds the result of the call() method. You'll need to store these in some collection; an array or an ArrayList both work.

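To retrieve the results, simply loop through the collection of Future objects and call get() on each one. This call blocks until the task that the Future object corresponds to is complete, so there is no need to poll isDone() first. Make sure every task has been submitted before the first call to get(); otherwise the tasks end up running one after another.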
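A minimal sketch of the submit-then-collect pattern, using trivial tasks that just square a number. FutureCollectDemo and runAll are hypothetical names chosen for the example. All tasks are submitted before the first get(), and get() itself does the waiting.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class FutureCollectDemo {

    // Runs taskCount small tasks in parallel and returns their results
    // in submission order.
    static List<Integer> runAll(int taskCount) {
        ExecutorService executor = Executors.newFixedThreadPool(taskCount);
        try {
            // Phase 1: submit everything so the tasks can run concurrently.
            List<Future<Integer>> futures = new ArrayList<>();
            for (int i = 0; i < taskCount; i++) {
                final int id = i;
                Callable<Integer> task = () -> id * id;
                futures.add(executor.submit(task));
            }
            // Phase 2: collect. get() blocks until that task is done,
            // so no sleep/isDone() polling loop is required.
            List<Integer> results = new ArrayList<>();
            for (Future<Integer> future : futures) {
                results.add(future.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for results", e);
        } catch (ExecutionException e) {
            // get() wraps any exception thrown inside call().
            throw new IllegalStateException("A task failed", e.getCause());
        } finally {
            executor.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(runAll(4));
    }
}
```

Calling get() inside the same loop that calls submit() would also produce the right results, but each task would have to finish before the next one is even submitted.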

I think that's it. While this is more complicated, I think it is the best performance you're going to get, since with enough threads either CPU contention or your network link will become the bottleneck. Please ask if you have any questions, and I'll update as needed.
