
Best way to convert thread safe collection to DataTable?

So here is the scenario:

I have to take a group of data, process it, build an object and then insert those objects into a database.

To increase performance, I am multi-threading the processing of the data using a parallel loop and storing the objects in a ConcurrentBag collection.

That part works fine. However, the issue is that I now need to take that list, convert it into a DataTable object, and insert the data into the database. It feels very ugly and I don't think I'm doing this in the best way possible (pseudo-code below):

ConcurrentBag<FinalObject> bag = new ConcurrentBag<FinalObject>();

ParallelOptions parallelOptions = new ParallelOptions();
parallelOptions.MaxDegreeOfParallelism = Environment.ProcessorCount;

Parallel.ForEach(allData, parallelOptions, dataObj =>
{   
    .... Process data ....

    bag.Add(theData);

    Thread.Sleep(100);
});

DataTable table = createTable();
foreach(FinalObject moveObj in bag) {
    table.Rows.Add(moveObj.x);
}

This is a good candidate for PLINQ (or Rx - I'll focus on PLINQ since it's part of the Base Class Library).

IEnumerable<FinalObject> bag = allData
    .AsParallel()
    .WithDegreeOfParallelism(Environment.ProcessorCount)
    .Select(dataObj =>
    {
        FinalObject theData = Process(dataObj);

        Thread.Sleep(100);

        return theData;
    });

DataTable table = createTable();

foreach (FinalObject moveObj in bag)
{
    table.Rows.Add(moveObj.x);
}

Realistically, instead of throttling the loop via Thread.Sleep, you should limit the maximum degree of parallelism further until CPU usage comes down to the desired level.
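For example, here is a minimal sketch of the same PLINQ query with the artificial delay removed and a lower parallelism cap; halving the processor count is only an illustrative value, not a recommendation, and the createTable/foreach part of the code stays unchanged:

// Same pipeline, throttled by capping parallelism instead of sleeping.
// Tune throttledDegree until CPU usage hits your target.
int throttledDegree = Math.Max(1, Environment.ProcessorCount / 2);

IEnumerable<FinalObject> bag = allData
    .AsParallel()
    .WithDegreeOfParallelism(throttledDegree)
    .Select(dataObj => Process(dataObj)); // no Thread.Sleep needed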

Disclaimer: all of the below is meant for entertainment only, although it does actually work.

Of course you can always kick it up a notch and produce a full-on async Parallel.ForEach implementation that allows you to process input in parallel and do your throttling asynchronously, without blocking any thread pool threads.

async Task ParallelForEachAsync<TInput, TResult>(IEnumerable<TInput> input,
                                                 int maxDegreeOfParallelism,
                                                 Func<TInput, Task<TResult>> body,
                                                 Action<TResult> onCompleted)
{
    Queue<TInput> queue = new Queue<TInput>(input);

    if (queue.Count == 0)
    {
        return;
    }

    List<Task<TResult>> tasksInFlight = new List<Task<TResult>>(maxDegreeOfParallelism);

    do
    {
        while (tasksInFlight.Count < maxDegreeOfParallelism && queue.Count != 0)
        {
            TInput item = queue.Dequeue();
            Task<TResult> task = body(item);

            tasksInFlight.Add(task);
        }

        Task<TResult> completedTask = await Task.WhenAny(tasksInFlight).ConfigureAwait(false);

        tasksInFlight.Remove(completedTask);

        TResult result = completedTask.GetAwaiter().GetResult(); // We know the task has completed. No need for await.

        onCompleted(result);
    }
    while (queue.Count != 0 || tasksInFlight.Count != 0);
}

Usage (full Fiddle here):

async Task<DataTable> ProcessAllAsync(IEnumerable<InputObject> allData)
{
    DataTable table = CreateTable();
    int maxDegreeOfParallelism = Environment.ProcessorCount;

    await ParallelForEachAsync(
        allData,
        maxDegreeOfParallelism,
        // Loop body: these Tasks will run in parallel, up to {maxDegreeOfParallelism} at any given time.
        async dataObj =>
        {
            FinalObject o = await Task.Run(() => Process(dataObj)).ConfigureAwait(false); // Thread pool processing.

            await Task.Delay(100).ConfigureAwait(false); // Artificial throttling.

            return o;
        },
        // Completion handler: these will be executed one at a time, and can safely mutate shared state.
        moveObj => table.Rows.Add(moveObj.x)
    );

    return table;
}

struct InputObject
{
    public int x;
}

struct FinalObject
{
    public int x;
}

FinalObject Process(InputObject o)
{
    // Simulate synchronous work.
    Thread.Sleep(100);

    return new FinalObject { x = o.x };
}

Same behaviour, but without Thread.Sleep and ConcurrentBag<T>.

I think something like this should give better performance. object[] looks like a better option than DataRow, since you need a DataTable instance to create a DataRow object.

ConcurrentBag<object[]> bag = new ConcurrentBag<object[]>();

Parallel.ForEach(allData, 
    new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount }, 
    dataObj =>
{
    object[] row = new object[colCount];

    //do processing

    bag.Add(row);

    Thread.Sleep(100);
});

DataTable table = createTable();
foreach (object[] row in bag)
{
    table.Rows.Add(row);
}

Sounds like you've complicated things quite a bit by trying to make everything run in parallel, but if you store DataRow objects in your bag instead of plain objects, at the end you can use DataTableExtensions to create a DataTable from a generic collection quite easily:

var dataTable = bag.CopyToDataTable();

Just add a reference to System.Data.DataSetExtensions in your project.
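A minimal sketch of that approach, assuming the createTable()/Process() helpers from the earlier snippets and a target table with a single column named "x"; note that DataTable is only thread-safe for reads, so the NewRow call is guarded with a lock here:

// Requires System.Data, System.Data.DataSetExtensions and System.Collections.Concurrent.
ConcurrentBag<DataRow> rows = new ConcurrentBag<DataRow>();
DataTable schema = createTable();   // used only as a row factory / schema holder
object rowLock = new object();

Parallel.ForEach(allData, dataObj =>
{
    FinalObject processed = Process(dataObj);

    DataRow row;
    lock (rowLock)                  // NewRow mutates the table's internal state, so serialize it
    {
        row = schema.NewRow();
    }
    row["x"] = processed.x;         // assumes the schema has a column named "x"

    rows.Add(row);                  // rows stay detached; they are never added to 'schema' itself
});

// CopyToDataTable clones the schema and copies each row's values into a new DataTable.
DataTable dataTable = rows.CopyToDataTable();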
