简体   繁体   中英

C# masking an array to exclude indexes as fast as in python

What I have:

//This data set contains columns (second index) having the same value in each row (first index)
double[][] dataSet = new double[][]
{
  new double[] {1, 2, 3, 4},
  new double[] {5, 6, 7, 4},
  new double[] {8, 9, 10, 4},
}; 

What i want to get:

// This data set has no column where the value in each row is the same
double[][] reducedDataSet = new double[][]
{
  new double[] {1, 2, 3},
  new double[] {5, 6, 7},
  new double[] {8, 9, 10},
}; 

In python this can be easily done by:

all_equal_value_indices = numpy.all(data_set == data_set[0, :], axis=0) // Finds the indices of all columns that have equal values in each row
reduced_data_set = data_set[:, ~all_equal_value_indices] // Takes all rows with only those columns where all_equal_value_indices is not 1

In C# I can get an array containing the indices that should be excluded relatively fast, but how can I use these indices as mask to get only those columns not contained in these indices?

What i tried:

var numberOfDeletedColumns = 0;
var reducedDataSet = dataSet;

foreach (var columnToDelete in columnsToDelete)
{
  reducedDataSet = reducedDataSet.RemoveColumn(columnToDelete - numberOfDeletedColumns++);
}

RemoveColumn is an extension provided by Accord.Net and has the following code:

/// <summary>Returns a new matrix without one of its columns.</summary>
public static T[][] RemoveColumn<T>(this T[][] matrix, int index)
{
  T[][] objArray = new T[matrix.Length][];
  for (int index1 = 0; index1 < matrix.Length; ++index1)
  {
    objArray[index1] = new T[matrix[index1].Length - 1];
    for (int index2 = 0; index2 < index; ++index2)
      objArray[index1][index2] = matrix[index1][index2];
    for (int index2 = index + 1; index2 < matrix[index1].Length; ++index2)
      objArray[index1][index2 - 1] = matrix[index1][index2];
  }
  return objArray;
}

But this is much slower than the implementation in python. Could someone suggest a faster method to achieve a reduced data set?

Sometimes, components don't work nicely in aggregate. In this case, your remove column function is reallocating the entire matrix, so the operation is linear in the number of columns you want to remove (ouch). To fix this, remove all the columns in one pass.

class Program
{
    static void Main(string[] args)
    {
        var matrix = new[]
        {
            new [] {1, 2, 3, 4, 5},
            new [] {1, 2, 3, 4, 5},
            new [] {1, 2, 3, 4, 5},
            new [] {1, 2, 3, 4, 5},
        };

        var result = matrix.DeleteColums(new [] {0, 2, 4});

        foreach (var row in result)
        {
            foreach (var column in row)
            {
                Console.Write(column);
                Console.Write(" ");
            }
            Console.WriteLine();
        }

        Console.ReadKey();
    }
}

static class MatrixHelper
{
    public static T[][] DeleteColums<T>(this T[][] matrix, int[] columns)
    {
        var sorted = columns.Distinct().OrderBy(e => e).Concat(new [] { int.MaxValue }).ToArray();
        var target = new T[matrix.Length][];

        for (int row = 0; row < matrix.Length; row++)
        {
            var sourceRow = matrix[row];
            var targetRow = new T[sourceRow.Length - columns.Length];
            var sortedIndex = 0;
            for (int i = 0; i < sourceRow.Length; i++)
            {
                if (i == sorted[sortedIndex])
                {
                    sortedIndex++;
                    continue;
                }

                targetRow[i - sortedIndex] = sourceRow[i];
            }
            target[row] = targetRow;
        }

        return target;
    }
}

If this isn't enough, then you'll need to think about whether you need to use arrays. For instance, you could have a data structure for your matrix that dynamically masks columns, instead of an array-of-arrays.

UPDATE:

Given that other solutions in this page have assumed that the matrix, despite being represented by a jagged array, has the same indices per row, I thought I would give another go at making a faster solution. Here are two solutions that beat all previous ones in this thread, under those assumptions, including a faster parallel one.

public static T[][] DeleteColumns<T>(this T[][] matrix, int[] columns)
    {
        if (matrix.Length == 0) return matrix;

        //Previous code assumed matrix could be jagged - new code assumes all columns 
        //present and all rows same length
        var rowLength = matrix[0].Length;

        if (rowLength == 0) return matrix;

        var sorted = columns.Distinct().ToArray();
        var target = new T[matrix.Length][];
        var remainingLength = rowLength - sorted.Length;

        //Allocate the targets all in one go - to avoid doing allocation in parallel.
        for (var row = 0; row < matrix.Length; row++)
        {
            target[row] = new T[remainingLength];
        }

        //Work out remaining columns (previous code assumed these could 
        //be different per row, this assumes all rows have the same
        //contents.
        var remaining = Enumerable.Range(0, rowLength).Except(sorted).ToArray();

        for (int row = 0; row < matrix.Length; row++)
        {
            var sourceRow = matrix[row];
            var targetRow = target[row];
            for (int i = 0; i < targetRow.Length; i++)
            {
                targetRow[i] = sourceRow[remaining[i]];
            }
        }

        return target;
    }

And the faster parallel one (allocation for the parallel one is now about 90% of the total time):

public static T[][] DeleteColumnsParallel<T>(this T[][] matrix, int[] columns)
    {
        if (matrix.Length == 0) return matrix;

        //Previous code assumed matrix could be jagged - new code assumes all columns 
        //present and all rows same length
        var rowLength = matrix[0].Length;

        if (rowLength == 0) return matrix;

        var sorted = columns.Distinct().ToArray();
        var target = new T[matrix.Length][];
        var remainingLength = rowLength - sorted.Length;

        //Allocate the targets all in one go - to avoid doing allocation in parallel.
        for (var row = 0; row < matrix.Length; row++)
        {
            target[row] = new T[remainingLength];
        }

        //Work out remaining columns (previous code assumed these could 
        //be different per row, this assumes all rows have the same
        //contents.
        var remaining = Enumerable.Range(0, rowLength).Except(sorted).ToArray();

        Parallel.For(0, matrix.Length, row =>
        {
            var sourceRow = matrix[row];
            var targetRow = target[row];
            for (int i = 0; i < targetRow.Length; i++)
            {
                targetRow[i] = sourceRow[remaining[i]];
            }
        });

        return target;
    }

Results for 10000x10000 matrix with half the columns randomly removed.

  • My previous attempt: 1300ms
  • Best previous attempt: 450ms
  • My new serial version: 390ms
  • My new parallel version: 310ms
  • Time just to allocate the result matrix (ie lower bound on best achievable time): 265ms

But, I think it's important to call out a far, far faster solution. At the moment, in the fastest parallel solution, 90% of the time is spent allocating memory. If, on the other hand, you were to make a Matrix class that had it's own indexer, you would be able to dynamically pretend that certain columns of the underlying data structure didn't exist. Depending on how you're using the matrices, versus how often you're masking rows or columns, this could be dramatically faster.

Array.Copy helps it run about 2x faster on my computer.

static T[][] FastRemoveColumn<T>(T[][] matrix, int index)
{
    T[][] objArray = new T[matrix.Length][];
    for (int i = 0; i < matrix.Length; i++)
    {
        var line = matrix[i];
        var reducedline = new T[line.Length - 1];
        Array.Copy(line, 0, reducedline, 0, index);
        Array.Copy(line, index + 1, reducedline, index, line.Length - index - 1);
        objArray[i] = reducedline;                
    }
    return objArray;
}

and I also tried multithread. It runs very slow:

static T[][] MultiThreadRemoveColumn<T>(T[][] matrix, int index)
{
    T[][] objArray = new T[matrix.Length][];
    Parallel.For(0, matrix.Length, i =>
    {
        var line = matrix[i];
        var reducedline = new T[line.Length - 1];
        Array.Copy(line, 0, reducedline, 0, index);
        Array.Copy(line, index + 1, reducedline, index, line.Length - index - 1);
        objArray[i] = reducedline;                
    });
    return objArray;
}

Test:

// init
double[][] arr = new double[2000][];
for (int i = 0; i < arr.Length; i++)            
    arr[i] = new double[2000];

double v = 0;
for (int i = 0; i < arr.Length; i++)
{
    for (int j = 0; j < arr[i].Length; j++)
    {
        arr[i][j] = v;
        v++;
    }
}

Stopwatch sw = Stopwatch.StartNew();
var reducedArr = RemoveColumn(arr, 200);
sw.Stop();
Console.WriteLine(sw.ElapsedMilliseconds);
sw.Restart();
var reducedArr2 = FastRemoveColumn(arr, 200);    
sw.Stop();        
Console.WriteLine(sw.ElapsedMilliseconds);
sw.Restart();
var reducedArr3 = MultiThreadRemoveColumn(arr, 200); 
sw.Stop();     
Console.WriteLine(sw.ElapsedMilliseconds);

// Check the result
for (int i = 0; i < reducedArr.Length; i++)
{
    for (int j = 0; j < reducedArr[i].Length; j++)
    {
        if(reducedArr[i][j] != reducedArr2[i][j]) throw new Exception();
        if(reducedArr[i][j] != reducedArr3[i][j]) throw new Exception();   
    }
}

Update

Solution to remove several columns:

public static T[][] DeleteColumns<T>(T[][] matrix, int[] columns)
{
    if (matrix.Length == 0) return new T[0][];
    bool[] delColumns = new bool[matrix[0].Length];
    foreach (int col in columns) delColumns[col] = true;
    List<int> remainCols = new List<int>();
    for (int i = 0; i < delColumns.Length; i++)
    {
        if (!delColumns[i]) remainCols.Add(i);
    }
    var target = new T[matrix.Length][];
    for (int rowIndex = 0; rowIndex < matrix.Length; rowIndex++)
    {
        T[] sourceRow = matrix[rowIndex];
        T[] targetRow = new T[remainCols.Count];
        for (int i = 0; i < remainCols.Count; i++)
        {
            targetRow[i] = sourceRow[remainCols[i]];
        }
        target[rowIndex] = targetRow;    
    }
    return target;
}

Test on a 2000x2000 matrix. Comparing with Adam Brown's solution, testing removing all columns is absolutely unfair, but my solution is faster even if removing only one column.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM