SSIS add only rows that changed

I have a project that consists of importing all the users (including all their properties) from an Active Directory domain into a SQL Server table. This table will be used by a Reporting Services application.

The table model has the following columns:

- ID: a unique identifier that is generated automatically.
- distinguishedName: contains the LDAP distinguishedName attribute of the user.
- attribute_name: contains the name of the user property.
- attribute_value: contains the property value.
- timestamp: contains a datetime value that is generated automatically.

I have created an SSIS package with a Script Task containing C# code that exports all the data to a .CSV file, which is later imported into the table by a Data Flow task. The project works without any problem, but it generates more than 2 million rows (the AD domain has around 30,000 users, and each user has between 100 and 200 properties).
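
For reference, here is a minimal sketch of that kind of export, assuming a DirectorySearcher-based query inside the Script Task; the LDAP path, output file, and ';' delimiter are placeholders for illustration, not the actual package code:

using System;
using System.DirectoryServices;
using System.IO;

public static void ExportAdUsersToCsv()
{
    //Hypothetical export: the LDAP path, output file, and delimiter are assumptions
    using (DirectoryEntry entry = new DirectoryEntry("LDAP://DC=example,DC=com"))
    using (DirectorySearcher searcher = new DirectorySearcher(entry, "(&(objectCategory=person)(objectClass=user))"))
    using (StreamWriter writer = new StreamWriter(@"C:\temp\ad_users.csv"))
    {
        searcher.PageSize = 1000; //paging is required to get more than 1000 results

        using (SearchResultCollection results = searcher.FindAll())
        {
            foreach (SearchResult result in results)
            {
                string dn = result.Properties["distinguishedName"][0].ToString();

                //one CSV row per {user, property, value}
                foreach (string propertyName in result.Properties.PropertyNames)
                {
                    foreach (object value in result.Properties[propertyName])
                    {
                        writer.WriteLine(dn + ";" + propertyName + ";" + value);
                    }
                }
            }
        }
    }
}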

The SSIS package should run every day and import data only when a new user property exists or a property value has changed.

In order to do this, I created a data flow which copies the entire table into a recordset.

This recordset is converted to a DataTable and used in a Script Component step, which verifies whether the current row exists in the DataTable. If the row exists, it compares the property values and sends the row to the output only when the values are different or when the row is not found in the DataTable. This is the code:

public override void Input0_ProcessInputRow(Input0Buffer Row)
{
    bool processRow = compareValues(Row);

    if (processRow)
    {
        //Direct to output 0
        Row.OutdistinguishedName = Row.distinguishedName.ToString();
        Row.Outattributename = Row.AttributeName.ToString();
        Row.Outattributevalue.AddBlobData(System.Text.Encoding.UTF8.GetBytes(Row.AttributeValue.ToString()));
    }
}

public bool compareValues(Input0Buffer Row)
{
    //Variable declaration
    DataTable dtHostsTbl = (DataTable)Variables.dataTableTbl;
    string expression = "", distinguishedName = Row.distinguishedName.ToString(), attribute_name = Row.AttributeName.ToString(), attribute_value = Row.AttributeValue.ToString();
    DataRow[] foundRowsHost = null;

    //Query datatable (use = instead of LIKE since no wildcards are needed, and escape embedded quotes)
    expression = "distinguishedName = '" + distinguishedName.Replace("'", "''") + "' AND attribute_name = '" + attribute_name.Replace("'", "''") + "'";
    foundRowsHost = dtHostsTbl.Select(expression);

    //Process found row
    if (foundRowsHost.Length > 0)
    {
        //Row exists: process it only if the stored attribute value differs
        return !foundRowsHost[0][2].ToString().Equals(attribute_value);
    }
    else
    {
        //Row not found: this is a new property, so process it
        return true;
    }
}

The code is working, but it's extremely slow. Is there any better way of doing this?

Here are some ideas:

Option A. (actually a combination of options)

  1. Eliminate unnecessary data when querying Active Directory by filtering on the whenChanged attribute. This alone should reduce the number of records significantly; a filter sketch follows this list. If filtering by whenChanged is not possible, or in addition to it, consider the following steps.

  2. Instead of importing all existing records into a Recordset Destination, import them into a Cache Transform. Then use this Cache Transform in the Cache connection managers of two Lookup components. One Lookup component verifies whether the {distinguishedName, attribute_name} combination exists (a miss here means an insert). The other Lookup component verifies whether the {distinguishedName, attribute_name, attribute_value} combination exists (a miss here means an update, or a delete/insert). This pair of lookups should replace your "Skip rows which are in the table" Script Component; the keyed-lookup sketch after this list illustrates the same idea in script form.

  3. Evaluate whether it is possible to reduce the sizes of your attribute_name and attribute_value columns. nvarchar(max) in particular often spoils the party.

  4. If you cannot reduce the size of attribute_name and attribute_value, consider storing their hashes and comparing the hashes instead of the values themselves; see the hashing sketch after this list.

  5. Remove the CSV step: feed the data from the source that currently populates the CSV directly into the lookups in one data flow, and send whatever the lookups do not find to your OLE DB Destination component.
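
For step 1, a sketch of the incremental filter, assuming the Script Task queries AD through DirectorySearcher; the domain path and the 24-hour window are placeholders. One caveat: whenChanged is maintained per domain controller and is not replicated, so it is safer to always query the same DC or to use an overlapping window.

using System;
using System.DirectoryServices;

//Hypothetical incremental query: whenChanged takes LDAP generalized-time
//format (yyyyMMddHHmmss.0Z); the window matches a daily package run
string since = DateTime.UtcNow.AddDays(-1).ToString("yyyyMMddHHmmss") + ".0Z";

using (DirectoryEntry entry = new DirectoryEntry("LDAP://DC=example,DC=com"))
using (DirectorySearcher searcher = new DirectorySearcher(entry))
{
    searcher.Filter = "(&(objectCategory=person)(objectClass=user)(whenChanged>=" + since + "))";
    searcher.PageSize = 1000;
    //FindAll() now returns only users modified since the previous run
}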
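
The Lookup components of step 2 are configured in the SSIS designer rather than in code, but the keyed-lookup idea behind them can also be applied directly to the question's Script Component: build a dictionary once instead of calling DataTable.Select (a linear scan plus expression parsing) for every input row. This is an illustration of the idea, not the Lookup component itself; the column order assumes the recordset holds {distinguishedName, attribute_name, attribute_value}, as implied by foundRowsHost[0][2] in the original code.

using System.Collections.Generic;
using System.Data;

public class AttributeIndex
{
    private readonly Dictionary<string, string> index = new Dictionary<string, string>();

    //Build the index once (e.g. in PreExecute), keyed on
    //{distinguishedName, attribute_name}; '\n' is assumed not to occur in either
    public AttributeIndex(DataTable table)
    {
        foreach (DataRow row in table.Rows)
        {
            index[row[0].ToString() + "\n" + row[1].ToString()] = row[2].ToString();
        }
    }

    //Per input row: an O(1) hash lookup instead of scanning 2 million rows
    public bool ShouldProcess(string distinguishedName, string attributeName, string attributeValue)
    {
        string storedValue;
        if (!index.TryGetValue(distinguishedName + "\n" + attributeName, out storedValue))
        {
            return true; //row not found: new property, insert it
        }
        return storedValue != attributeValue; //changed value: update it
    }
}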
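
For step 4, a minimal hashing sketch; SHA-256 and the hex encoding are arbitrary choices here. Storing a fixed 64-character hash next to (or instead of) the nvarchar(max) value lets the comparison and the lookup work on short strings:

using System;
using System.Security.Cryptography;
using System.Text;

public static class AttributeHasher
{
    //Returns a fixed-length (64 hex characters) SHA-256 digest of the value,
    //so change detection compares short strings instead of nvarchar(max)
    public static string Hash(string attributeValue)
    {
        using (SHA256 sha = SHA256.Create())
        {
            byte[] digest = sha.ComputeHash(Encoding.UTF8.GetBytes(attributeValue ?? string.Empty));
            return BitConverter.ToString(digest).Replace("-", string.Empty);
        }
    }
}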

Option B.

Check whether the source that reads from Active Directory is fast by itself. (Just run the data flow with that source alone, without any destination, to measure its performance.) If you are satisfied with its performance, and if you have no objections against deleting everything from the ad_User table, just delete and repopulate those 2 million rows every day. Reading everything from AD and writing it into SQL Server in the same data flow, without any change detection, might actually be the simplest and fastest option.
