

Most efficient way to remove characters from a string of integers and decimals

I'm processing raw US Census data into a SQL Server database. The tar file, when extracted, yields a little over 14,000 CSV files that need to be processed into 266 different database tables. I have to loop over each CSV file and add a header row to it so SSIS can ETL the raw data into the target SQL Server table.

Each CSV file's first 6 columns are exactly the same. The remaining columns differ from file to file. The data in the remaining columns are mostly numeric values (integers and decimals). However, the Census Bureau inserts character codes called 'jam' values that indicate why a value is missing. I need to replace these jam values with null or an empty string because the target database table columns are DECIMAL and jam values cause the SSIS insert to fail.

So, I have a C# (.NET Core) class library looping over the 14K files. For each file I have to do the following:

  1. Create a StringBuilder variable.
  2. Append the row header to the StringBuilder so SSIS works.
  3. Loop over each row in the file.
  4. For each row, split out the first 6 columns, because I need those strings in the target table. Then split out the remaining columns, because I have to remove the jam values and leave only numeric data.
  5. Combine the first 6 columns and the cleaned data back into a row.
  6. Append the newly cleaned row to the StringBuilder.
  7. After looping through all rows, write the StringBuilder to the destination folder from which SSIS will load it into the database.

I have 3 nested loops:

  1. Loop over the 14,000 files.
  2. For each file, loop over each row.
  3. For each row, loop over the columns, removing characters.

Here's my code for processing each file:

    private static Boolean BuildCensusDataFileWithHeader(String censusDataFilePath, String rowHeader, String censusDataDestinationFilePath)
    {
        try
        {
            // BUILD NEW FILE WITH HEADER
            StringBuilder currentContent = new StringBuilder();
            currentContent.Append(rowHeader + Environment.NewLine);

            //RETRIEVE ALL LINES IN TARGET FILE
            List<String> rawList = File.ReadAllLines(censusDataFilePath).ToList();

            // LOOP THROUGH EACH LINE AND REMOVE ANY STRINGS IN COLUMNS AFTER COLUMN 6
            // NOTE: COLUMNS 1-6 CONTAINS STRINGS NEEDED IN DATABASE
            foreach (var row in rawList)
            {
                //TURN COMMA DELIMITED ROW OF DATA INTO ARRAY
                String[] rowArray = row.Split(",");

                // PEEL OFF FIRST 6 COLUMNS TO BE KEPT AS IS
                IList<String> goodStrings = rowArray.Take(6).ToList();

                // RETRIEVE REMAINING COLUMNS TO BE CLEANED OF STRINGS
                IList<String> stringsToNullList = rowArray.Skip(6).ToList();

                // REMOVE ALL STRINGS
                stringsToNullList.OnlyDecimalValues();

                // PUT GOOD COLUMNS AND CLEANED COLUMNS BACK TOGETHER AS A ROW
                var cleanedRow = $"{String.Join(",", goodStrings)},{String.Join(",", stringsToNullList)}";

                // APPEND ROW TO NEW DOCUMENT TO BE WRITTEN TO TARGET DIRECTORY CONTAINING CLEANED DATA
                currentContent.Append(cleanedRow + Environment.NewLine);
            }

            File.WriteAllText(censusDataDestinationFilePath, currentContent.ToString());

            return true;
        }
        catch (Exception ee)
        {
            string temp = ee.Message;
            return false;
        }
    }

Here are my extension methods, which replace non-numeric values with an empty string:

    public static void OnlyDecimalValues(this IList<String> stringToClean)
    {
        for (int i = 0; i < stringToClean.Count; ++i)
        {
            stringToClean[i] = (stringToClean[i].IsDecimal()) ? stringToClean[i] : "";
        }
    }

    public static bool IsDecimal(this string text)
    {
        decimal test;
        return decimal.TryParse(text, out test);
    }

This all works, but through brute-force programming. Is there a more efficient way to do this?

Thank you for your time.

I have two suggestions to speed it up. First, since you don't do anything with the parsed decimal value, you can use a regular expression to check whether the string is a well-formed number. It is faster than using TryParse: I timed both with a Stopwatch, and the regex performed slightly better for "false" cases and significantly better for "true" cases. So the IsDecimal method would become:

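    // Optional minus sign, integer part without leading zeros, optional fraction.
    // Built once and reused; requires "using System.Text.RegularExpressions;".
    private static readonly Regex DecimalPattern =
        new Regex(@"^-?(0|[1-9]\d*)(\.\d+)?$", RegexOptions.Compiled);

    // Kept as an extension method so existing calls like text.IsDecimal() still compile.
    public static bool IsDecimal(this string text)
    {
        return DecimalPattern.IsMatch(text);
    }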
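One caveat worth checking before swapping the two: the pattern is stricter than `decimal.TryParse`, which by default also accepts surrounding whitespace, a leading `+`, a bare leading decimal point, and leading zeros. The sketch below compares the two checks on a few sample inputs. The class name `DecimalCheckDemo` and the sample values are illustrative assumptions (they are not taken from the census files), and `TryParse` is pinned to the invariant culture here so the result does not depend on machine settings.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

public static class DecimalCheckDemo
{
    // Same pattern as in the answer above, compiled once.
    private static readonly Regex DecimalPattern =
        new Regex(@"^-?(0|[1-9]\d*)(\.\d+)?$", RegexOptions.Compiled);

    public static bool RegexIsDecimal(string text) => DecimalPattern.IsMatch(text);

    // NumberStyles.Number is what decimal.TryParse(string, out decimal) uses by default.
    public static bool ParseIsDecimal(string text) =>
        decimal.TryParse(text, NumberStyles.Number, CultureInfo.InvariantCulture, out _);

    public static void Main()
    {
        // Hypothetical inputs; "N" stands in for a non-numeric jam value.
        string[] samples = { "12", "-3.50", " 5", "+5", ".5", "007", "N", "" };
        foreach (var s in samples)
        {
            Console.WriteLine($"'{s}': regex={RegexIsDecimal(s)}, TryParse={ParseIsDecimal(s)}");
        }
    }
}
```

If the data only ever contains plain numbers and jam codes, the two checks agree; if fields can carry padding or a leading `+`, the regex version would blank values that `TryParse` keeps.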

The second suggestion is to replace the conditional assignment with a plain if block, so that values that are already numeric are not reassigned. So, this line:

    stringToClean[i] = (stringToClean[i].IsDecimal()) ? stringToClean[i] : "";

would become this:

    if (!stringToClean[i].IsDecimal())
    {
        stringToClean[i] = "";
    }
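Taking the same idea one step further, the row can be cleaned in place, touching only the fields that fail the check, and the file can be streamed instead of held in memory. The following is a minimal sketch rather than the original poster's code: `RowCleaner`, `CleanRow`, and `CleanLines` are hypothetical names, it assumes fields never contain quoted commas, and it uses `decimal.TryParse` with the invariant culture as the numeric test (the regex check could be substituted).

```csharp
using System;
using System.Globalization;
using System.IO;

public static class RowCleaner
{
    // Number of leading columns kept as-is (taken from the question).
    private const int KeptColumns = 6;

    // Blanks every non-numeric field after the first six columns.
    public static string CleanRow(string row)
    {
        string[] fields = row.Split(',');
        for (int i = KeptColumns; i < fields.Length; i++)
        {
            bool isNumeric = decimal.TryParse(
                fields[i], NumberStyles.Number, CultureInfo.InvariantCulture, out _);
            if (!isNumeric)
            {
                fields[i] = "";
            }
        }
        return string.Join(",", fields);
    }

    // Writes the header, then each cleaned row, one line at a time.
    public static void CleanLines(TextReader input, TextWriter output, string rowHeader)
    {
        output.WriteLine(rowHeader);
        string line;
        while ((line = input.ReadLine()) != null)
        {
            output.WriteLine(CleanRow(line));
        }
    }
}
```

For real files, `File.OpenText(sourcePath)` and `new StreamWriter(destinationPath)` could serve as the reader and writer, which avoids both the `ReadAllLines` list and the per-file StringBuilder; whether that is measurably faster for these files would need to be timed.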

I recommend revisiting the process design and balancing the work between SQL and SSIS. Use SSIS to loop through all files in the folder and load the raw text rows into a newly created raw table. Then use SQL code to do the rest of the processing. You can use the CHARINDEX or PATINDEX functions to split the raw rows. One benefit of SQL would be a massive reduction in runtime, because you would process the entire batch for a given file in a single transaction.

Another likely upside is that you may need only one raw table for all the different files, with three columns: id, fileName, and rawText. So the design would look something like:

Steps performed in SSIS

  • Create a StringBuilder variable. Append the row header to the StringBuilder so SSIS works. Loop over each row in the file.

Steps performed in SQL

  • Split out the first 6 columns to get the strings for the target table, and split out the remaining columns to remove the jam values, leaving numeric data. This can be done in a single SELECT statement using the PATINDEX or CHARINDEX functions combined with the REPLACE function to nullify the jam values.
