Most efficient way to remove characters from a string of integers and decimals
I'm processing raw US Census data into a SQL Server database. The tar file, when unzipped, yields a little over 14,000 CSV files that need to be processed into 266 different database tables. I have to loop over each CSV file and prepend a header to the file so SSIS can ETL the raw data into a targeted SQL Server table.
Each CSV file's first 6 columns are exactly the same. The remaining columns differ per file. The data in the remaining columns are mostly numeric values (integers and decimals). However, the Census Bureau adds characters called 'jam' values that represent why there is no value. I need to replace these jam values with null or an empty string, because the target database table columns are DECIMAL and jam values cause the SSIS insertion to fail.
So, I have a C# (.NET Core) class library looping over 14K files. For each file I have to do the following:
I have 3 nested loops:
Here's my code for looping over each file:
private static Boolean BuildCensusDataFileWithHeader(String censusDataFilePath, String rowHeader, String censusDataDestinationFilePath)
{
    try
    {
        // BUILD NEW FILE WITH HEADER
        StringBuilder currentContent = new StringBuilder();
        currentContent.Append(rowHeader + Environment.NewLine);
        // RETRIEVE ALL LINES IN TARGET FILE
        List<String> rawList = File.ReadAllLines(censusDataFilePath).ToList();
        // LOOP THROUGH EACH LINE AND REMOVE ANY STRINGS IN COLUMNS AFTER COLUMN 6
        // NOTE: COLUMNS 1-6 CONTAIN STRINGS NEEDED IN DATABASE
        foreach (var row in rawList)
        {
            // TURN COMMA DELIMITED ROW OF DATA INTO ARRAY
            String[] rowArray = row.Split(",");
            // PEEL OFF FIRST 6 COLUMNS TO BE KEPT AS IS
            IList<String> goodStrings = rowArray.Take(6).ToList();
            // RETRIEVE REMAINING COLUMNS TO BE CLEANED OF STRINGS
            IList<String> stringsToNullList = rowArray.Skip(6).ToList();
            // REMOVE ALL STRINGS
            stringsToNullList.OnlyDecimalValues();
            // PUT GOOD COLUMNS AND CLEANED COLUMNS BACK TOGETHER AS A ROW
            var cleanedRow = $"{String.Join(",", goodStrings)},{String.Join(",", stringsToNullList)}";
            // APPEND ROW TO NEW DOCUMENT TO BE WRITTEN TO TARGET DIRECTORY CONTAINING CLEANED DATA
            currentContent.Append(cleanedRow + Environment.NewLine);
        }
        File.WriteAllText(censusDataDestinationFilePath, currentContent.ToString());
        return true;
    }
    catch (Exception ee)
    {
        string temp = ee.Message;
        return false;
    }
}
Here are my extension methods, which replace non-numeric values with an empty string:
public static void OnlyDecimalValues(this IList<String> stringToClean)
{
    for (int i = 0; i < stringToClean.Count; ++i)
    {
        stringToClean[i] = (stringToClean[i].IsDecimal()) ? stringToClean[i] : "";
    }
}
public static bool IsDecimal(this string text)
{
    decimal test;
    return decimal.TryParse(text, out test);
}
This is all working through brute force programming. Is there a more efficient way to do this?
Thank you for your time.
I have two suggestions to speed it up. First, since you don't do anything with the parsed decimal value, you can use a regular expression to check whether the string contains only a number. In my measurements with a Stopwatch, this was slightly faster than TryParse for "false" cases and significantly faster for "true" cases. So, the IsDecimal method would become:
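// Declared as an extension method so that stringToClean[i].IsDecimal() still compiles.
public static bool IsDecimal(this string text)
{
    // Optional minus sign, integer part without leading zeros, optional fractional part.
    var regex = @"^-?(0|[1-9]\d*)(\.\d+)?$";
    return Regex.IsMatch(text, regex);
}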
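Note that this pattern is stricter than decimal.TryParse with its default settings: it rejects values such as ".5", "+5", "007", "1,000" and values with surrounding whitespace, all of which TryParse can accept. Whether that matters depends on what the Census files actually contain. The following is a minimal sketch of how the two checks could be compared with a Stopwatch. The class name DecimalCheck, the method names, and the choice to cache a compiled Regex in a static field are assumptions for illustration, not part of the original answer.

```csharp
using System.Diagnostics;
using System.Globalization;
using System.Text.RegularExpressions;

// Hypothetical helper class, not taken from the original post.
public static class DecimalCheck
{
    // Construct the regex once and reuse it for every value.
    private static readonly Regex DecimalPattern = new Regex(
        @"^-?(0|[1-9]\d*)(\.\d+)?$",
        RegexOptions.Compiled | RegexOptions.CultureInvariant);

    public static bool IsDecimalRegex(string text) => DecimalPattern.IsMatch(text);

    // Invariant culture so the result does not depend on the machine's locale.
    public static bool IsDecimalTryParse(string text) =>
        decimal.TryParse(text, NumberStyles.Number, CultureInfo.InvariantCulture, out _);

    // Times both checks over the same samples; returns elapsed milliseconds for each.
    public static (long RegexMs, long TryParseMs) Compare(string[] samples, int iterations)
    {
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
            foreach (var s in samples)
                IsDecimalRegex(s);
        long regexMs = sw.ElapsedMilliseconds;

        sw.Restart();
        for (int i = 0; i < iterations; i++)
            foreach (var s in samples)
                IsDecimalTryParse(s);
        long tryParseMs = sw.ElapsedMilliseconds;

        return (regexMs, tryParseMs);
    }
}
```

Calling DecimalCheck.Compare with a handful of representative column values (both numeric ones and real jam values from the files) and a large iteration count gives a rough comparison on your own data. Timings vary by machine and input mix, so it is worth measuring before committing to either approach.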
The second suggestion is to replace the conditional expression with a plain if block, so that values that are already valid are not reassigned. So, this line:
    stringToClean[i] = (stringToClean[i].IsDecimal()) ? stringToClean[i] : "";
would become this:
    if (!stringToClean[i].IsDecimal())
    {
        stringToClean[i] = "";
    }
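Put together, the two suggestions might look like the sketch below. The class name CensusStringExtensions is hypothetical (the question does not show the containing class), and holding the Regex in a static field is an assumption made here to avoid re-resolving the pattern on every call.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical container class; extension methods must live in a static class.
public static class CensusStringExtensions
{
    // Same pattern as in the answer, constructed once.
    private static readonly Regex DecimalPattern =
        new Regex(@"^-?(0|[1-9]\d*)(\.\d+)?$");

    public static bool IsDecimal(this string text) => DecimalPattern.IsMatch(text);

    // Blanks out every entry that is not a plain integer or decimal.
    public static void OnlyDecimalValues(this IList<string> stringToClean)
    {
        for (int i = 0; i < stringToClean.Count; ++i)
        {
            // Only write when the value has to change.
            if (!stringToClean[i].IsDecimal())
            {
                stringToClean[i] = "";
            }
        }
    }
}
```

For example, calling OnlyDecimalValues() on a list containing "12.5", "N", "-3" leaves "12.5" and "-3" untouched and replaces "N" with an empty string.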
I recommend revisiting the process design. Use SQL and SSIS in the right balance. Use SSIS to loop through all files in the folder and load the raw text rows into a newly created raw table. Then use SQL code to do the rest of the processing. You can use the charindex or patindex functions to split the raw rows. One benefit of SQL would be a massive reduction in runtime, because you will be processing the entire batch for a given file in a single transaction.
Another likely upside is that you may only need to create one raw table for all the different files, with three columns: id, fileName, rawText. So the design would look something like:
Steps performed in SSIS
Steps performed in SQL
Split off the first 6 columns to get the strings for the target table, and split the remaining columns to remove the jam values, in a single select statement that uses the patindex or charindex functions combined with the replace function to nullify jam values, leaving only numeric data.