Check if list contains a string that matches closely

I'm trying to figure out the most efficient way to implement the following scenario:

I have a list like this:

public static IEnumerable<string> ValidTags = new List<string> {
  "ABC.XYZ",
  "PQR.SUB.UID",
  "PQR.ALI.OBD",
};

I have a huge CSV with multiple columns. One of the columns is tags. This column contains either blank values or one of the values above. The problem is that the tag column may contain values like "ABC.XYZ?@", i.e. a valid tag plus some extraneous characters. I need to update such cells with the valid tag, since they "closely match" one of our valid tags.

Example:

  • if the CSV contains PQR.ALI.OBD?, update it with the valid tag PQR.ALI.OBD
  • if the CSV contains PQR.ALI.OBA, this is invalid; just add the suffix invalid and update it to PQR.ALI.OBA-invalid.

I'm trying to figure out the best possible way to do this.

My current approach is:

  1. Iterate through each row in the CSV and get the tagValue
  2. Check whether the tagValue contains any of the strings from the list
  3. If it contains one but is not exactly the same, update it with the value it contains.
  4. If it doesn't "contain" any value from the list, add the suffix -invalid.

Is there any better/more efficient way to do this?

Update:

The list has only 5 items; I have shown three here. The extra chars are only at the end, and that's happening because people are editing those CSVs in the web version of Excel, which messes up some entries.

My current code: (I'm sure there is a better way to do this, and I'm new to C#, so please tell me how I can improve it.) I'm using CsvHelper to get the CSV cells.

var record = csv.GetRecord<Record>();
string tag = csv.GetField(10); //tag column number in CSV is 10
/* Criteria for validation:
* tag matches our list, but has extraneous chars - strip extraneous chars and update csv
* tag doesn't match our list - add suffix invalid.*/
int listIndex = 0;
bool valid;
foreach (var validTags in ValidTags) //ValidTags is the list above
{
    if (validTags.Contains(tag.ToUpper()) && !string.Equals(validTags, subjectIdentifier.ToUpper()))
    {
     valid = true;
     continue; //move on to next csv row.
    //this means that tag is valid but has some extra characters appended to it because of web excel, strip extra chars

    }
    listIndex++; 
    if(listIndex == 3 && !valid) { 
     //means we have reached the end of the list but not found valid tag 
     //add suffix invalid and move on to next csv row
    }
}

Since you say that the extra characters are only at the end, and assuming that the original tag is still present before the extra characters, you could search the list for each tag value to see whether it contains an entry from the list. If it does, update it to the correct entry when it's not an exact match; if it doesn't, append the "-invalid" suffix to it.

Before doing this, we may need to sort the list in descending order so that the search finds the closest (longest) match first, in the case where one item in the list begins with another item in the list.
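
To illustrate why the ordering matters, here is a minimal sketch with a hypothetical pair of tags, `ABC` and `ABC.DEF` (these are not from the question's list). The `TagOrderingDemo` class and `FirstMatch` helper are names made up for this example, and it sorts by length rather than alphabetically, which is an assumption on my part and just a more direct way of saying "longest first".

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class TagOrderingDemo
{
    // Hypothetical tags where one is a prefix of the other.
    public static readonly List<string> Tags = new List<string> { "ABC", "ABC.DEF" };

    // The same tags with the longest entry first.
    public static readonly List<string> LongestFirst =
        Tags.OrderByDescending(t => t.Length).ToList();

    // Returns the first tag (in the order given) that the value contains, or null.
    public static string FirstMatch(IEnumerable<string> tags, string value)
    {
        return tags.FirstOrDefault(t =>
            value.IndexOf(t, StringComparison.OrdinalIgnoreCase) >= 0);
    }
}
```

With the unsorted list, `FirstMatch(TagOrderingDemo.Tags, "ABC.DEF?@")` stops at `ABC`, because that entry comes first and is contained in the value. With `LongestFirst`, the same call returns `ABC.DEF`.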

var csvPath = @"f:\public\temp\temp.csv";
var entriesUpdated = 0;

// Order the list so we match on the most similar match (ABC.DEF before ABC).
// ToList() materializes the sorted order once instead of re-sorting on every lookup.
var orderedTags = ValidTags.OrderByDescending(t => t).ToList();
var newFileLines = new List<string>();

// Read each line in the file
foreach (var csvLine in File.ReadLines(csvPath))
{
    // Get the columns
    var columns = csvLine.Split(',');

    // Process each column
    for (int index = 0; index < columns.Length; index++)
    {
        var column = columns[index];

        switch (index)
        {
            case 0: // tag column
                // Blank tags are allowed, so leave them untouched
                if (string.IsNullOrWhiteSpace(column)) break;

                var correctTag = orderedTags.FirstOrDefault(tag =>
                    column.IndexOf(tag, StringComparison.OrdinalIgnoreCase) > -1);

                if (correctTag != null)
                {
                    // This item contains a correct tag, so 
                    // update it if it's not an exact match
                    if (column != correctTag)
                    {
                        columns[index] = correctTag;
                        entriesUpdated++;
                    }
                }
                else
                {
                    // This column does not contain a correct tag, so mark it as invalid
                    columns[index] += "-invalid";
                    entriesUpdated++;
                }

                break;

            // Other cases for other columns follow if needed
        }
    }

    newFileLines.Add(string.Join(",", columns));
}

// Write the new lines if any were changed
if (entriesUpdated > 0) File.WriteAllLines(csvPath, newFileLines);
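
Because the update says the extra characters only ever appear at the end, a prefix test is a slightly stricter alternative to the substring search above. The following is a sketch under that assumption, not a definitive implementation: the `TagNormalizer` class and its `Normalize` method are hypothetical names, it leaves blank cells untouched (assumed from the question saying blanks are allowed), and it trims surrounding whitespace before comparing.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class TagNormalizer
{
    // The three tags shown in the question (the real list has five).
    private static readonly List<string> ValidTags = new List<string>
    {
        "ABC.XYZ",
        "PQR.SUB.UID",
        "PQR.ALI.OBD",
    };

    // Longest first, so a tag that is a prefix of another cannot shadow it.
    private static readonly List<string> OrderedTags =
        ValidTags.OrderByDescending(t => t.Length).ToList();

    public static string Normalize(string cell)
    {
        // Assumption: blank cells are legitimate and stay as they are.
        if (string.IsNullOrWhiteSpace(cell)) return cell;

        var trimmed = cell.Trim();

        // A value is accepted if it starts with a valid tag (case-insensitive).
        var match = OrderedTags.FirstOrDefault(t =>
            trimmed.StartsWith(t, StringComparison.OrdinalIgnoreCase));

        // Valid tag, possibly with trailing junk: return the canonical tag.
        // Otherwise flag the trimmed value with the "-invalid" suffix.
        return match ?? trimmed + "-invalid";
    }
}
```

Using the question's examples, `TagNormalizer.Normalize("PQR.ALI.OBD?")` gives `PQR.ALI.OBD` and `TagNormalizer.Normalize("PQR.ALI.OBA")` gives `PQR.ALI.OBA-invalid`. Note that any value that merely starts with a valid tag is collapsed to that tag, so this only fits if no legitimate value extends a valid tag.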
