
Check if list contains a string that matches closely

I'm trying to figure out the most efficient way to implement the following scenario:

I have a list like this:

public static IEnumerable<string> ValidTags = new List<string> {
  "ABC.XYZ",
  "PQR.SUB.UID",
  "PQR.ALI.OBD",
};

I have a huge CSV with multiple columns. One of the columns is tags. This column contains either blank values or one of the above values. The problem is that the tag column may contain values like "ABC.XYZ?@", i.e. a valid tag plus some extraneous characters. I need to update such cells with the valid tag, since they "closely match" one of our valid tags.

Example:

  • if the CSV contains PQR.ALI.OBD?, update it with the valid tag PQR.ALI.OBD.
  • if the CSV contains PQR.ALI.OBA, this is invalid; just add the suffix -invalid and update it to PQR.ALI.OBA-invalid.

I'm trying to figure out the best possible way to do this.

My current approach is:

  1. Iterate through each row in the CSV and get the tagValue.
  2. Check whether tagValue contains any of the strings from the list.
  3. If it contains one but is not exactly the same, update it with the value it contains.
  4. If it doesn't "contain" any value from the list, add the suffix -invalid.
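
As a rough sketch, steps 2 to 4 could be written as a single helper. The names TagFixer and FixTag are hypothetical, the list is the shortened one shown above, and leaving blank cells untouched is an assumption based on the tag column being allowed to hold blank values.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class TagFixer
{
    // Shortened copy of the list from the question
    public static readonly List<string> ValidTags = new List<string>
    {
        "ABC.XYZ",
        "PQR.SUB.UID",
        "PQR.ALI.OBD",
    };

    // FixTag is a hypothetical helper name; it is not part of CsvHelper
    public static string FixTag(string tagValue)
    {
        // Assumption: blank cells are legal and should be left as they are
        if (string.IsNullOrWhiteSpace(tagValue)) return tagValue;

        // Step 2: does the cell contain any string from the list?
        var match = ValidTags.FirstOrDefault(valid =>
            tagValue.IndexOf(valid, StringComparison.OrdinalIgnoreCase) >= 0);

        // Step 3: it does, so return the clean tag (a no-op on an exact match)
        if (match != null) return match;

        // Step 4: nothing matched, so add the suffix
        return tagValue + "-invalid";
    }
}
```

With this sketch, TagFixer.FixTag("PQR.ALI.OBD?") returns "PQR.ALI.OBD" and TagFixer.FixTag("PQR.ALI.OBA") returns "PQR.ALI.OBA-invalid".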

Is there any better/more efficient way to do this?

Update:

The list has only 5 items; I have shown three here. The extra characters appear only at the end, and that happens because people edit those CSVs in the web version of Excel, which messes up some entries.

My current code (I'm sure there is a better way to do this; I'm also new to C#, so please tell me how I can improve it). I'm using CsvHelper to get the CSV cells.

var record = csv.GetRecord<Record>();
string tag = csv.GetField(10); // the tag column is field 10 in the CSV
/* Criteria for validation:
 * tag matches our list, but has extraneous chars - strip extraneous chars and update csv
 * tag doesn't match our list - add suffix invalid. */
bool valid = false;
foreach (var validTag in ValidTags) // ValidTags is the list above
{
    // The cell value must contain the valid tag, not the other way around
    if (tag.ToUpper().Contains(validTag))
    {
        valid = true;
        if (!string.Equals(validTag, tag))
        {
            // tag is valid but has extra characters appended by web Excel; strip them
            tag = validTag;
        }
        break; // stop searching the list
    }
}
if (!valid)
{
    // reached the end of the list without finding a valid tag: add the suffix
    tag += "-invalid";
}

Since you say that the extra characters are only at the end, and assuming that the original tag is still present before the extra characters, you could search the list for each tag value to see whether it contains an entry from the list. If it does, update it to the correct entry when it's not an exact match; if it doesn't, append the "-invalid" suffix to it.

Before doing this, we may need to sort the list in descending order so that the search finds the closest (longest) match first, in case one item in the list begins with another item in the list.

var csvPath = @"f:\public\temp\temp.csv";
var entriesUpdated = 0;

// Order the list so we match on the most similar match (ABC.DEF before ABC);
// ToList() runs the sort once instead of re-sorting for every CSV line
var orderedTags = ValidTags.OrderByDescending(t => t).ToList();
var newFileLines = new List<string>();

// Read each line in the file
foreach (var csvLine in File.ReadLines(csvPath))
{
    // Get the columns
    var columns = csvLine.Split(',');

    // Process each column
    for (int index = 0; index < columns.Length; index++)
    {
        var column = columns[index];

        switch (index)
        {
            case 0: // tag column
                var correctTag = orderedTags.FirstOrDefault(tag =>
                    column.IndexOf(tag, StringComparison.OrdinalIgnoreCase) > -1);

                if (correctTag != null)
                {
                    // This item contains a correct tag, so 
                    // update it if it's not an exact match
                    if (column != correctTag)
                    {
                        columns[index] = correctTag;
                        entriesUpdated++;
                    }
                }
                else
                {
                    // This column does not contain a correct tag, so mark it as invalid
                    columns[index] += "-invalid";
                    entriesUpdated++;
                }

                break;

            // Other cases for other columns follow if needed
        }
    }

    newFileLines.Add(string.Join(",", columns));
}

// Write the new lines if any were changed
if (entriesUpdated > 0) File.WriteAllLines(csvPath, newFileLines);


 