
Count number of tabs per line of text file

I am trying to verify there is a correct amount of tabs in a large text file (8,000,000+ rows) before importing to SQL Server.

I think I need to do something like this:

int count = 0;
char tab = "\t";
foreach(char c in tab) 
{
    if(char.IsTab(c)) 
    {
        count++;
    }
}

However, this is not correct. I need to do this to verify that the file is in the proper format.

With LINQ you can get your bad rows like this:

using System;
using System.Collections.Generic;
using System.Linq;

public class Program
{
    public static void Main()
    {
        int expectedNumberOfTabs = 5;

        List<string> rows = new List<string>
        {
            "col1 \t col2 \t col3 \t col4 \t col5 \t col6",
            "col1 \t col2 \t col3 \t col4 \t col5 \t col6",
            "col1 \t col2 \t col3 \t col4",
            "col1 \t col2 \t col3 \t col4 \t col5 \t col6 \t col7",
            "col1 \t col2 \t col3 \t col4 \t col5 \t col6",
            "col1 \t col2 \t col3 \t col4 \t col5",
            "col1 \t col2 \t col3 \t col4 \t col5 \t col6",
        };

        var badRows = rows.Where(row => row.Count(c => c == '\t') != expectedNumberOfTabs);
        foreach (var badRow in badRows)
        {
            // Fix the bad rows
            Console.WriteLine(badRow);
        }
    }
}

Results:

col1      col2      col3      col4
col1      col2      col3      col4      col5      col6      col7
col1      col2      col3      col4      col5

Now I don't expect you to read all 8,000,000+ rows into memory at once. I think you'd read them in one row at a time and deal with them one at a time, so the line from this snippet that you're really interested in is:

row.Count(c => c == '\t') != expectedNumberOfTabs

Which will identify a "bad" row for you to fix.
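For example, here is a minimal streaming sketch along those lines (the file name MyFile.txt and the console reporting are assumptions for illustration): File.ReadLines enumerates the file lazily, so only one line is held in memory at a time.

using System;
using System.IO;
using System.Linq;

public class Program
{
    public static void Main()
    {
        int expectedNumberOfTabs = 5;
        int lineNumber = 0;

        // File.ReadLines enumerates the file lazily, one line at a time
        foreach (string row in File.ReadLines("MyFile.txt"))
        {
            lineNumber++;
            int tabCount = row.Count(c => c == '\t');
            if (tabCount != expectedNumberOfTabs)
            {
                // Report (or fix) the bad row here
                Console.WriteLine($"Line {lineNumber}: expected {expectedNumberOfTabs} tabs, found {tabCount}");
            }
        }
    }
}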

Sample Approach

Because you're dealing with a massive amount of data you may want to try copying the lines from the file to a new file, fixing bad lines as you run across them. Once you have your new "fixed" file, delete the original file, and then rename the "fixed" file back to your original file and import it into your database.

using System.IO;
using System.Linq;

public class Program
{
    public static void Main()
    {
        int expectedNumberOfTabs = 5;
        string originalFile = "MyFile.txt";
        string originalFileFixed = "MyFileFixed.txt";

        using (StreamReader sr = new StreamReader(originalFile))
        using (StreamWriter sw = new StreamWriter(originalFileFixed))
        {
            string line;
            // Read one line at a time so the whole file is never held in memory
            while ((line = sr.ReadLine()) != null)
            {
                if (line.Count(c => c == '\t') != expectedNumberOfTabs)
                {
                    // line = ...Fix the line
                }

                sw.WriteLine(line);
            }
        }

        // Delete original file
        File.Delete(originalFile);
        // Rename the fixed file back to the original file
        File.Move(originalFileFixed, originalFile);

        // Import the file
    }
}

With such a large amount of data, you want to avoid loading the whole file into memory at once. Here is a solution that loads only one line of the file into memory at a time and counts the tabs in that line. The results are saved to an int[], where each item in the array contains the count of tabs on the corresponding line.

int[] counts = File.ReadLines("myfile.txt")
    .Select(l => l.Count(c => c == '\t'))
    .ToArray();

UPDATE

If you simply want the total count of all tabs in the file, then you could do:

int sum = File.ReadLines("myfile.txt")
    .Select(l => l.Count(c => c == '\t'))
    .Sum();

This seems to work for me:

int count = 0;
string tab = "te\tst\t\t\t";

foreach(char c in tab.ToCharArray()) 
{
    if (c == '\t')      // there is no char.IsTab() method
    {
        count++;
    }
}
Console.WriteLine(count);

Gives me this result:

4

Your original code was not working because you declared tab as a char, which can't be iterated (and "\t" is a string literal, so it won't even assign to a char). I changed it to a string and iterated over the string as an array of chars.

This may not be the optimal way, but it is a working way based on your original code.
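For comparison, assuming System.Linq is available, the same count can also be written without an explicit loop:

using System;
using System.Linq;

public class Program
{
    public static void Main()
    {
        string tab = "te\tst\t\t\t";

        // Count() with a predicate counts only the matching characters
        int count = tab.Count(c => c == '\t');
        Console.WriteLine(count);   // 4
    }
}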

If you must do it before uploading, you could use a StreamReader so you don't load the entire file into a string. But I guess it could be slow. Maybe you could divide the file into equal chunks and have separate threads process them (see the sketch after the sequential example below).

Here's a sequential solution:

int count = 0;
using (StreamReader sr = new StreamReader(@"c:\temp\file.txt"))
{
    string line;
    while ((line = sr.ReadLine()) != null)
    {
        count += line.Count(f => f == '\t');
    }
}
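As for the chunking idea above, splitting the file into equal byte ranges is awkward because chunk boundaries will fall in the middle of lines. A simpler sketch that still uses several threads is to read the lines sequentially and let PLINQ spread the per-line counting across worker threads (the path is just the example path from above):

using System;
using System.IO;
using System.Linq;

public class Program
{
    public static void Main()
    {
        // AsParallel() hands the lines to PLINQ, which partitions the counting work across threads
        int totalTabs = File.ReadLines(@"c:\temp\file.txt")
            .AsParallel()
            .Sum(line => line.Count(c => c == '\t'));

        Console.WriteLine(totalTabs);
    }
}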
