Counting/sorting characters in a text file

I am trying to write a program that reads a text file, sorts it by character, and keeps track of how many times each character appears in the document. This is what I have so far.

class Program
{
    static void Main(string[] args)
    {
        CharFrequency[] Charfreq = new CharFrequency[128];

        try
        {            
        string line;
        System.IO.StreamReader file = new System.IO.StreamReader(@"C:\Users\User\Documents\Visual Studio 2013\Projects\Array_Project\wap.txt");
        while ((line = file.ReadLine()) != null)
        {
            int ch = file.Read();

            if (Charfreq.Contains(ch))
            {

            }     
        }

        file.Close();

        Console.ReadLine();
        }
        catch (Exception e)
        {
            Console.WriteLine("The process failed: {0}", e.ToString());
        }
    }
}

My question is, what should go in the if statement here?

I also have a CharFrequency class, which I'll include here in case it is helpful or necessary (and yes, I have to use an array rather than a List or ArrayList).

public class CharFrequency
{
    private char m_character;
    private long m_count;

    public CharFrequency(char ch)
    {
        Character = ch;
        Count = 0;
    }

    public CharFrequency(char ch, long charCount)
    {
        Character = ch;
        Count = charCount;
    }

    public char Character
    {
        set
        {
            m_character = value;
        }

        get
        {
            return m_character;
        }
    }

    public long Count
    {
        get
        {
            return m_count;
        }
        set
        {
            if (value < 0)
                value = 0;

            m_count = value;
        }
    }

    public void Increment()
    {
        m_count++;

    }

    public override bool Equals(object obj)
    {
        bool equal = false;
        CharFrequency cf = new CharFrequency('\0', 0);

        cf = (CharFrequency)obj;

        if (this.Character == cf.Character)
            equal = true;

        return equal;
    }

    public override int GetHashCode()
    {
        return m_character.GetHashCode();
    }

    public override string ToString()
    {
        String s = String.Format("'{0}' ({1})     = {2}", m_character, (byte)m_character, m_count);

        return s;
    }

}

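You shouldn't use Contains: it can only tell you whether an element is present, while you need the element itself in order to increment its count.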

First you need to initialize your Charfreq array:

CharFrequency[] Charfreq = new CharFrequency[128];

for (int i = 0; i < Charfreq.Length; i++)
{
    Charfreq[i] = new CharFrequency((char)i);
}

try // the original try block starts here

Then you can read the file one character at a time:

int ch;

// -1 means that there are no more characters to read,
// otherwise ch is the char read
while ((ch = file.Read()) != -1)
{
     CharFrequency cf = new CharFrequency((char)ch);

     // This works because CharFrequency overrides the
     // Equals method, and the Equals method compares only
     // the Character property of CharFrequency
     int ix = Array.IndexOf(Charfreq, cf);

     // if a matching CharFrequency was found
     if (ix != -1)
     {
         Charfreq[ix].Increment();
     }     
}

Note that this isn't the way I would write the program. These are the minimum changes needed to make your program work.

As a side note, this program only counts the "frequency" of ASCII characters (characters with code <= 127). Any other character is not found in the array and is silently skipped.
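Because the array is initialized so that index i holds the character with code i, the lookup can also be done by indexing directly instead of searching. The following is only a minimal sketch of that idea, not the approach from the answer above. It assumes a plain long[] in place of the CharFrequency array to stay self-contained, and the names AsciiCounter and Count are hypothetical. With the original array the increment would be Charfreq[ch].Increment().

```csharp
using System;
using System.IO;

public static class AsciiCounter
{
    // Counts ASCII characters (codes 0..127) read from any TextReader.
    // counts[c] holds the number of occurrences of the character with code c.
    public static long[] Count(TextReader reader)
    {
        var counts = new long[128];
        int ch;
        // Read() returns -1 when there are no more characters.
        while ((ch = reader.Read()) != -1)
        {
            if (ch < counts.Length) // skip anything outside ASCII
            {
                counts[ch]++;
            }
        }
        return counts;
    }

    public static void Main()
    {
        // A StringReader stands in for the StreamReader over the real file.
        long[] counts = Count(new StringReader("hello world"));
        for (int i = 0; i < counts.Length; i++)
        {
            if (counts[i] > 0)
            {
                Console.WriteLine("'{0}' ({1}) = {2}", (char)i, i, counts[i]);
            }
        }
    }
}
```

Since the array is walked in index order, the output is already sorted by character code, which covers the "sorts it by character" part of the question without a separate sorting step.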

In your Equals method, this is a useless initialization:

CharFrequency cf = new CharFrequency('\0', 0);

cf = (CharFrequency)obj;

This is enough:

CharFrequency cf = (CharFrequency)obj;

Otherwise you are creating a CharFrequency just to discard it on the next line.
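A related point about the same method: the direct cast throws an InvalidCastException when obj is not a CharFrequency, and a null argument leads to a NullReferenceException on the next line. One possible way to avoid both is the `as` operator, sketched below. The class here is a trimmed-down stand-in for the CharFrequency class from the question, and EqualsDemo is a hypothetical name used only for the demonstration.

```csharp
using System;

// Trimmed-down stand-in for the CharFrequency class in the question.
public class CharFrequency
{
    public char Character { get; set; }
    public long Count { get; set; }

    public CharFrequency(char ch)
    {
        Character = ch;
    }

    public override bool Equals(object obj)
    {
        // 'as' yields null instead of throwing when obj is null
        // or is not a CharFrequency.
        var other = obj as CharFrequency;
        return other != null && Character == other.Character;
    }

    public override int GetHashCode()
    {
        return Character.GetHashCode();
    }
}

public static class EqualsDemo
{
    public static void Main()
    {
        var a = new CharFrequency('a');
        Console.WriteLine(a.Equals(new CharFrequency('a'))); // same character
        Console.WriteLine(a.Equals("a"));                    // different type
        Console.WriteLine(a.Equals(null));                   // null argument
    }
}
```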

A dictionary is well suited for a task like this. You didn't say which character set and encoding the file is in. So, because Unicode is so common, let's assume the Unicode character set and the UTF-8 encoding. (After all, Unicode is the default character set for .NET, Java, JavaScript, HTML, XML,….) If that's not the case, read the file using the applicable encoding and fix your code, because your StreamReader is currently using UTF-8.
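For a file that is not UTF-8, StreamReader has a constructor overload that takes the encoding explicitly. The sketch below assumes, purely for illustration, a file encoded in ISO-8859-1 (Latin-1). EncodingSketch and ReadAll are hypothetical names, and the temporary file exists only to make the example self-contained.

```csharp
using System;
using System.IO;
using System.Text;

public static class EncodingSketch
{
    // Reads the whole file, decoding its bytes with the given encoding.
    public static string ReadAll(string path, Encoding encoding)
    {
        using (var reader = new StreamReader(path, encoding))
        {
            return reader.ReadToEnd();
        }
    }

    public static void Main()
    {
        string path = Path.GetTempFileName();
        try
        {
            // "café" as Latin-1 bytes: the final byte 0xE9 is not valid UTF-8 on its own.
            File.WriteAllBytes(path, new byte[] { 0x63, 0x61, 0x66, 0xE9 });

            Encoding latin1 = Encoding.GetEncoding("iso-8859-1");
            Console.WriteLine(ReadAll(path, latin1) == "caf\u00E9");
        }
        finally
        {
            File.Delete(path);
        }
    }
}
```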

Next comes iterating across the "characters" and incrementing the count for each "character" in the dictionary as it is seen in the text.

Unicode does have a few complex features. One is combining characters, where a base character can be overlaid with diacritics etc. Users view such combinations as one "character", or, as Unicode calls them, graphemes. Thankfully, .NET gives us the StringInfo class, which iterates over them as "text elements."
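To make the difference concrete, here is a small sketch that splits a string into text elements with StringInfo.GetTextElementEnumerator. The sample string is an assumption chosen for illustration: the letter "e" followed by U+0301 (combining acute accent) and then "a". TextElements and Split are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class TextElements
{
    // Splits a string into text elements (a base character plus any
    // combining characters that follow it stays together).
    public static List<string> Split(string text)
    {
        var result = new List<string>();
        TextElementEnumerator itor = StringInfo.GetTextElementEnumerator(text);
        while (itor.MoveNext())
        {
            result.Add(itor.GetTextElement());
        }
        return result;
    }

    public static void Main()
    {
        string text = "e\u0301a";
        Console.WriteLine(text.Length);       // number of UTF-16 code units
        Console.WriteLine(Split(text).Count); // number of text elements
    }
}
```

Counting by char would treat the accent as a separate "character", while counting by text element keeps it attached to its base letter.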

So, if you think about it, using an array would be quite difficult. You'd have to build your own dictionary on top of your array.
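To give a rough idea of what "your own dictionary on top of an array" could involve, here is a deliberately naive sketch: two parallel arrays, a linear search for each lookup, and doubling when the arrays fill up. ArrayCounter and its members are hypothetical names, and the linear search makes this much slower than a real Dictionary on large inputs.

```csharp
using System;

public sealed class ArrayCounter
{
    private string[] keys = new string[16];
    private long[] counts = new long[16];
    private int size;

    // Number of distinct keys seen so far.
    public int Size
    {
        get { return size; }
    }

    // Adds one to the count for key, inserting the key if it is new.
    public void Increment(string key)
    {
        for (int i = 0; i < size; i++)
        {
            if (keys[i] == key) // ordinal string comparison
            {
                counts[i]++;
                return;
            }
        }
        if (size == keys.Length) // arrays are full: double their capacity
        {
            Array.Resize(ref keys, size * 2);
            Array.Resize(ref counts, size * 2);
        }
        keys[size] = key;
        counts[size] = 1;
        size++;
    }

    // Returns the count for key, or 0 if the key was never seen.
    public long CountOf(string key)
    {
        for (int i = 0; i < size; i++)
        {
            if (keys[i] == key)
            {
                return counts[i];
            }
        }
        return 0;
    }

    public static void Main()
    {
        var counter = new ArrayCounter();
        foreach (string element in new[] { "a", "b", "a", "e\u0301" })
        {
            counter.Increment(element);
        }
        Console.WriteLine(counter.CountOf("a"));
        Console.WriteLine(counter.Size);
    }
}
```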

The example below uses a Dictionary and is runnable as a LINQPad script. After it creates the dictionary, it orders it and dumps it with a nice display.

var path = Path.GetTempFileName();
// Get some text we know is encoded in UTF-8 to simplify the code below
// and contains combining codepoints as a matter of example.
using (var web = new WebClient())
{
    web.DownloadFile("http://superuser.com/questions/52671/which-unicode-characters-do-smilies-like-%D9%A9-%CC%AE%CC%AE%CC%83-%CC%83%DB%B6-consist-of", path); 
}
// since the question asks to analyze a file
var content = File.ReadAllText(path, Encoding.UTF8); 
var frequency = new Dictionary<String, int>();
var itor = System.Globalization.StringInfo.GetTextElementEnumerator(content);
while (itor.MoveNext()) 
{
    var element = (String)itor.Current;
    if (!frequency.ContainsKey(element)) 
    {
        frequency.Add(element, 0);
    }
    frequency[element]++;
}
var histogram = frequency
    .OrderByDescending(f => f.Value)
    // jazz it up with the list of codepoints in each text element
    .Select(pair =>  
        {
            var bytes = Encoding.UTF32.GetBytes(pair.Key);
            var codepoints = new UInt32[bytes.Length/4];
            Buffer.BlockCopy(bytes, 0, codepoints, 0, bytes.Length);
            return new { 
                Count = pair.Value, 
                textElement = pair.Key, 
                codepoints = codepoints.Select(cp => String.Format("U+{0:X4}", cp) ) };
        });
histogram.Dump(); // For use in LINQPad
