Counting/sorting characters in a text file

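I'm trying to write a program that reads a text file, sorts it by character, and keeps track of how many times each character occurs in the document. This is what I have so far.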

class Program
{
    static void Main(string[] args)
    {
        CharFrequency[] Charfreq = new CharFrequency[128];

        try
        {            
        string line;
        System.IO.StreamReader file = new System.IO.StreamReader(@"C:\Users\User\Documents\Visual Studio 2013\Projects\Array_Project\wap.txt");
        while ((line = file.ReadLine()) != null)
        {
            int ch = file.Read();

            if (Charfreq.Contains(ch))
            {

            }     
        }

        file.Close();

        Console.ReadLine();
        }
        catch (Exception e)
        {
            Console.WriteLine("The process failed: {0}", e.ToString());
        }
    }
}

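My question is: what should go inside the if statement?

I also have a CharFrequency class, which I'll include here in case it is useful or necessary (and yes, I am required to use an array rather than a List or ArrayList).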

public class CharFrequency
{
    private char m_character;
    private long m_count;

    public CharFrequency(char ch)
    {
        Character = ch;
        Count = 0;
    }

    public CharFrequency(char ch, long charCount)
    {
        Character = ch;
        Count = charCount;
    }

    public char Character
    {
        set
        {
            m_character = value;
        }

        get
        {
            return m_character;
        }
    }

    public long Count
    {
        get
        {
            return m_count;
        }
        set
        {
            if (value < 0)
                value = 0;

            m_count = value;
        }
    }

    public void Increment()
    {
        m_count++;

    }

    public override bool Equals(object obj)
    {
        bool equal = false;
        CharFrequency cf = new CharFrequency('\0', 0);

        cf = (CharFrequency)obj;

        if (this.Character == cf.Character)
            equal = true;

        return equal;
    }

    public override int GetHashCode()
    {
        return m_character.GetHashCode();
    }

    public override string ToString()
    {
        String s = String.Format("'{0}' ({1})     = {2}", m_character, (byte)m_character, m_count);

        return s;
    }

}

You shouldn't use Contains.

First, you need to initialize the Charfreq array:

CharFrequency[] Charfreq = new CharFrequency[128];

for (int i = 0; i < Charfreq.Length; i++)
{
    Charfreq[i] = new CharFrequency((char)i);
}

try

Then you can do:

int ch;

// -1 means that there are no more characters to read,
// otherwise ch is the char read
while ((ch = file.Read()) != -1)
{
     CharFrequency cf = new CharFrequency((char)ch);

     // This works because CharFrequency overrides the
     // Equals method, and the Equals method compares only
     // the Character property of CharFrequency
     int ix = Array.IndexOf(Charfreq, cf);

     // if a matching CharFrequency was found
     if (ix != -1)
     {
         Charfreq[ix].Increment();
     }     
}

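Note that this is not how I would write the program. It is the minimum change needed to make your program work.

As a side note, this program only counts the "frequency" of ASCII characters (characters with code <= 127).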
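Because slot i of the array holds the CharFrequency for character (char)i, the search with Array.IndexOf can also be replaced by indexing directly with the character code. The following is only a minimal sketch of that idea under stated assumptions: the CharCounter class and its CountChars method are hypothetical names, and the CharFrequency shown is a trimmed stand-in for the class in the question.

```csharp
using System;
using System.IO;

// Trimmed stand-in for the CharFrequency class from the question.
public class CharFrequency
{
    public char Character { get; private set; }
    public long Count { get; private set; }

    public CharFrequency(char ch) { Character = ch; }

    public void Increment() { Count++; }
}

// Hypothetical helper, not part of the original code.
public static class CharCounter
{
    public static CharFrequency[] CountChars(TextReader reader)
    {
        // One slot per ASCII code, so slot i always holds character (char)i.
        var freq = new CharFrequency[128];
        for (int i = 0; i < freq.Length; i++)
        {
            freq[i] = new CharFrequency((char)i);
        }

        int ch;
        // Read() returns -1 once there is nothing left to read.
        while ((ch = reader.Read()) != -1)
        {
            // Characters with a code above 127 have no slot and are skipped.
            if (ch < freq.Length)
            {
                freq[ch].Increment();
            }
        }
        return freq;
    }

    public static void Main()
    {
        using (var reader = new StringReader("hello"))
        {
            CharFrequency[] freq = CountChars(reader);
            Console.WriteLine("l = {0}", freq['l'].Count); // prints "l = 2"
        }
    }
}
```

To count a real file, a StreamReader opened on that file can be passed in place of the StringReader, since both derive from TextReader.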

CharFrequency cf = new CharFrequency('\0', 0);

cf = (CharFrequency)obj;

This initialization is useless. The single line

CharFrequency cf = (CharFrequency)obj;

is enough; otherwise you create a CharFrequency only to discard it on the next line.

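A dictionary is well suited to a task like this. You didn't say which character set and encoding the file uses. Since Unicode is so common, let's assume the Unicode character set and the UTF-8 encoding. (After all, Unicode is the default character set for .NET, Java, JavaScript, HTML, XML, and so on.) If that is not the case, read the file with the applicable encoding and fix your code, because your StreamReader currently uses UTF-8.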
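If the file is not UTF-8, the encoding can be passed to the StreamReader constructor explicitly. The sketch below is illustrative only: EncodingDemo and ReadAll are hypothetical names, and a MemoryStream holding UTF-16 bytes stands in for a file whose real encoding is unknown here.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper, for illustration only.
public static class EncodingDemo
{
    public static string ReadAll(Stream stream, Encoding encoding)
    {
        // Passing the encoding explicitly replaces StreamReader's UTF-8 default.
        using (var reader = new StreamReader(stream, encoding))
        {
            return reader.ReadToEnd();
        }
    }

    public static void Main()
    {
        // Simulate a file saved as UTF-16 (little endian) rather than UTF-8.
        byte[] bytes = Encoding.Unicode.GetBytes("na\u00EFve");
        string text = ReadAll(new MemoryStream(bytes), Encoding.Unicode);
        Console.WriteLine(text.Length); // prints 5
    }
}
```

For a file on disk, the same idea applies with the StreamReader(string path, Encoding encoding) constructor.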

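Next comes iterating over the "characters" and incrementing the count in the dictionary for each "character" as it is seen in the text.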

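Unicode does have some complicating features. One is combining characters, where a base character can be overlaid with diacritics and the like. Users regard such a combination as a single "character" or, as Unicode calls it, a grapheme. Thankfully, .NET provides the StringInfo class, which iterates over them as "text elements".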
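To make the difference between chars and text elements concrete, here is a small sketch. TextElementDemo and CountTextElements are hypothetical names; the only library call relied on is StringInfo.GetTextElementEnumerator.

```csharp
using System;
using System.Globalization;

// Hypothetical helper, for illustration only.
public static class TextElementDemo
{
    public static int CountTextElements(string s)
    {
        int count = 0;
        // Each MoveNext() advances by one text element, not by one char.
        TextElementEnumerator itor = StringInfo.GetTextElementEnumerator(s);
        while (itor.MoveNext())
        {
            count++;
        }
        return count;
    }

    public static void Main()
    {
        // "e" followed by U+0301 COMBINING ACUTE ACCENT.
        string s = "e\u0301";
        Console.WriteLine(s.Length);             // prints 2 (UTF-16 code units)
        Console.WriteLine(CountTextElements(s)); // prints 1 (one text element)
    }
}
```

Counting by char would record the base letter and the accent as two separate entries, which is why the dictionary in this answer is keyed by string text elements.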

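So, if you think about it, using an array would be very difficult. You would have to build your own dictionary on top of the array.

The example below uses a Dictionary and can be run as a LINQPad script. After building the dictionary, it orders the entries and dumps them with a nice display.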

var path = Path.GetTempFileName();
// Get some text we know is encoded in UTF-8 to simplify the code below
// and contains combining codepoints as a matter of example.
using (var web = new WebClient())
{
    web.DownloadFile("http://superuser.com/questions/52671/which-unicode-characters-do-smilies-like-%D9%A9-%CC%AE%CC%AE%CC%83-%CC%83%DB%B6-consist-of", path); 
}
// since the question asks to analyze a file
var content = File.ReadAllText(path, Encoding.UTF8); 
var frequency = new Dictionary<String, int>();
var itor = System.Globalization.StringInfo.GetTextElementEnumerator(content);
while (itor.MoveNext()) 
{
    var element = (String)itor.Current;
    if (!frequency.ContainsKey(element)) 
    {
        frequency.Add(element, 0);
    }
    frequency[element]++;
}
var histogram = frequency
    .OrderByDescending(f => f.Value)
    // jazz it up with the list of codepoints in each text element
    .Select(pair =>  
        {
            var bytes = Encoding.UTF32.GetBytes(pair.Key);
            var codepoints = new UInt32[bytes.Length/4];
            Buffer.BlockCopy(bytes, 0, codepoints, 0, bytes.Length);
            return new { 
                Count = pair.Value, 
                textElement = pair.Key, 
                codepoints = codepoints.Select(cp => String.Format("U+{0:X4}", cp) ) };
        });
histogram.Dump(); // For use in LINQPad

