
Reading a text file word by word

I have a text file containing just lowercase letters and no punctuation except for spaces. I would like to know the best way of reading the file char by char, so that if the next char is a space, it signifies the end of one word and the start of a new word. That is, as each character is read it is added to a string; if the next char is a space, the word is passed to another method and the string is reset, until the reader reaches the end of the file.

I'm trying to do this with a StringReader, something like this:

public String GetNextWord(StringReader reader)
{
    String word = "";
    char c;
    do
    {
        c = Convert.ToChar(reader.Read());
        word += c;
    } while (c != ' ');
    return word;
}

and put the GetNextWord method in a while loop until the end of the file. Does this approach make sense, or are there better ways of achieving this?

There is a much better way of doing this: string.Split(). If you read the entire string in, C# can automatically split it on every space:

string[] words = reader.ReadToEnd().Split(' ');

The words array now contains all of the words in the file, and you can do whatever you want with them.

Additionally, you may want to investigate the File.ReadAllText method in the System.IO namespace; it may make reading a file into a string much easier.

Edit: I guess this assumes that your file is not unreasonably large; as long as the entire thing can reasonably be read into memory, this is the easiest approach. If you have gigabytes of data to read in, you'll probably want to shy away from this. I'd suggest using this approach where possible, though: it makes better use of the framework that you have at your disposal.
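One detail of Split(' ') that can matter here: two consecutive spaces produce an empty string in the result, and a line break is not treated as a separator. The sketch below is one possible way to handle both, assuming that any whitespace should count as a word boundary. The class and method names (SplitDemo, SplitWords) are made up for illustration and are not part of the answer above.

```csharp
using System;

public static class SplitDemo
{
    // Splits text into words on any whitespace and drops empty entries.
    // Passing a null separator array makes Split treat all white-space
    // characters as delimiters.
    public static string[] SplitWords(string text)
    {
        return text.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
    }
}
```

With this helper, SplitWords("the  quick\nbrown") gives three words, whereas "the  quick\nbrown".Split(' ') gives three entries of which one is empty and one still contains the line break.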

If you're interested in good performance even on very large files, you should have a look at the MemoryMappedFile class, which was new in .NET 4.0.

For example:

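// Requires: using System.IO; using System.IO.MemoryMappedFiles; using System.Text;
long fileLength = new FileInfo(filePath).Length;
using (var mappedFile1 = MemoryMappedFile.CreateFromFile(filePath))
// A default view is rounded up to the system page size, so pass the real
// file length to avoid reading padding past the end of the file.
using (Stream mmStream = mappedFile1.CreateViewStream(0, fileLength))
using (StreamReader sr = new StreamReader(mmStream, Encoding.ASCII))
{
    string line;
    while ((line = sr.ReadLine()) != null)
    {
        var lineWords = line.Split(' ');
        // process lineWords here
    }
}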

From MSDN:

A memory-mapped file maps the contents of a file to an application's logical address space. Memory-mapped files enable programmers to work with extremely large files because memory can be managed concurrently, and they allow complete, random access to a file without the need for seeking. Memory-mapped files can also be shared across multiple processes.

The CreateFromFile methods create a memory-mapped file from a specified path or a FileStream of an existing file on disk. Changes are automatically propagated to disk when the file is unmapped.

The CreateNew methods create a memory-mapped file that is not mapped to an existing file on disk; and are suitable for creating shared memory for interprocess communication (IPC).

A memory-mapped file is associated with a name.

You can create multiple views of the memory-mapped file, including views of parts of the file. You can map the same part of a file to more than one address to create concurrent memory. For two views to remain concurrent, they have to be created from the same memory-mapped file. Creating two file mappings of the same file with two views does not provide concurrency.

First of all: StringReader reads from a string which is already in memory. This means that you will have to load the input file in its entirety before being able to read from it, which defeats the purpose of reading a few characters at a time; it can also be undesirable or even impossible if the input is very large.

The class to read from a text stream (which is an abstraction over a source of data) is StreamReader, and you might want to use that one instead. Now StreamReader and StringReader share an abstract base class, TextReader, which means that if you code against TextReader then you can have the best of both worlds.

TextReader's public interface will indeed support your example code, so I'd say it's a reasonable starting point. You just need to fix the one glaring bug: there is no check for Read returning -1 (which signifies the end of available data).
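As a rough illustration of that fix, here is a minimal sketch of the question's method rewritten against TextReader with the -1 check in place. The class name WordReader is invented for this example. Two further choices are assumptions of mine rather than part of the original code: the method returns null when no word is left, and it treats any whitespace (not only a space) as a separator, skipping repeated separators.

```csharp
using System;
using System.IO;
using System.Text;

public static class WordReader
{
    // Reads the next whitespace-delimited word from any TextReader
    // (StreamReader or StringReader). Returns null when no word is left.
    public static string GetNextWord(TextReader reader)
    {
        var word = new StringBuilder();
        int next;
        // Read() returns -1 at the end of the data, so check before casting to char.
        while ((next = reader.Read()) != -1)
        {
            char c = (char)next;
            if (char.IsWhiteSpace(c))
            {
                if (word.Length > 0)
                    break;      // whitespace after at least one character ends the word
                continue;       // skip leading or repeated whitespace
            }
            word.Append(c);
        }
        return word.Length > 0 ? word.ToString() : null;
    }
}
```

A caller would loop with something like `while ((w = WordReader.GetNextWord(reader)) != null)`, passing a StreamReader for a file or a StringReader for text already in memory. The StringBuilder avoids allocating a new string for every appended character.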

If you want to read it without splitting the string (for example, because the lines are so long that you might encounter an OutOfMemoryException), you should do it like this, using a StreamReader:

while (sr.Peek() >= 0)
{
    c = (char)sr.Read();
    if (c.Equals(' ') || c.Equals('\t') || c.Equals('\n') || c.Equals('\r'))
    {
        break;
    }
    else
        word += c;
}
return word;

All in one line, here you go (assuming ASCII, and perhaps not a 2 GB file):

var file = File.ReadAllText(@"C:\myfile.txt", Encoding.ASCII).Split(new[] { ' ' });

This returns a string array, which you can iterate over and do whatever you need with.

I created a simple console program for your exact requirement, using the files you mentioned. It should be easy to run and check. The code is below. Hope this helps.

static void Main(string[] args)
    {

        string[] input = File.ReadAllLines(@"C:\Users\achikhale\Desktop\file.txt");
        string[] array1File = File.ReadAllLines(@"C:\Users\achikhale\Desktop\array1.txt");
        string[] array2File = File.ReadAllLines(@"C:\Users\achikhale\Desktop\array2.txt");

        List<string> finalResultarray1File = new List<string>();
        List<string> finalResultarray2File = new List<string>();

        foreach (string inputstring in input)
        {
            string[] wordTemps = inputstring.Split(' ');//  .Split(' ');

            foreach (string array1Filestring in array1File)
            {
                string[] word1Temps = array1Filestring.Split(' ');

                var result = word1Temps.Where(y => !string.IsNullOrEmpty(y) && wordTemps.Contains(y)).ToList();

                if (result.Count > 0)
                {
                    finalResultarray1File.AddRange(result);
                }

            }

        }

        foreach (string inputstring in input)
        {
            string[] wordTemps = inputstring.Split(' ');//  .Split(' ');

            foreach (string array2Filestring in array2File)
            {
                string[] word1Temps = array2Filestring.Split(' ');

                var result = word1Temps.Where(y => !string.IsNullOrEmpty(y) && wordTemps.Contains(y)).ToList();

                if (result.Count > 0)
                {
                    finalResultarray2File.AddRange(result);
                }

            }

        }

        if (finalResultarray1File.Count > 0)
        {
            Console.WriteLine("file array1.txt contains words: {0}", string.Join(";", finalResultarray1File));
        }

        if (finalResultarray2File.Count > 0)
        {
            Console.WriteLine("file array2.txt contains words: {0}", string.Join(";", finalResultarray2File));
        }

        Console.ReadLine();

    }
}

This code will extract words from a text file based on a Regex pattern. You can try playing with other patterns to see what works best for you.

    StreamReader reader =  new StreamReader(fileName);

    var pattern = new Regex(
              @"( [^\W_\d]              # starting with a letter
                                        # followed by a run of either...
                  ( [^\W_\d] |          #   more letters or
                    [-'\d](?=[^\W_\d])  #   ', -, or digit followed by a letter
                  )*
                  [^\W_\d]              # and finishing with a letter
                )",
              RegexOptions.IgnorePatternWhitespace);

    string input = reader.ReadToEnd();

    foreach (Match m in pattern.Matches(input))
        Console.WriteLine("{0}", m.Groups[1].Value);

    reader.Close();       

This method splits your words whether they are separated by a single space or by several spaces (two spaces, for example):

StreamReader streamReader = new StreamReader(filePath); //get the file
string stringWithMultipleSpaces= streamReader.ReadToEnd(); //load file to string
streamReader.Close();

Regex r = new Regex(" +"); //specify delimiter (spaces)
string [] words = r.Split(stringWithMultipleSpaces); //(convert string to array of words)

foreach (String W in words)
{
   MessageBox.Show(W);
}

I would do something like this:

IEnumerable<string> ReadWords(StreamReader reader)
{
    string line;
    while((line = reader.ReadLine())!=null)
    {
        foreach(string word in line.Split(new[] {' '}, StringSplitOptions.RemoveEmptyEntries))
        {
            yield return word;
        }
    }
}

If you use File.ReadAllText (or reader.ReadToEnd), the entire file is loaded into memory, so you can get an OutOfMemoryException and a lot of other problems.
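For comparison, a similar lazy sequence of words can be built from File.ReadLines, which enumerates the file one line at a time instead of loading it whole. This is only a sketch: the class name LazyWords and the path-based signature are assumptions made for the example, and a file consisting of one enormous line would still be read into memory as that single line.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class LazyWords
{
    // Lazily yields the space-separated words of a file, line by line.
    // Nothing is read until the returned sequence is enumerated.
    public static IEnumerable<string> ReadWords(string path)
    {
        return File.ReadLines(path)
                   .SelectMany(line => line.Split(new[] { ' ' },
                                                  StringSplitOptions.RemoveEmptyEntries));
    }
}
```

Because the sequence is lazy, `foreach (string word in LazyWords.ReadWords(path))` processes each word as it is read, and only the current line is held in memory at any time.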

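Notice: the technical posts on this site are licensed under CC BY-SA 4.0. If you republish them, please credit this site's URL or the original address. For any questions, contact: yoyou2525@163.com.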

 