
Getting input from a text file and storing it into an array when the text file contains more than 20,000 strings

I am getting input from a text file and storing it into an array, but the text file contains more than 20,000 strings. I'm trying to read the strings from the text file and store them in a very large array. How can I do that?

I cannot use vectors. Is it possible to do it without using a hash table?

Afterward, I will try to find the most frequently used words using sorting.

You do not need to keep the whole file in memory to count the frequency of words. You only need to keep a single entry and some data structure to count the frequencies, for example a std::unordered_map<std::string,unsigned>.

Not tested:

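#include <fstream>
#include <string>
#include <unordered_map>

// Count how often each whitespace-separated word occurs in the file
std::unordered_map<std::string,unsigned> processFileEntries(std::ifstream& file) {
    std::unordered_map<std::string,unsigned> freq;
    std::string word;

    while ( file >> word ) {   // read one word at a time
        ++freq[word];          // operator[] creates a missing entry with count 0
    }
    return freq;
}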

For more efficient reading or more elaborate processing you could also read the file in chunks (e.g. 100 words), process one chunk, and then continue with the next chunk.
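The chunked variant could look roughly like the sketch below. It is only an illustration: kChunkSize, readChunk and countInChunks are names made up for this example, and the functions take a std::istream& so that they work with a std::ifstream as well as a std::istringstream.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <unordered_map>

// Hypothetical chunk size; the text above suggests something like 100 words.
constexpr std::size_t kChunkSize = 100;

// Reads up to kChunkSize words into buffer and returns how many were read.
std::size_t readChunk(std::istream& in, std::string (&buffer)[kChunkSize]) {
    std::size_t n = 0;
    while (n < kChunkSize && in >> buffer[n]) ++n;
    return n;
}

// Counts word frequencies chunk by chunk, so only one chunk of words
// (plus the frequency table) is held in memory at any time.
std::unordered_map<std::string, unsigned> countInChunks(std::istream& in) {
    std::unordered_map<std::string, unsigned> freq;
    std::string buffer[kChunkSize];
    std::size_t n;
    while ((n = readChunk(in, buffer)) > 0) {
        assert(n <= kChunkSize);               // sanity check on the helper
        for (std::size_t i = 0; i < n; ++i)
            ++freq[buffer[i]];                 // process the current chunk
    }
    return freq;
}
```

With a file you would call it as countInChunks(file) on an opened std::ifstream. Whether this is actually faster than reading word by word depends on what the per-chunk processing does, so measure before relying on it.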

Assuming you're using C-style / raw arrays you could do something like:

const size_t number_of_entries = count_entries_in_file();

//Make sure we actually have entries
assert(number_of_entries > 0);

std::string* file_entries = new std::string[number_of_entries];

//fill file_entries with the files entries
//...

//release heap memory again, so we don't create a leak

delete[] file_entries;
file_entries = nullptr;
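The snippet above leaves count_entries_in_file() and the filling step open. One possible way to fill them in, assuming the entries are whitespace-separated words, is to read the stream twice: once to count, once to store. The sketch below also shows how sorting alone is enough to find the most frequent word, without any hash table. The names countEntries and mostFrequentWord are made up for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper: counts whitespace-separated words, then rewinds the stream.
std::size_t countEntries(std::istream& in) {
    std::size_t n = 0;
    std::string word;
    while (in >> word) ++n;
    in.clear();                   // reset eof/fail bits so the stream can be reused
    in.seekg(0, std::ios::beg);   // go back to the start for the second pass
    return n;
}

// Finds the most frequent word using only a raw array and std::sort.
// bestCount receives the number of occurrences of the returned word.
std::string mostFrequentWord(std::istream& in, std::size_t& bestCount) {
    const std::size_t n = countEntries(in);
    assert(n > 0);                              // make sure we actually have entries

    std::string* entries = new std::string[n];
    for (std::size_t i = 0; i < n; ++i) in >> entries[i];

    std::sort(entries, entries + n);            // equal words become adjacent

    std::string best;
    bestCount = 0;
    for (std::size_t i = 0; i < n;) {
        std::size_t j = i;
        while (j < n && entries[j] == entries[i]) ++j;   // end of the run of equal words
        if (j - i > bestCount) { bestCount = j - i; best = entries[i]; }
        i = j;
    }

    delete[] entries;                           // release heap memory again
    return best;
}
```

Note that this sketch is not exception safe: if anything throws between new[] and delete[], the array leaks. For a learning exercise that may be acceptable.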

Your requirement is to NOT use any standard container such as std::vector or std::unordered_map.

In this case we need to create a dynamic container ourselves. That is not complicated. We can even use it for storing strings, so I will not use std::string in my example either.

I created a demo for you with ~700 lines of code.

We will first define the term "capacity". This is the number of elements that could be stored in the container, that is, the currently available space. It has nothing to do with how many elements are actually stored in the container.

The single most important feature of a dynamic container is that it must be able to grow. This is necessary whenever we want to add more elements to the container than its capacity allows.

So, if we want to add an additional element at the end of the container, and the number of elements is >= its capacity, then we need to allocate a bigger block of memory and copy all the old elements to the new memory. On such an event we usually double the capacity, which prevents frequent reallocation and copying.

Let me show you one example of a push_back function, which could be implemented like this:

template <typename T>
void DynamicArray<T>::push_back(const T& d) {               // Add a new element at the end
    if (numberOfElements >= capacity) {                     // Check, if capacity of this dynamic array is big enough
        capacity *= 2;                                      // Obviously not, we will double the capacity
        T* temp = new T[capacity];                          // Allocate new and more memory
        for (unsigned int k = 0; k < numberOfElements; ++k)
            temp[k] = data[k];                              // Copy data from old memory to new memory
        delete[] data;                                      // Release old memory
        data = temp;                                        // And assign newly allocated memory to old pointer
    }
    data[numberOfElements++] = d;                           // And finally, store the given data at the end of the container
}

This is a basic approach. I use templates in order to be able to store any type in the dynamic array.
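To see how such a push_back fits into a complete class, here is a minimal sketch. It is not the code from the demo: the member names follow the push_back shown above, while kInitialCapacity, size(), operator[] and the copy operations are assumptions added for illustration. It requires T to be default constructible and copy assignable.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>

template <typename T>
class DynamicArray {
public:
    DynamicArray()
        : data(new T[kInitialCapacity]), numberOfElements(0), capacity(kInitialCapacity) {}

    // Deep copy, needed because the class owns raw memory
    DynamicArray(const DynamicArray& other)
        : data(new T[other.capacity]), numberOfElements(other.numberOfElements),
          capacity(other.capacity) {
        for (std::size_t k = 0; k < numberOfElements; ++k) data[k] = other.data[k];
    }

    // Copy-and-swap: the by-value parameter is the copy, its destructor frees our old memory
    DynamicArray& operator=(DynamicArray other) {
        std::swap(data, other.data);
        std::swap(numberOfElements, other.numberOfElements);
        std::swap(capacity, other.capacity);
        return *this;
    }

    ~DynamicArray() { delete[] data; }

    void push_back(const T& d) {
        if (numberOfElements >= capacity) {          // no room left, so grow
            const std::size_t newCapacity = capacity * 2;
            T* temp = new T[newCapacity];
            for (std::size_t k = 0; k < numberOfElements; ++k)
                temp[k] = data[k];                   // copy old elements
            temp[numberOfElements] = d;              // store before freeing, d may refer into data
            delete[] data;
            data = temp;
            capacity = newCapacity;
            ++numberOfElements;
            return;
        }
        data[numberOfElements++] = d;
    }

    std::size_t size() const { return numberOfElements; }

    T& operator[](std::size_t i) { assert(i < numberOfElements); return data[i]; }
    const T& operator[](std::size_t i) const { assert(i < numberOfElements); return data[i]; }

private:
    static constexpr std::size_t kInitialCapacity = 8;   // arbitrary, non-zero starting size
    T* data;
    std::size_t numberOfElements;
    std::size_t capacity;
};
```

The starting capacity must not be zero, because doubling zero would never make room. The copy constructor and assignment operator matter as soon as the element type is itself a DynamicArray, since push_back copies elements by assignment.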

You could get rid of the templates by deleting all the template stuff and replacing "T" with your intended data type.

But I would not do that. See how easily we can create a "String" class: we just define a dynamic array of chars as "String".

using String = DynamicArray<char>;

will give us basic string functionality. And if we want to have a dynamic array of strings later, we can write:

using StringArray = DynamicArray<String>;

and this gives us a DynamicArray<DynamicArray<char>>. Cool.

For these special classes we can overload some operators, which makes handling them, and our life, even simpler.

Please look at the provided code.


And, to be able to use the container in the typical C++ environment, we can add full iterator capability. That makes life even simpler.

This does take some typing effort, but it is not complicated.


You also wanted to create a hash map, for counting words.

For that we will create a key/value pair. The key is the String that we defined above and the value will be the frequency counter.

We implement a hash function, which should be carefully selected to work with strings, have high entropy, and give good results for the bucket size of the hash map.
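As an illustration of what such a string hash can look like, here is a sketch using the well-known 64-bit FNV-1a algorithm. This is not necessarily the function used in the demo code; fnv1a and bucketIndex are names chosen for this example, and the modulo step is just one simple way to map a hash value to a bucket.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// 64-bit FNV-1a: XOR each byte into the hash, then multiply by the FNV prime.
std::uint64_t fnv1a(const char* s, std::size_t length) {
    std::uint64_t hash = 14695981039346656037ULL;      // FNV offset basis
    for (std::size_t i = 0; i < length; ++i) {
        hash ^= static_cast<unsigned char>(s[i]);
        hash *= 1099511628211ULL;                      // FNV prime
    }
    return hash;
}

// Maps a string to one of bucketCount buckets.
std::size_t bucketIndex(const char* s, std::size_t length, std::size_t bucketCount) {
    assert(bucketCount > 0);
    return static_cast<std::size_t>(fnv1a(s, length) % bucketCount);
}
```

The function takes a pointer and a length rather than a std::string, so it can be used with a hand-written String class as long as that class exposes its characters and size.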

The hash map itself is a dynamic container. We will also add iterator functionality to it.


For all this I drafted some 700 lines of code for you. You can take it as an example for your further studies.

It can also be easily enhanced with additional functionality.

But a caveat: I did only some basic tests, and I even used raw pointers for owned memory. This can be done in a school project to learn some dynamic memory management, but not in real-world code.

Additionally, you can replace all this code by simply using std::string, std::vector and std::unordered_map. Nobody would use such code and reinvent the wheel.

But it may give you some ideas on how to implement similar things.

Because Stack Overflow limits the answer size to 32000 characters, I will put the main part on GitHub.

Please click here.

I will just show you main, so that you can see how easily the solution can be used:

int main() {

    // Open file and check, if it could be opened
    std::ifstream ifs{ "r:\\test.txt" };
    if (ifs) {

        // Define a dynamic array for strings
        StringArray stringArray{};

        // Use the overloaded extraction operator and read all strings from the file into the dynamic array
        ifs >> stringArray;

        // Create a dynamic hash map
        HashMap hm{};

        // Now count the frequency of words
        for (const String& s : stringArray)
            hm[s]++;

        // Put the resulting key/value pairs into a dynamic array
        DynamicArray<Item> items(hm.begin(), hm.end());

        // Sort in descending order by the frequency
        std::sort(items.begin(), items.end(), [](const Item& i1, const Item& i2) { return i1.count > i2.count; });

        // Show result on screen
        for (const auto& [string, count] : items) 
            std::cout << std::left << std::setw(20) << string << '\t' << count << '\n';
    }
    else std::cerr << "\n\nError: Could not open source file\n\n";
}

You can use a std::map to get the frequency of each word in your text file. One example for reference is given below:

#include <iostream>
#include <map>
#include <string>
#include <sstream>
#include <fstream>
int main()
{
    std::ifstream inputFile("input.txt");
    std::map<std::string, unsigned> freqMap;
    std::string line, word; 
    if(inputFile)
    {
        while(std::getline(inputFile, line))//go line by line 
        {
            std::istringstream ss(line);
            
            while(ss >> word)//go word by word 
            {
                ++freqMap[word]; //increment the count value corresponding to the word 
            }
        }
    }
    else 
    {
        std::cout << "input file cannot be opened"<<std::endl;
    }
    
    //print the frequency of each word in the file 
    for(auto myPair: freqMap)
    {
        std::cout << myPair.first << ": " << myPair.second << std::endl;
    }
    return 0;
}

The output of the above program can be seen here.
