
How to getline() from a specific line in a file? C++

I've looked around a bit and have found no definitive answer on how to read a specific line of text from a file in C++. I have a text file with over 100,000 English words, each on its own line. I can't use arrays because they obviously won't hold that much data, and vectors take too long to store every word. How can I achieve this?

P.S. I found no duplicates of this question regarding C++.

while (getline(words_file, word))
{
    my_vect.push_back(word);
}

EDIT:

A commenter below has helped me realize that the only reason loading the file into a vector was taking so long was because I was debugging. Plainly running the .exe loads the file nearly instantaneously. Thanks for everyone's help.

You have a few options, but none will automatically let you go to a specific line. File systems don't track line numbers within files.

One way is to have fixed-width lines in the file. Then read the appropriate amount of data based upon the line number you want and the number of bytes per line.
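A minimal sketch of that fixed-width approach, assuming a hypothetical file in which every line is padded to exactly 32 bytes including the newline (the file name and record length are illustrative):

#include <fstream>
#include <iostream>
#include <string>

int main() {
    const std::streamoff record_len = 32;   // assumed bytes per line, '\n' included
    const std::streamoff target_line = 42;  // zero-based line to fetch

    std::ifstream file("dict_fixed.txt", std::ios::binary);
    if (!file) {
        std::cerr << "Unable to open file.\n";
        return 1;
    }

    // Jump directly to the start of the desired line; no scanning needed.
    file.seekg(target_line * record_len);

    std::string line;
    if (std::getline(file, line))
        std::cout << line << '\n';          // may still carry trailing padding
    return 0;
}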

Another way is to loop, reading lines one at a time until you get to the line that you want.

A third way would be to have a sort of index that you create at the beginning of the file to reference the location of each line. This, of course, would require that you control the file format.
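A minimal sketch of the index idea, building the offset table in memory on a first pass rather than storing it in the file itself (the file name is illustrative):

#include <fstream>
#include <iostream>
#include <string>
#include <vector>

int main() {
    std::ifstream file("dict.txt", std::ios::binary);
    if (!file) {
        std::cerr << "Unable to open file.\n";
        return 1;
    }

    // Pass 1: record the byte offset at which each line starts.
    std::vector<std::streamoff> index;
    index.push_back(0);
    std::string line;
    while (std::getline(file, line))
        index.push_back(file.tellg());
    index.pop_back();                       // last entry is end-of-file, not a line

    // Random access: seek straight to line n and read it.
    const std::size_t n = 99999;            // zero-based line number
    if (n < index.size()) {
        file.clear();                       // reset the EOF state before seeking
        file.seekg(index[n]);
        std::getline(file, line);
        std::cout << line << '\n';
    }
    return 0;
}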

If your words have no whitespace (I assume they don't), you can use a trickier non-getline solution using a deque!

#include <algorithm>
#include <deque>
#include <fstream>
#include <iostream>
#include <iterator>
#include <string>

using namespace std;

int main() {
    deque<string> dictionary;

    cout << "Loading file..." << endl;
    ifstream myfile("dict.txt");
    if (myfile.is_open()) {
        // Read whitespace-separated tokens straight into the deque.
        copy(istream_iterator<string>(myfile),
             istream_iterator<string>(),
             back_inserter(dictionary));
        myfile.close();
    } else {
        cout << "Unable to open file." << endl;
    }

    return 0;
}

The above reads the file token by token with an istream_iterator<string>, which splits on whitespace (the std::istream default - this is a big assumption on my part) and makes it slightly faster than reading line by line. This gets done in about 2-3 seconds with 100,000 words. I'm also using a deque, which is the best data structure (imo) for this particular scenario. When I use vectors, it takes around 20 seconds (not even close to your minute mark -- you must be doing something else that increases complexity).

To access the word at line 1:

cout << dictionary[0] << endl;

Hope this has been useful.

I already mentioned this in a comment, but I wanted to give it a bit more visibility for anyone else who runs into this issue...

I think that the following code will take a long time to read from the file because std::vector probably has to re-allocate its internal memory several times to account for all of the elements that you are adding. This is an implementation detail, but if I understand correctly std::vector usually starts out small and grows its memory as necessary to accommodate new elements. Each reallocation has to copy the existing contents to the new storage, which works fine when you're adding a handful of elements, but gets expensive when you're adding a hundred thousand of them one at a time.

while (getline(words_file, word)) {
    my_vect.push_back(word);
}

So, before running the loop above, try calling my_vect.reserve(100000). This forces std::vector to allocate enough memory in advance so that it doesn't need to shuffle things around later. (Note that constructing the vector as my_vect(100000) would instead create 100,000 empty strings, and push_back would then append after them.)
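A minimal sketch of that fix, reusing the names from the question (the file name is illustrative):

#include <fstream>
#include <string>
#include <vector>

int main() {
    std::ifstream words_file("dict.txt");
    std::vector<std::string> my_vect;
    my_vect.reserve(100000);            // pre-allocate capacity; size stays 0

    std::string word;
    while (std::getline(words_file, word))
        my_vect.push_back(word);        // appends without repeated reallocation
    return 0;
}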

The question is exceedingly unclear. How do you determine the specific line? If it is the nth line, the simplest solution is just to call getline n times, throwing out all but the last result; calling ignore n-1 times might be slightly faster, but I suspect that if you're always reading into the same string (rather than constructing a new one each time), the difference in time won't be enormous.

If you have some other criterion, and the file is really big (which from your description it isn't) and sorted, you might try a binary search: seek to the middle of the file, read enough ahead to find the start of the next line, then decide the next step according to its value. (I've used this to find relevant entries in log files. But we're talking about files which are several gigabytes in size.)

If you're willing to use system-dependent code, it might be advantageous to memory map the file, then search for the nth occurrence of a '\n' (std::find n times).

ADDED: Just some quick benchmarks. On my Linux box, getting the 100,000th word from /usr/share/dict/words (479623 words, one per line, on my machine) takes about

  • 272 milliseconds, reading all words into an std::vector, then indexing,
  • 256 milliseconds doing the same, but with std::deque,
  • 30 milliseconds using getline, but just ignoring the results until the one I'm interested in,
  • 20 milliseconds using istream::ignore, and
  • 6 milliseconds using mmap and looping on std::find.

FWIW, the code in each case is:

For the std:: containers:

// Line and Gabi::ProgramManagement::fatal() are the author's own helpers:
// Line's operator>> extracts one full line; fatal() reports an error and exits.
template<typename Container>
void Using<Container>::operator()()
{
    std::ifstream input( m_filename.c_str() );
    if ( !input )
        Gabi::ProgramManagement::fatal() << "Could not open " << m_filename;
    Container().swap( m_words );
    std::copy( std::istream_iterator<Line>( input ),
               std::istream_iterator<Line>(),
               std::back_inserter( m_words ) );
    if ( static_cast<int>( m_words.size() ) < m_target )
        Gabi::ProgramManagement::fatal() 
            << "Not enough words, had " << m_words.size()
            << ", wanted at least " << m_target;
    m_result = m_words[ m_target ];
}

For getline without saving:

void UsingReadAndIgnore::operator()()
{
    std::ifstream input( m_filename.c_str() );
    if ( !input )
        Gabi::ProgramManagement::fatal() << "Could not open " << m_filename;
    std::string dummy;
    for ( int count = m_target; count > 0; -- count )
        std::getline( input, dummy );
    std::getline( input, m_result );
}

For ignore :

void UsingIgnore::operator()()
{
    std::ifstream input( m_filename.c_str() );
    if ( !input )
        Gabi::ProgramManagement::fatal() << "Could not open " << m_filename;
    for ( int count = m_target; count > 0; -- count )
        input.ignore( INT_MAX, '\n' );
    std::getline( input, m_result );
}

And for mmap :

void UsingMMap::operator()()
{
    int input = ::open( m_filename.c_str(), O_RDONLY );
    if ( input < 0 )
        Gabi::ProgramManagement::fatal() << "Could not open " << m_filename;
    struct ::stat infos;
    if ( ::fstat( input, &infos ) != 0 )
        Gabi::ProgramManagement::fatal() << "Could not stat " << m_filename;
    char* base = (char*)::mmap( NULL, infos.st_size, PROT_READ, MAP_PRIVATE, input, 0 );
    if ( base == MAP_FAILED )
        Gabi::ProgramManagement::fatal() << "Could not mmap " << m_filename;
    char const* end = base + infos.st_size;
    char const* curr = base;
    char const* next = std::find( curr, end, '\n' );
    for ( int count = m_target; count > 0 && curr != end; -- count ) {
        curr = next + 1;
        next = std::find( curr, end, '\n' );
    }
    m_result = std::string( curr, next );
    ::munmap( base, infos.st_size );
}

In each case, the code was run through the same harness (the functors above), so the timings are directly comparable.

You could seek to a specific position, but that requires that you know where the line starts. "A little less than a minute" for 100,000 words does sound slow to me.

Read some data, count the newlines, throw away that data and read some more, count the newlines again... and repeat until you've read enough newlines to hit your target.
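A minimal sketch of that chunked counting (the buffer size and file name are arbitrary choices):

#include <fstream>
#include <iostream>
#include <string>
#include <vector>

int main() {
    const long target = 99999;              // zero-based line number we want
    std::ifstream file("dict.txt", std::ios::binary);
    if (!file) {
        std::cerr << "Unable to open file.\n";
        return 1;
    }

    std::vector<char> buf(64 * 1024);       // arbitrary chunk size
    long seen = 0;                          // newlines counted so far
    std::streamoff base = 0;                // file offset of the current chunk

    while (seen < target) {
        file.read(buf.data(), static_cast<std::streamsize>(buf.size()));
        std::streamsize got = file.gcount();
        if (got == 0)
            return 1;                       // fewer than `target` lines in the file
        for (std::streamsize i = 0; i < got; ++i) {
            if (buf[i] == '\n' && ++seen == target) {
                file.clear();               // a short read may have set failbit
                file.seekg(base + i + 1);   // line starts just past this newline
                std::string line;
                std::getline(file, line);
                std::cout << line << '\n';
                return 0;
            }
        }
        base += got;
    }

    std::string line;                       // target == 0: first line of the file
    std::getline(file, line);
    std::cout << line << '\n';
    return 0;
}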

Also, as others have suggested, this is not a particularly efficient way of accessing data. You'd be well-served by making an index.
