String search/indexing in a file using C++

I am using the following code, which searches a file and prints each matching line together with its line number. But is this code fast enough for hundreds of thousands of lines? My PC froze for a few seconds. I need to search the pairs of integers and return the RHS value after the comma (some statistical stuff), but with the following code I can only return the whole line.

  1. Is it a good idea, in terms of speed, to parse the returned line using split functions and get my RHS value?

OR

  2. Directly get the RHS value based on the LHS argument. (I have been unable to do this.)

Can anyone help me achieve either of the two?

Here is my code:

#include <string>
#include <iostream>
#include <fstream>

int main()
{
    std::ifstream file( "index_hyper.txt" ) ;
    std::string search_str = "401" ;
    std::string line ;
    int line_number = 0 ;
    while( std::getline( file, line ) )
    {
        ++line_number ;
        if( line.find(search_str) != std::string::npos )
            std::cout << "line " << line_number << ": " << line << '\n' ;
    }
}

Here is the content of my index_hyper.txt file:

18,22
20,37
151,61
200,62
156,63
158,64
159,65
153,66
156,67
152,68
154,69
155,56
156,14
157,13
160,122
161,1333
400,455
401,779
402,74
406,71

Assuming you want an exact match on the first field, you can do the work of the code above with:

grep -n "^401," index_hyper.txt

If you want to output just the RHS, you can use:

grep  "^401," index_hyper.txt | sed "s/[^,]*,//"

If you are on a Windows platform without sed, grep, bash, etc., you can easily access Unix tools by installing Cygwin.

As a general rule, don't start breaking the string up into smaller pieces (substrings) until you need to. And start by specifying exactly what is wanted: you speak of RHS and LHS, and talk of "get RHS value based on LHS argument". So: do you want an exact match on the first field, a substring match on the first field, or a substring match on the entire line?

At any rate: once you have the line in the string line, you can easily separate it into the two fields by locating the comma (std::find needs the <algorithm> header):

std::string::const_iterator pivot = std::find( line.cbegin(), line.cend(), ',' );

What you do then depends on what your criterion is:

if ( pivot - line.cbegin() == search_str.size() &&
        std::equal( line.cbegin(), pivot, search_str.begin() ) ) {
    //  Exact match on first field...
    std::cout << std::string( std::next( pivot ), line.cend() );
}

if ( std::search( line.cbegin(), pivot, search_str.begin(), search_str.end() ) != pivot ) {
    //  Matches substring in first field...
    std::cout << std::string( std::next( pivot ), line.cend() );
}

if ( std::search( line.cbegin(), line.cend(), search_str.begin(), search_str.end() ) != line.cend() ) {
    //  Matches substring in complete line...
    //  (assumes pivot != line.cend(), i.e. the line contains a comma)
    std::cout << std::string( std::next( pivot ), line.cend() );
}

Of course, you'll need some additional error checking. What should you do if there isn't a comma in the line (i.e. pivot == line.cend() )? Or what about extra spaces in the line? (Your example looks like numbers. Should "401" match only "401" , or also "+401" ?)
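One possible way to handle the first two cases is sketched here. The helpers trim and split_fields are hypothetical, they use std::string::find rather than the iterator-based std::find shown earlier, and the sketch deliberately takes no position on whether "+401" should match "401".

```cpp
#include <string>

// Hypothetical helpers, not from the original answer.
// Strips leading and trailing spaces, tabs and carriage returns.
std::string trim(const std::string& s)
{
    const char* const blanks = " \t\r";
    const std::string::size_type first = s.find_first_not_of(blanks);
    if (first == std::string::npos)
        return std::string();   // empty or all blanks
    const std::string::size_type last = s.find_last_not_of(blanks);
    return s.substr(first, last - first + 1);
}

// Splits "lhs,rhs" at the first comma. Returns false, leaving the outputs
// untouched, when the line has no comma at all.
bool split_fields(const std::string& line, std::string& lhs, std::string& rhs)
{
    const std::string::size_type comma = line.find(',');
    if (comma == std::string::npos)
        return false;
    lhs = trim(line.substr(0, comma));
    rhs = trim(line.substr(comma + 1));
    return true;
}
```

The caller decides what a false return means (skip the line, report it, or abort), which keeps the parsing separate from the error policy.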

Before going any further, you should very carefully specify exactly what the code should do, for all possible inputs. (For most possible inputs, of course, the answer will probably be: output an error message with the line number to std::cerr and continue. Be sure to return EXIT_FAILURE in such a case.)
