Reading alphabetical characters only from file - C++

I need to read words from a text file. A word is defined as a consecutive sequence of letters. So, for example, in the following string:

"It's a ver5y good #” idea of a line. You know it?"

the words are:

it s a ver y good idea of line you know

('it' and 'a' occur twice in the string, so each is listed only once)

I was wondering whether there is any clever function that reads a word until it finds a non-alphabetical character. Or is the only way to read char by char and use push_back until we hit a non-alphabetical one?

When you read a string from a stream, the stream reads a contiguous run of non-white-space characters as the string. It then ignores any white-space characters. The next non-white-space character is the beginning of the next string it'll read. This is pretty much the behavior you want, with one exception: you want everything other than letters to be treated like white space.
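To make the default behavior concrete, here is a small sketch. The helper name split_default is made up for illustration and is not part of the original answer. With an unmodified locale only white space separates tokens, so digits and punctuation stay attached to the words:

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: collect every token that operator>> extracts
// from `text` using the stream's default (unmodified) locale.
std::vector<std::string> split_default(const std::string& text) {
    std::istringstream in(text);
    std::vector<std::string> tokens;
    std::string token;
    // Each extraction skips leading white space, then reads up to the next white space.
    while (in >> token)
        tokens.push_back(token);
    return tokens;
}
```

Calling split_default("ver5y good, idea") gives the three tokens "ver5y", "good," and "idea": the digit and the comma are kept, which is exactly what the custom facet is meant to change.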

Fortunately, the stream doesn't hard-code its idea of what's "white space". It uses a locale to tell it what's white space. A locale, in turn, is composed of pieces that deal with individual aspects ("facets") of localization. The facet that deals specifically with classifying characters is a ctype facet. So, if we write a ctype facet that classifies everything other than a letter as white space, we can read "words" from the stream quite easily.

Here's some code to do exactly that:

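#include <algorithm>
#include <locale>
#include <vector>

struct alpha_only: std::ctype<char> {

    alpha_only(): std::ctype<char>(get_table()) {}

    static std::ctype_base::mask const* get_table() {
        // Start with every character classified as white space.
        static std::vector<std::ctype_base::mask> 
            rc(std::ctype<char>::table_size,std::ctype_base::space);

        // std::fill takes a half-open range, so the end pointer is one past 'z' / 'Z'.
        std::fill(&rc['a'], &rc['z'] + 1, std::ctype_base::lower);
        std::fill(&rc['A'], &rc['Z'] + 1, std::ctype_base::upper);
        return &rc[0];
    }
};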

The char specialization of a ctype facet is (always) table driven. All we really have to do is create a table that classifies characters properly. In this case, that means alphabetical characters are classified as upper- or lower-case, and everything else is classified as white space. We do that by filling the table with ctype_base::space, then for the alphabetical characters basically saying: "oops, no, that's not white space, that's upper- or lower-case."

Technically, the way I've done that is slightly incorrect: it assumes that the upper-case letters are contiguous and the lower-case letters are contiguous. This is true of any sane character set, but not of EBCDIC. If we wanted to be technically correct, instead of the two std::fill calls, we could write a loop something like this:

auto max = std::numeric_limits<unsigned char>::max();

// <= so that the last character value (max itself) is classified too
for (int i=0; i<=max; i++)
    if (islower(i))
        rc[i] = std::ctype_base::lower;
    else if (isupper(i))
        rc[i] = std::ctype_base::upper;
    else
        rc[i] = std::ctype_base::space;

Either way, the conclusion is fairly simple: upper case is upper case, lower case is lower case, everything else is "white space".

Once we've written that, we need to tell the stream to use a locale containing that facet; then we can read our words really easily:
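As a quick sanity check of that classification, the following sketch builds a self-contained variant of the facet using the islower/isupper loop and then asks a locale how it classifies individual characters. The helper name treated_as_space and the choice of std::locale::classic() as the base locale are assumptions made for this illustration only:

```cpp
#include <cctype>
#include <limits>
#include <locale>
#include <vector>

struct alpha_only : std::ctype<char> {
    alpha_only() : std::ctype<char>(get_table()) {}

    static std::ctype_base::mask const* get_table() {
        // Built once: every entry starts as white space, letters are then reclassified.
        static const std::vector<std::ctype_base::mask> table = [] {
            std::vector<std::ctype_base::mask> t(std::ctype<char>::table_size,
                                                 std::ctype_base::space);
            for (int c = 0; c <= std::numeric_limits<unsigned char>::max(); ++c) {
                if (std::islower(c))
                    t[c] = std::ctype_base::lower;
                else if (std::isupper(c))
                    t[c] = std::ctype_base::upper;
            }
            return t;
        }();
        return table.data();
    }
};

// Hypothetical helper: does a locale carrying alpha_only classify `c` as white space?
bool treated_as_space(char c) {
    static const std::locale loc(std::locale::classic(), new alpha_only);
    return std::use_facet<std::ctype<char>>(loc).is(std::ctype_base::space, c);
}
```

With this in place, treated_as_space('5') and treated_as_space('#') should be true, while treated_as_space('z') and treated_as_space('Z') should be false. Checking the last letter of each range is a cheap way to catch an off-by-one in the table.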

#include <iostream>
#include <locale>
#include <sstream>
#include <string>

int main() { 
    std::istringstream infile("It's a ver5y good #” idea of a line. You know it?");

    // Tell the stream to use our character classifier:
    infile.imbue(std::locale(std::locale(), new alpha_only));

    std::string word;
    while (infile >> word)
        std::cout << word << "\n";
}

[I've put a new-line after each "word" so you can easily see what it's reading as a word.]

Result:

It
s
a
ver
y
good
idea
of
a
line
You
know
it

Based on your result in the question, you apparently also want each word to appear only once in the output. To do that, you'd typically insert each word into a set as it's read, and only write it to the output if the insertion into the set was successful.

std::unordered_set<std::string> words;
std::string word;

while (infile >> word) 
    if (words.insert(word).second)
        std::cout << word << "\n";

The insert for set and unordered_set returns a pair<iterator, bool>, where the bool indicates whether the insertion was successful. If the word was already present, the insertion fails and the bool is false, so based on that we decide whether to write the word out or not.

With this modification, "it" still appears in the output twice: the first instance has the i capitalized, and the second doesn't. To filter that out, you'll need to convert each string entirely to lower-case (or entirely to upper-case) before inserting it into the set.
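One way that last step might look is sketched below. The helper names to_lower_in_place and unique_words are invented for this example, and the function assumes the stream passed in has already been imbued with the alpha_only locale (or otherwise yields one word per extraction):

```cpp
#include <algorithm>
#include <cctype>
#include <sstream>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical helper: convert a word to lower case in place.
void to_lower_in_place(std::string& word) {
    // Go through unsigned char so std::tolower never receives a negative value.
    std::transform(word.begin(), word.end(), word.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
}

// Hypothetical helper: return each distinct word once, in order of first
// appearance, treating words that differ only in case as the same word.
std::vector<std::string> unique_words(std::istream& in) {
    std::unordered_set<std::string> seen;
    std::vector<std::string> result;
    std::string word;
    while (in >> word) {
        to_lower_in_place(word);
        if (seen.insert(word).second)   // true only for the first occurrence
            result.push_back(word);
    }
    return result;
}
```

Passing the imbued infile from the main above to unique_words should produce the list from the question: it, s, a, ver, y, good, idea, of, line, you, know. Returning a vector rather than printing directly keeps the order of first appearance, which an unordered_set alone would not preserve.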
