String split in C with strtok function

I'm trying to split some strings on the {white_space} symbol, but some of the splits come out wrong: I want to split on {white_space} while keeping quoted substrings together as single tokens.

For example:

char *pch;
char str[] = "hello \"Stack Overflow\" good luck!";
pch = strtok(str," ");
while (pch != NULL)
{
    printf ("%s\n",pch);
    pch = strtok(NULL, " ");
}

This gives me:

hello
"Stack
Overflow"
good
luck!

But what I want is:

hello
Stack Overflow
good
luck!

Any suggestions or ideas?

You'll need to tokenize twice. The program flow you currently have is as follows:

1) Search for a space

2) Print all characters prior to the space

3) Search for the next space

4) Print all characters between the last space and this one.

You'll need to start thinking in a different manner: two layers of tokenization.

  1. Search for a quotation mark
  2. On odd-numbered strings, perform your original program (search for spaces)
  3. On even-numbered strings, print blindly

In this case, even-numbered strings are (ideally) within quotes. ab"cd"ef would result in ab being odd, cd being even, and so on.

The other side is remembering what you need to do. What you're actually looking for (in regex) is either "[a-zA-Z0-9 \\t\\n]*" (including the quotes) or [a-zA-Z0-9]+. The only difference between the two options is whether the text is delimited by quotes. So separate by quotes first, and identify from there.

Try altering your strategy.

Look at the non-white-space things; then, when you find a quoted string, you can put it in one string value.

So, you need a function that examines the characters between white space. When you find '"' you can change the rules and hoover everything up to the matching '"'. If this function returns a TOKEN value and a value (the string matched), then whatever calls it can decide on the correct output. At that point you have written a tokeniser. Tools that generate these, called "lexers", actually exist, as they are widely used to implement programming languages and config files.

Assuming nextc() reads the next char from the string, with the scan started by firstc(str):

for (firstc(str); (c = nextc()) != '\0';) {
    if (isspace(c))
        continue;
    else if (c == '"')
        return readQuote;       /* Handle Quoted string */
    else
        return readWord;        /* Terminated by space & '"' */
}
return EOS;

You'll need to define return values for EOS, QUOTE and WORD, and a way to get the text in each Quote or Word.
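The tokeniser described above can be sketched in plain C as follows. This is only one possible shape, not the answerer's code: `struct lexer`, `next_token` and the `TOK_*` names are hypothetical, and the sketch assumes that an unterminated quote simply runs to the end of the string. Instead of copying text, it hands back a pointer and a length into the original string.

```c
#include <ctype.h>
#include <stddef.h>

/* Token kinds, mirroring the EOS, WORD and QUOTE values mentioned above. */
enum token_kind { TOK_EOS, TOK_WORD, TOK_QUOTE };

/* Hypothetical cursor type: remembers where the next read starts. */
struct lexer {
    const char *pos;
};

/*
 * Reads the next token starting at lx->pos.
 * For TOK_WORD and TOK_QUOTE, *start and *len describe the token text
 * inside the original string; the string itself is never modified.
 */
enum token_kind next_token(struct lexer *lx, const char **start, size_t *len)
{
    const char *p = lx->pos;

    /* Skip white space between tokens. */
    while (*p != '\0' && isspace((unsigned char)*p))
        p++;

    if (*p == '\0') {
        lx->pos = p;
        return TOK_EOS;
    }

    if (*p == '"') {
        /* Quoted string: take everything up to the matching quote. */
        p++;
        *start = p;
        while (*p != '\0' && *p != '"')
            p++;
        *len = (size_t)(p - *start);
        /* Step over the closing quote if there is one (assumption:
           an unterminated quote extends to the end of the string). */
        lx->pos = (*p == '"') ? p + 1 : p;
        return TOK_QUOTE;
    }

    /* Plain word: ends at white space, a quote, or the end of the string. */
    *start = p;
    while (*p != '\0' && !isspace((unsigned char)*p) && *p != '"')
        p++;
    *len = (size_t)(p - *start);
    lx->pos = p;
    return TOK_WORD;
}
```

A caller would initialise `struct lexer lx = { str };`, call `next_token` in a loop until it returns `TOK_EOS`, and print each token with `printf("%.*s\n", (int)len, start)`. Because the input is left untouched, this also works on string literals, which `strtok` cannot handle.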

Here's code that works, in C.

The idea is that you first tokenize on the quote, since that takes priority (if a string is inside quotes then we don't tokenize it, we just print it). Then, for each of those tokenized strings, we tokenize within that string on the space character, but only for alternate strings, because alternate strings will be in and out of the quotes.

#include <stdio.h>
#include <string.h>
#include <stdbool.h>

int main() {
  char *pch1, *pch2, *save_ptr1, *save_ptr2;
  char str[] = "hello \"Stack Overflow\" good luck!";
  pch1 = strtok_r(str,"\"", &save_ptr1);
  bool in = false;
  while (pch1 != NULL) {
    if(in) {
      printf ("%s\n", pch1);
      pch1 = strtok_r(NULL, "\"", &save_ptr1);
      in = false;
      continue;
    }
    pch2 = strtok_r(pch1, " ", &save_ptr2);
    while (pch2 != NULL) {
      printf ("%s\n",pch2);
      pch2 = strtok_r(NULL, " ", &save_ptr2);
    }
    pch1 = strtok_r(NULL, "\"", &save_ptr1);
    in = true;
  }
}
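One caveat with the code above: `strtok_r` skips leading delimiters and treats a run of delimiters as one, so the in/out alternation drifts when the input starts with a quote or contains an empty `""`. For example, `"Stack Overflow" hello` would have its quoted part split on spaces. A possible variant, sketched here with a hypothetical helper named `split_quoted`, finds the quotes with `strchr` so that empty segments still flip the state. It collects token pointers into a caller-supplied array instead of printing them, and it assumes that an unterminated quote extends to the end of the string.

```c
#include <string.h>

/*
 * Splits `s` in place on spaces, keeping quoted runs together.
 * Stores up to `max` token pointers in `out` and returns how many
 * were stored. Hypothetical helper, not part of the answer above.
 */
size_t split_quoted(char *s, char **out, size_t max)
{
    size_t n = 0;
    int in_quotes = 0;      /* flips at every quote character */
    char *seg = s;

    while (seg != NULL) {
        /* End the current segment at the next quote, if any.
           Unlike strtok, strchr does not skip empty segments. */
        char *q = strchr(seg, '"');
        if (q != NULL)
            *q = '\0';

        if (in_quotes) {
            /* Inside quotes: the whole segment is one token
               (an empty "" is dropped in this sketch). */
            if (*seg != '\0' && n < max)
                out[n++] = seg;
        } else {
            /* Outside quotes: split the segment on spaces. */
            char *w = seg;
            while (*w != '\0') {
                w += strspn(w, " ");            /* skip leading spaces */
                if (*w == '\0')
                    break;
                char *end = w + strcspn(w, " ");
                char *next = (*end != '\0') ? end + 1 : end;
                *end = '\0';
                if (n < max)
                    out[n++] = w;
                w = next;
            }
        }

        in_quotes = !in_quotes;
        seg = (q != NULL) ? q + 1 : NULL;
    }
    return n;
}
```

Like `strtok`, this writes terminators into the buffer, so it needs a modifiable array such as `char str[] = "..."` rather than a string literal.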

Here it is in C++. I am sure it can be written more elegantly, but it works and is a start:

#include <iostream>
#include <stdexcept>
#include <vector>
#include <string>

using namespace std;

using Tokens = vector<string>;


Tokens split(string const & sentence) {
  Tokens tokens;
  // indexes to split on
  string::size_type from = 0, to;

  // true if we are inside quotes: we don't split by spaces and we expect a closing quote
  // false otherwise
  bool in_quotes = false;

  while (true) {
    // compute to index
    if (!in_quotes) {
      // find next space or quote
      to = sentence.find_first_of(" \"", from);
      if (to != string::npos && sentence[to] == '\"') {
        // we found an opening quote
        in_quotes = true;
      }
    } else {
      // find next quote (ignoring spaces)
      to = sentence.find('\"', from);
      if (to == string::npos) {
        // no enclosing quote found, invalid string
        throw invalid_argument("missing enclosing quotes");
      }
      in_quotes = false;
    }
    // skip empty tokens
    if (from != to) {
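      // last token: take the rest, unless nothing is left
      // (the sentence ended with a quote or a space)
      if (to == string::npos) {
        if (from < sentence.size()) {
          tokens.push_back(sentence.substr(from));
        }
        break;
      }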
      tokens.push_back(sentence.substr(from, to - from));
    }
    // move from index
    from = to + 1;
  }
  return tokens;
}

Test it:

void splitAndPrint(string const & sentence) {
  Tokens tokens;
  cout << "-------------" << endl;
  cout << sentence << endl;
  try {
    tokens = split(sentence);
  } catch (exception &e) {
    cout << e.what() << endl;
    return;
  }
  for (const auto &token : tokens) {
    cout << token << endl;
  }
  cout << endl;
}

int main() {
  splitAndPrint("hello \"Stack Overflow\" good luck!");
  splitAndPrint("hello \"Stack Overflow\" good luck from \"User Name\"");
  splitAndPrint("hello and good luck!");
  splitAndPrint("hello and \" good luck!");

  return 0;
}

Output:

-------------
hello "Stack Overflow" good luck!
hello
Stack Overflow
good
luck!

-------------
hello "Stack Overflow" good luck from "User Name"
hello
Stack Overflow
good
luck
from
User Name

-------------
hello and good luck!
hello
and
good
luck!

-------------
hello and " good luck!
missing enclosing quotes
