C - strtok(), split the string on '\n' but keep the delimiter

I have the following problem with my C program. Part of its functionality is to read some text, split it into sentences, and then write those sentences to a file.

I used strtok() to split the chunk of text into sentences (a sentence ends when \n occurs). However, the output is wrong when a sentence contains nothing but the \n character, as in this chunk of text:

//////////////////////////////

Hello, this is some sample text
This is the second sentence

The sentence above is just a new line
This is the last sentence.

/////////////////////////////

The output of the file is as follows:

0 Hello, this is some sample text
1 This is the second sentence
2 The sentence above is just a new line
3 This is the last sentence.

////////////////////////////////////////////////////

While it should be:

0 Hello, this is some sample text
1 This is the second sentence
2
3 The sentence above is just a new line
4 This is the last sentence.

////////////////////////////////////

The file holding the strings should function as a log file. That is why I have to split the chunk of text into sentences at each \n and put an integer in front of each sentence before writing it to the file.

This is the code related to this functionality:

int counter = 0; // Used for counting
const char s[2] = "\n"; // Used for tokenization

// ............

char *token;
      token = strtok(input,s);
      while(token != NULL){
        fprintf(logs, "%d ", counter);
        fprintf(logs, "%s\n" , token); // Add the new line character here since it is removed from the tokenization process
        counter++;
        token = strtok(NULL, s);
      }

// .........

Is there a way to add a special case so that an "empty sentence" (a sentence that is just a \n character) is handled properly?

Perhaps another function would work instead of strtok()?

You should probably use strstr or strchr, as the comment suggests. If your application requires strtok for some reason, though, you can save the position where each sentence ends and use pointer arithmetic to detect that several newlines (\n) occurred in a row.

Rough, untested example code:

int counter = 0; // Used for counting
const char *last_sentence; // Where the next token would start if no blank lines intervened


// ............
      last_sentence = input;
      char *token;
      token = strtok(input, "\n");
      while (token != NULL) {
        const char *p;
        for (p = last_sentence; p < token; p++) {
          // Each '\n' that strtok skipped is one empty line.
          fprintf(logs, "%d \n", counter++);
        }
        fprintf(logs, "%d %s\n", counter++, token);

        // Step past this token and the '\n' that strtok overwrote with '\0'.
        // Blank lines after the last token are not reported.
        last_sentence = token + strlen(token) + 1;
        token = strtok(NULL, "\n");
      }

// .........

EDIT: added example with strchr

Using strchr is just as easy, if not easier, especially since you only have one delimiter. The code below takes your sentences and splits them out. It just prints them, but you could easily extend it for your purposes.

#include <stdio.h>
#include <string.h>
const char* sentences = "Hello, this is some sample text\n"
                        "This is the second sentence\n"
                        "\n"
                        "The sentence above is just a new line\n"
                        "This is the last sentence.\n";

void parse(const char* input){
  const char *start = input;
  const char *end;
  unsigned count = 0;

  // Each iteration prints one line, including its terminating '\n'.
  while( (end = strchr(start, '\n')) ){
      printf("%u %.*s", count++, (int)(end - start + 1), start);
      start = end + 1;
  }

  // Print any trailing text that has no final '\n'.
  if (*start != '\0'){
      printf("%u %s\n", count, start);
  }
}

int main(void){
  parse(sentences);
}

If you are reading your input from a file, you can open a stream with fopen() and read it line by line with getline() (a POSIX function).
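A minimal sketch of the getline() approach, assuming a POSIX system and two already-open streams. The helper name number_lines and its parameters are hypothetical, chosen only for illustration:

```c
#define _POSIX_C_SOURCE 200809L  /* exposes getline() on POSIX systems */
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>  /* ssize_t */

/* Hypothetical helper: copies each line of `in` to `logs`, prefixed with
 * a running counter. Returns the number of lines written. */
int number_lines(FILE *in, FILE *logs)
{
    char *line = NULL;   /* getline() allocates and grows this buffer */
    size_t cap = 0;
    ssize_t len;
    int counter = 0;

    while ((len = getline(&line, &cap, in)) != -1) {
        /* getline() keeps the '\n', so an empty line arrives as "\n". */
        if (len > 0 && line[len - 1] == '\n')
            line[len - 1] = '\0';
        fprintf(logs, "%d %s\n", counter++, line);
    }

    free(line);
    return counter;
}
```

Because getline() returns every line, including ones that hold only a newline, empty sentences get their own number without any special case.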

Otherwise, you can write a function that counts the number of \n characters, allocates an array of char*, and fills it line by line.
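One way that function might look, as a sketch only. The name split_lines and its signature are assumptions, and it modifies the input buffer in place, so the text must be writable:

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: splits `text` in place at every '\n' and returns a
 * malloc'd array of pointers to the start of each line. Empty lines are
 * kept as empty strings. Stores the line count in *count and returns NULL
 * if the allocation fails. */
char **split_lines(char *text, size_t *count)
{
    size_t n = 1;  /* a string with k newlines holds k + 1 lines */
    for (const char *p = text; (p = strchr(p, '\n')) != NULL; p++)
        n++;

    char **lines = malloc(n * sizeof *lines);
    if (lines == NULL)
        return NULL;

    size_t i = 0;
    char *start = text;
    char *end;
    while ((end = strchr(start, '\n')) != NULL) {
        *end = '\0';           /* terminate this line in place */
        lines[i++] = start;
        start = end + 1;
    }
    lines[i++] = start;        /* text after the last '\n' (may be empty) */

    *count = i;
    return lines;
}
```

The caller frees only the returned array, since the pointers refer into the original buffer. Note that text ending in '\n' yields a final empty entry, which the caller may want to skip.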

EDIT: if you don't want to write it yourself, a little searching will turn up an existing implementation.

You included the newline \n in the delimiter set for strtok.

If the input string is a valid read and the first call to strtok returns NULL, then it was a blank line, which you can then process.

token = strtok(input,s);
if(token == NULL) {
    fprintf(logs, "%d\n", counter);
    counter++;
}
while(token != NULL){                   // the `while` serves as `else`
    fprintf(logs, "%d ", counter);
    fprintf(logs, "%s\n" , token);
    counter++;
    token = strtok(NULL, s);
}
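The check above only catches blank lines if input holds one line at a time, for example when it is filled by fgets. The following sketch shows that arrangement under that assumption. The helper name log_lines and the 256-byte buffer are hypothetical, and lines longer than the buffer would be split across several entries:

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: reads `in` one line at a time and writes numbered
 * lines to `logs`. Assumes every line fits in the 256-byte buffer.
 * Returns the number of lines written. */
int log_lines(FILE *in, FILE *logs)
{
    char input[256];
    int counter = 0;

    while (fgets(input, sizeof input, in) != NULL) {
        /* With "\n" as the delimiter set, a line holding only '\n'
         * yields no token at all, so strtok returns NULL. */
        char *token = strtok(input, "\n");
        if (token == NULL)
            fprintf(logs, "%d\n", counter);
        else
            fprintf(logs, "%d %s\n", counter, token);
        counter++;
    }
    return counter;
}
```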
