Split a string on consecutive delimiters

Question

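I would like to split a string on a specific sequence of characters, but only when those characters appear together and in that order.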

#include <stdio.h>
#include <string.h>
#include <stdlib.h>

int main()
{
  int i = 0;
  char **split;
  char *tmp;

  split = malloc(20 * sizeof(char *));
  tmp  = malloc(20 * 12 * sizeof(char));
  for(i=0;i<20;i++)
  {
    split[i] = &tmp[12*i];
  }

  char *line;
  line = malloc(50 * sizeof(char));

  strcpy(line, "Test - Number -> <10.0>");
  printf("%s\n", line);
  i = 0;

  while( (split[i] = strsep(&line, " ->")) != NULL)
  {
    printf("%s\n", split[i]);
    i++;
  }
}

This prints:

Test 
Number
<10.0

But I only want to split around ->, so that the output is:

Test - Number
<10.0>

Answer 1

I think the best way to split on an ordered sequence of delimiter characters is to replicate the behaviour of strtok_r using strstr, like this:

#include <stdio.h>
#include <string.h>

char *substrtok_r(char *str, const char *substrdelim, char **saveptr)
{
    char *haystack;

    if(str)
        haystack = str;
    else
        haystack = *saveptr;

    char *found = strstr(haystack, substrdelim);

    if(found == NULL)
    {
        *saveptr = haystack + strlen(haystack);
        return *haystack ? haystack : NULL;
    }

    *found = 0;
    *saveptr = found + strlen(substrdelim);

    return haystack;
}


int main(void)
{
    char line[] = "a -> b -> c -> d; Test - Number -> <10.0> ->No->split->here";

    char *input = line;
    char *token;
    char *save;

    while((token = substrtok_r(input, " ->", &save)) != NULL)
    {
        input = NULL;
        printf("token: '%s'\n", token);
    }

    return 0;
}

It behaves like strtok_r, but it only splits where the whole substring is found. The output is:

$ ./a 
token: 'a'
token: ' b'
token: ' c'
token: ' d; Test - Number'
token: ' <10.0>'
token: 'No->split->here'

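As with strtok and strtok_r, the source string must be modifiable, because the function writes '\0' terminating bytes into it to create and return the tokens.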

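Edit

Hi, would you mind explaining why '*found = 0' means the return value is only the string between the delimiters? I don't really understand what is going on here or why it works. Thanks

The first thing you have to understand is how strings work in C. A string is just a sequence of bytes (characters) that ends with the '\0' terminating byte. I put characters in parentheses because a character in C is just a 1-byte value (on most systems a byte is 8 bits long), and the integer values that represent characters are the ones defined in the ASCII table, which are 7-bit values. In that table, the value 97 represents the character 'a', the value 98 represents 'b', and so on.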

char x = 'a';

is, on an ASCII-based system, the same as doing

char x = 97;

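The value 0 is special for strings. It is called NUL (the null character) or the '\0' terminating byte, and it tells functions where a string ends. A function like strlen, which returns the length of a string, works by counting bytes until it reaches a byte with the value 0.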

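That is why strings are stored in char arrays: a pointer to the array gives you the start of the memory block in which the sequence of chars is stored.

Let's look at this: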

char string[] = { 'H', 'e', 'l', 'l', 'o', 0, 48, 49, 50, 0 };

The memory layout of this array is

0     1     2     3     4     5    6     7     8     9
+-----+-----+-----+-----+-----+----+-----+-----+-----+----+
| 'H' | 'e' | 'l' | 'l' | 'o' | \0 | '0' | '1' | '2' | \0 |
+-----+-----+-----+-----+-----+----+-----+-----+-----+----+

or, to be more precise, with the integer values

0    1     2     3     4     5   6    7    8    9
+----+-----+-----+-----+-----+---+----+----+----+---+
| 72 | 101 | 108 | 108 | 111 | 0 | 48 | 49 | 50 | 0 |
+----+-----+-----+-----+-----+---+----+----+----+---+

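Note that the value 0 represents '\0', 48 represents '0', 49 represents '1' and 50 represents '2'. If you do

printf("%zu\n", strlen(string));

the output will be 5. strlen finds the value 0 at position 5 and stops counting. But string actually stores two strings, because a new sequence of characters starts at position 6 and is also terminated by 0, which makes it a second valid string in the same array. To access it, you need a pointer that points just past the first 0 value.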

printf("1. %s\n", string);
printf("2. %s\n", string + strlen(string) + 1);

The output will be

1. Hello
2. 012

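Functions like strtok (and mine above) use this property to return substrings of a larger string without creating a copy (which would mean creating a new array, allocating memory dynamically, and using strcpy to make the copy).

Suppose you have this string: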

char line[] = "This is a sentence;This is another one";

Here you have only one string, because the '\0' terminating byte comes after the last 'e' of the string. However, if I do:

line[18] = 0;  // same as line[18] = '\0';

then I have created two strings in the same array:

"This is a sentence\0This is another one"

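because I replaced the semicolon ';' with '\0', which creates one string from position 0 to 18 and a second one from position 19 to 38. If I now print line: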

printf("string: %s\n", line);

the output will be

string: This is a sentence

Now let's take a look at the function itself:

char *substrtok_r(char *str, const char *substrdelim, char **saveptr);

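The first argument is the source string, the second argument is the delimiter string, and the third is a double pointer to char: you have to pass a pointer to a char pointer. It is used to remember where the function should resume scanning on the next call.

This is the algorithm: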

if str is not NULL:
    start a new scan sequence from str
otherwise
    resume scanning from string pointed to by *saveptr

find the position of substring_d, the delimiter pointed to by 'substrdelim'

if no such substring_d is found
    if the current character of the scanned text is \0
        no more substrings to return --> return NULL
    otherwise
        return the scanned text and set *saveptr to
        point to the \0 character of the scanned text,
        so that the next iteration ends the scanning
        by returning NULL

otherwise (a substring_d was found)

    create a new substring_a that ends where the found one begins
    by setting the first character of the found
    substring_d to 0.

    update *saveptr to the start of the found substring_d
    plus its length, so that *saveptr
    points past the delimiter sequence found in substring_d.

    return the newly created substring_a

The first part is easy to understand:

if(str)
    haystack = str;
else
    haystack = *saveptr;

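Here, if str is not NULL, a new scan sequence starts from str. That is why, in main, the input pointer is set to point to the start of the string stored in line. Every later call must be made with str == NULL, which is why the first thing the while loop does is set input = NULL; so that substrtok_r resumes scanning from *saveptr. This is the standard behaviour of strtok.

The next step is to look for the delimiting substring: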

char *found = strstr(haystack, substrdelim);

The next part handles the case where no delimiting substring is found²:

if(found == NULL)
{
    *saveptr = haystack + strlen(haystack);
    return *haystack ? haystack : NULL;
}

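*saveptr is updated to point past the whole source, so that it points to the '\0' terminating byte. The return line can be rewritten as

if(*haystack == '\0')
    return NULL;
else
    return haystack;

So if the source is already an empty string¹, NULL is returned. This means that no more substrings can be found and the caller should stop calling the function. This is also the standard behaviour of strtok.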

The last part

*found = 0;
*saveptr = found + strlen(substrdelim);

return haystack;

handles the case where the delimiting substring is found. Here

*found = 0;

is basically doing

found[0] = '\0';

which creates the substring as explained above. To make this clearer:

Before

*found = 0;
*saveptr = found + strlen(substrdelim);

return haystack;

the memory looks like this:

       +-----+-----+-----+-----+-----+-----+
       | 'a' | ' ' | '-' | '>' | ' ' | 'b' | ...
       +-----+-----+-----+-----+-----+-----+
       ^     ^
       |     |
haystack     found
*saveptr

After

*found = 0;
*saveptr = found + strlen(substrdelim);

the memory looks like this:

       +-----+------+-----+-----+-----+-----+
       | 'a' | '\0' | '-' | '>' | ' ' | 'b' | ...
       +-----+------+-----+-----+-----+-----+
       ^     ^                  ^
       |     |                  |
haystack     found              *saveptr
                                because strlen(substrdelim)
                                is 3

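Remember: if I do printf("%s\n", haystack); at this point, it prints a, because the character that found points to (the space that begins the delimiter) has been set to 0. As explained above, *found = 0 creates two strings out of one. strtok (and my function, which is based on strtok) uses the same technique. So when the function does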

return haystack;

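the first string stored in token is the text before the split. Eventually substrtok_r returns NULL and the loop exits, because substrtok_r returns NULL when no more splits can be made, just like strtok.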

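Footnotes

¹An empty string is a string whose first character is already the '\0' terminating byte.

²This is a very important part. Most standard functions in the C library, such as strstr, do not return a new string in memory, and they do not create a copy and return that copy (unless the documentation says so). They return a pointer to the original plus an offset.

On success, strstr returns a pointer to the start of the substring, and that pointer lies at some offset from the start of the source string.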

const char *txt = "abcdef";
char *p = strstr(txt, "cd");

Here strstr returns a pointer to the start of the substring "cd" inside "abcdef". To get the offset, compute p - txt, which tells you how many bytes apart the two pointers are:

b = base address where txt is pointing to

b     b+1   b+2   b+3   b+4   b+5   b+6
+-----+-----+-----+-----+-----+-----+------+
| 'a' | 'b' | 'c' | 'd' | 'e' | 'f' | '\0' |
+-----+-----+-----+-----+-----+-----+------+
^           ^
|           |
txt         p

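So txt points to address b and p points to address b+2. That is why p-txt gives the offset: (b+2) - b => 2. In other words, p points to the original address plus an offset of 2 bytes. This behaviour is the reason *found = 0; works in the first place.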

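Note that txt + 2 yields a new pointer that points to where txt points plus an offset of 2. This is called pointer arithmetic. It works like regular arithmetic, except that the compiler takes the size of the pointed-to object into account. char is defined to have size 1, so sizeof(char) is 1. But suppose you have an array of integers: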

int arr[] = { 7, 2, 1, 5 };

On my system an int has size 4, so an int object needs 4 bytes of memory. The array looks like this in memory:

b = base address where arr is stored

address       base        base + 4    base + 8    base + 12
in bytes      +-----------+-----------+-----------+-----------+
              |    7      |    2      |    1      |    5      |
              +-----------+-----------+-----------+-----------+
pointer       arr         arr + 1     arr + 2     arr + 3
arithmetic

Here arr + 1 yields a pointer to the location where arr is stored plus an offset of 4 bytes.

Solution 1
2 Accepted 2018-02-11 18:45:34
