Parsing text in C

I have a file like this:

...
words 13
more words 21
even more words 4
...

(The general format is a string of non-digits, then a space, then any number of digits and a newline.)

I'd like to parse every line, putting the words into one field of a structure and the number into the other. Right now I am using an ugly hack: reading the line while the chars are not digits, then reading the rest. I believe there's a clearer way.

Edit: You can use pNum - buf to get the length of the alphabetical part of the string, and use strncpy() to copy that into another buffer. Be sure to add a '\0' to the end of the destination buffer, which needs room for len + 1 bytes. I would insert this code before the pNum++.

/* pNum still points at the last space here, so len excludes it */
size_t len = (size_t)(pNum - buf);
strncpy(newBuf, buf, len);
newBuf[len] = '\0';

You could read the entire line into a buffer and then use:

char *pNum;
if ((pNum = strrchr(buf, ' ')) != NULL) {
  pNum++;   /* step past the space to the first digit */
}

to get a pointer to the number field.
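Putting the strrchr() lookup and the copy together might look like the sketch below. The struct entry type, its field names, and the function name split_at_last_space are assumptions made for illustration; they do not appear in the question.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical record type; the field names are illustrative only. */
struct entry {
    char words[128];
    long number;
};

/* Split a line such as "some words 42" at its last space.
 * Returns 0 on success, or -1 if the line has no space, the text
 * part does not fit in out->words, or the tail is not a number. */
int split_at_last_space(const char *buf, struct entry *out)
{
    assert(buf != NULL && out != NULL);

    const char *pNum = strrchr(buf, ' ');
    if (pNum == NULL)
        return -1;

    size_t len = (size_t)(pNum - buf);   /* length of the text part */
    if (len >= sizeof out->words)
        return -1;

    char *end;
    long value = strtol(pNum + 1, &end, 10);
    if (end == pNum + 1 || (*end != '\0' && *end != '\n'))
        return -1;                       /* no digits, or trailing junk */

    memcpy(out->words, buf, len);
    out->words[len] = '\0';
    out->number = value;
    return 0;
}
```

Because the input is taken as const char *, the original line is left untouched, at the cost of one copy into the structure.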

fscanf(file, "%s %d", word, &value);

This gets the values directly into a string and an integer, and copes with variations in whitespace and numerical formats, etc.

Edit

Oops, I forgot that you had spaces between the words. In that case, I'd do the following. (Note that it truncates the original text in 'line'.)

// Scan to find the last space in the line
char *p = line;
char *lastSpace = NULL;
while (*p != '\0')
{
    if (*p == ' ')
        lastSpace = p;
    p++;
}

// No space at all means the line does not match the expected format
if (lastSpace == NULL)
    return("parse error");

// Replace the last space in the line with a NUL
*lastSpace = '\0';

// Advance past the NUL to the first character of the number field
lastSpace++;

// 'line' now holds only the words; the number follows the NUL
char *word = line;
int number = atoi(lastSpace);

You can solve this using stdlib functions, but the above is likely to be at least as efficient, as you're only searching for the characters you are interested in.

Given the description, I think I'd use a variant of this (now tested) C99 code:

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <ctype.h>

struct word_number
{
    char word[128];
    long number;
};

int read_word_number(FILE *fp, struct word_number *wnp)
{
    char buffer[140];
    if (fgets(buffer, sizeof(buffer), fp) == 0)
        return EOF;
    size_t len = strlen(buffer);
    if (buffer[len-1] != '\n')  // Error if line too long to fit
        return EOF;
    buffer[--len] = '\0';
    char *num = &buffer[len-1];
    while (num > buffer && !isspace((unsigned char)*num))
        num--;
    if (num == buffer)         // No space in input data
        return EOF;
    char *end;
    wnp->number = strtol(num+1, &end, 0);
    if (end == num+1 || *end != '\0')  // Missing or invalid number as last word on line
        return EOF;
    *num = '\0';
    if ((size_t)(num - buffer) >= sizeof(wnp->word))  // Non-number part too long
        return EOF;
    memcpy(wnp->word, buffer, num - buffer + 1);  // Copy the terminating '\0' too
    return(0);
}

int main(void)
{
    struct word_number wn;
    while (read_word_number(stdin, &wn) != EOF)
        printf("Word <<%s>> Number %ld\n", wn.word, wn.number);
    return(0);
}

You could improve the error reporting by returning different values for different problems. You could make it work with dynamically allocated memory for the word portion of the lines. You could make it work with longer lines than I allow. You could scan backwards over digits instead of non-spaces, but scanning over non-spaces allows the user to write "abc 0x123" and have the hex value handled correctly. You might prefer to ensure there are no digits in the word part; this code does not care.
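As a rough sketch of the first suggestion, the parsing step could return a distinct status for each failure. The enum parse_status, its members, and the function name parse_word_number are hypothetical names introduced here; the function works on a line that has already been read, rather than on a FILE *.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical status codes, one per failure mode. */
enum parse_status {
    PARSE_OK = 0,
    PARSE_NO_SPACE,       /* no whitespace separating words and number */
    PARSE_BAD_NUMBER,     /* last field is empty or not a valid number */
    PARSE_WORD_TOO_LONG   /* word part does not fit in the caller's buffer */
};

enum parse_status parse_word_number(const char *line, char *word,
                                    size_t word_size, long *number)
{
    assert(line != NULL && word != NULL && number != NULL && word_size > 0);

    size_t len = strlen(line);
    if (len > 0 && line[len - 1] == '\n')
        len--;                            /* ignore a trailing newline */

    /* Scan backwards over non-spaces; i ends up one past the last space. */
    size_t i = len;
    while (i > 0 && !isspace((unsigned char)line[i - 1]))
        i--;
    if (i == 0)
        return PARSE_NO_SPACE;

    char *end;
    long value = strtol(line + i, &end, 0);   /* base 0 accepts 0x123 too */
    if (end == line + i || end != line + len)
        return PARSE_BAD_NUMBER;

    size_t word_len = i - 1;                  /* exclude the separating space */
    if (word_len >= word_size)
        return PARSE_WORD_TOO_LONG;

    memcpy(word, line, word_len);
    word[word_len] = '\0';
    *number = value;
    return PARSE_OK;
}
```

A caller can then switch on the result and report, for example, the line number together with the specific problem.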

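You could try tokenizing each line with strtok(), then check whether each token is a number or a word (a fairly simple check once you have the token string: just look at the first character of the token).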
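A minimal sketch of that idea follows, assuming the words should be rejoined with single spaces into one buffer. The function name tokenize_line and its parameters are illustrative, not from the answer.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Tokenize `line` in place (strtok overwrites the delimiters with NULs).
 * Tokens starting with a digit are treated as the number; all other
 * tokens are appended to `words`, separated by single spaces.
 * Returns 0 if a number was found, -1 otherwise or if `words` is too small. */
int tokenize_line(char *line, char *words, size_t words_size, long *number)
{
    assert(line != NULL && words != NULL && number != NULL && words_size > 0);

    int found_number = 0;
    size_t used = 0;
    words[0] = '\0';

    for (char *tok = strtok(line, " \t\n"); tok != NULL;
         tok = strtok(NULL, " \t\n")) {
        if (isdigit((unsigned char)tok[0])) {
            *number = strtol(tok, NULL, 10);
            found_number = 1;
        } else {
            size_t need = strlen(tok) + (used > 0 ? 1 : 0);
            if (used + need >= words_size)
                return -1;               /* not enough room for this word */
            if (used > 0)
                words[used++] = ' ';
            strcpy(words + used, tok);
            used += strlen(tok);
        }
    }
    return found_number ? 0 : -1;
}
```

Note that strtok() keeps hidden static state, so this is not safe to interleave with other strtok() calls or to use from several threads.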

Assuming that the number is immediately followed by '\n', you can read each line into a char buffer, use sscanf() on the line to get the number (a bare "%d" will not match while the line still starts with words, so the format has to skip the leading non-digits first), and then calculate how many chars this number takes up at the end of the text string.
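One way to make the sscanf() approach work is a scanset that consumes the non-digit prefix before the %d. This is a sketch under the question's stated format (no digits in the word part); the function name scan_line and the 128-byte buffer size are assumptions.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Parse "words words 42" using a scanset for the non-digit prefix.
 * Returns 0 on success, -1 if the line does not match. */
int scan_line(const char *line, char words[128], int *number)
{
    assert(line != NULL && words != NULL && number != NULL);

    /* %127[^0-9] reads up to 127 non-digit characters (this includes
     * the separating space); %d then reads the number. */
    if (sscanf(line, "%127[^0-9]%d", words, number) != 2)
        return -1;

    /* Trim the whitespace captured just before the number. */
    size_t len = strlen(words);
    while (len > 0 && (words[len - 1] == ' ' || words[len - 1] == '\t'))
        words[--len] = '\0';
    return 0;
}
```

The width limit in the format keeps sscanf() from overflowing the buffer; a word part longer than 127 characters simply fails to match.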

Depending on how complex your strings become, you may want to use the PCRE library. At least that way you can compile a Perl-style regular expression to split your lines. It may be overkill though.
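To illustrate the regular-expression idea without pulling in PCRE, the sketch below swaps in the POSIX <regex.h> interface (regcomp/regexec), which is a different, simpler API available on POSIX systems. The pattern, the function name regex_split, and its parameters are assumptions for illustration.

```c
#include <assert.h>
#include <regex.h>
#include <stdlib.h>
#include <string.h>

/* Split a line with a POSIX extended regular expression.
 * Group 1: non-digit text ending in a non-space character.
 * Group 2: the digits. An optional trailing newline is allowed.
 * Returns 0 on success, -1 on no match or if `words` is too small. */
int regex_split(const char *line, char *words, size_t words_size, long *number)
{
    assert(line != NULL && words != NULL && number != NULL && words_size > 0);

    regex_t re;
    regmatch_t m[3];
    if (regcomp(&re, "^([^0-9]*[^0-9 ]) +([0-9]+)\n?$", REG_EXTENDED) != 0)
        return -1;

    int rc = regexec(&re, line, 3, m, 0);
    regfree(&re);
    if (rc != 0)
        return -1;

    size_t len = (size_t)(m[1].rm_eo - m[1].rm_so);
    if (len >= words_size)
        return -1;

    memcpy(words, line + m[1].rm_so, len);
    words[len] = '\0';
    *number = strtol(line + m[2].rm_so, NULL, 10);
    return 0;
}
```

Compiling the pattern on every call keeps the sketch short; real code that parses many lines would compile it once and reuse the regex_t.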

Given the description, here's what I'd do: read each line as a single string using fgets() (making sure the target buffer is large enough), then split the line using strtok(). To determine if each token is a word or a number, I'd use strtol() to attempt the conversion and check the error condition. Example:

#include <stdlib.h>
#include <stdio.h>
#include <string.h>

/* Illustrative limits; adjust to suit the data */
#define MAX_LINE_LENGTH 256
#define MAX_STR_SIZE    64
#define MAX_NUM_STRINGS 32

/**
 * Read the next line from the file, splitting the tokens into 
 * multiple strings and a single integer. Assumes input lines
 * never exceed MAX_LINE_LENGTH and each individual string never
 * exceeds MAX_STR_SIZE.  Otherwise things get a little more
 * interesting.  Also assumes that the integer is the last 
 * thing on each line.  
 */
int getNextLine(FILE *in, char (*strs)[MAX_STR_SIZE], int *numStrings, int *value)
{
  char buffer[MAX_LINE_LENGTH];
  int rval = 1;
  if (fgets(buffer, sizeof buffer, in))
  {
    /* Treat the newline as a delimiter so it never ends up in a token */
    char *token = strtok(buffer, " \n");
    *numStrings = 0;
    while (token) 
    {
      char *chk;
      long n = strtol(token, &chk, 10);
      if (chk != token && *chk == '\0')
      {
        /* The whole token converted cleanly, so it is the number */
        *value = (int) n;
      }
      else if (*numStrings < MAX_NUM_STRINGS)
      {
        strcpy(strs[(*numStrings)++], token);
      }
      token = strtok(NULL, " \n");
    }
  }
  else
  {
    /** 
     * fgets() hit either EOF or error; either way return 0
     */
    rval = 0;
  }
  return rval;
}
/**
 * sample main
 */
int main(void)
{
  FILE *input;
  char strings[MAX_NUM_STRINGS][MAX_STR_SIZE];
  int numStrings;
  int value;

  input = fopen("datafile.txt", "r");
  if (input)
  {
    /* The array decays to char (*)[MAX_STR_SIZE], matching the parameter */
    while (getNextLine(input, strings, &numStrings, &value))
    {
      /**
       * Do something with strings and value here
       */
    }
    fclose(input);
  }
  return 0;
}
