
Use of array of arrays of string in C for parsing text file

I would like to read from N text files (all with a similar structure: a few lines, each with the same small number of words) and store the words in a string matrix, so that each (row, col) position holds one word.

A simple specimen of the files (two lines, three words per line) is the following:

line1word1 line1word2 line1word3
line2word1 line2word2 line2word3

The delimiter for the words is a space.

I have attempted this code:

#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_STRING_LENGTH 1000
#define MAX_TOKS 100
#define DELIMITERS " "

// line parsing utility
int parseString(char* line, char*** argv) {

  char* buffer;
  int argc;

  buffer = (char*) malloc(strlen(line) * sizeof(char));
  strcpy(buffer,line);
  (*argv) = (char**) malloc(MAX_TOKS * sizeof(char**));

  argc = 0;  
  (*argv)[argc++] = strtok(buffer, DELIMITERS);
  while ((((*argv)[argc] = strtok(NULL, DELIMITERS)) != NULL) &&
     (argc < MAX_TOKS)) ++argc;
  return argc; 
}


int main() {

  char S[MAX_STRING_LENGTH];
  char **A;

  int  n,i,j,l;

  FILE *f;
  char file[50];

  char ***matrix;
  matrix = malloc(MAX_TOKS * sizeof(char**));

 //memory allocation for matrix
 for (i = 0; i < MAX_TOKS; i++)
     {
       matrix[i] = malloc(MAX_TOKS * sizeof(char *));
       for (j = 0; j < MAX_TOKS; j++)
           {
           matrix[i][j] = malloc(MAX_TOKS * sizeof(char));
           }
     }

  int NFILE = 10; // number of files to be read

  for(i=0;i<NFILE;i++) 
    {  
    sprintf(file,"file%d.txt",i); 
    f = fopen(file,"r");

    l=0; // line-in-file index
    while(fgets(S,sizeof(S),f)!=NULL) {
          n = parseString(S,&A);
          for(j=0;j<n;j++) {
            matrix[i][l]=A[j];
            printf("%s\t%s\n",matrix[i][l],A[j]); 
            } 
        l++;
        } 
 fclose(f); 
    }

free(matrix);
free(A);    
return(0);  
}

The problem I can't solve shows up when I check the correspondence between the arrays (to be sure I am storing the single words correctly) using

printf("%s\t%s\n",matrix[i][l],A[j]);

I find that the last word (and only the last one) of each line, regardless of the file number, is not stored in matrix. That is to say, line1word1 and line1word2 of file0 are correctly stored in matrix[0][0][0] and matrix[0][0][1], but the field matrix[0][0][2] does not contain line1word3, even though A[2] has it!

What am I doing wrong? Any suggestions?

Many thanks in advance, cheers

char ***matrix doesn't declare a three dimensional array. Your matrix would need to be something like char *matrix[a][b] to hold a two dimensional array of string pointers. In order to calculate addresses within an array, the compiler needs to know all of the dimensions but one. If you think about it, you will probably see why...

If you have two arrays:

1 2 3        1  2  3  4  5  6  7
4 5 6        8  9 10 11 12 13 14
7 8 9       15 16 17 18 19 20 21

You can see that item[1][1] is NOT the same item in the two arrays. Regardless of the dimensions of your array, the elements are typically arranged sequentially in memory, with each row following the previous one (or possibly each column, depending on the language, I suppose). If you have an array of pointers, the actual content may be elsewhere, but the pointers would be arranged like this. So, in my examples above, you must provide the compiler with the number of columns so that it can find members (the number of rows can be variable). In a three dimensional array, you must provide TWO dimensions (in C, every dimension except the first) so that the compiler can calculate item offsets.
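To make the layout concrete, here is a small sketch in C. The helper name flat_index and the array names small and wide are made up for this illustration; the helper computes the same row-major offset that the compiler derives from the column count.

```c
#include <stddef.h>

/* Hypothetical helper: offset of element (row, col) in a row-major
 * array with `cols` elements per row. The row count is not needed. */
size_t flat_index(size_t row, size_t col, size_t cols)
{
    return row * cols + col;
}

/* The two example arrays from above, stored flat, row after row. */
static const int small[9] = { 1, 2, 3, 4, 5, 6, 7, 8, 9 };
static const int wide[21] = { 1,  2,  3,  4,  5,  6,  7,
                              8,  9, 10, 11, 12, 13, 14,
                             15, 16, 17, 18, 19, 20, 21 };

/* item[1][1] in each array: same indices, different offsets. */
int small_item_1_1(void) { return small[flat_index(1, 1, 3)]; } /* offset 4 */
int wide_item_1_1(void)  { return wide[flat_index(1, 1, 7)]; }  /* offset 8 */
```

With three columns, item[1][1] is the element at offset 4 (the value 5); with seven columns it is the element at offset 8 (the value 9). The number of rows never enters the calculation.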

I hope that helps.

EDIT: You can have truly dynamic array dimensions by creating your own function to process all array item accesses. The function would need to know the dynamic dimensions and the item indexes so that it could calculate the appropriate address.
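A minimal sketch of such an accessor, assuming a grid of string pointers whose dimensions are only known at run time. The type word_grid and the functions grid_init, grid_at and grid_free are hypothetical names chosen for this example, not part of any library.

```c
#include <stdlib.h>

/* Hypothetical type: a rows x cols grid of string pointers. */
struct word_grid {
    size_t rows, cols;
    char **cells;               /* rows * cols pointers, row after row */
};

/* Allocates the grid and sets every cell to NULL.
 * Returns 0 on success, -1 on allocation failure.
 * Assumes rows and cols are both non-zero. */
int grid_init(struct word_grid *g, size_t rows, size_t cols)
{
    size_t i;

    g->rows = rows;
    g->cols = cols;
    g->cells = malloc(rows * cols * sizeof *g->cells);
    if (g->cells == NULL)
        return -1;
    for (i = 0; i < rows * cols; i++)
        g->cells[i] = NULL;
    return 0;
}

/* Returns the address of cell (row, col), or NULL if out of range. */
char **grid_at(struct word_grid *g, size_t row, size_t col)
{
    if (row >= g->rows || col >= g->cols)
        return NULL;
    return &g->cells[row * g->cols + col];
}

/* Releases the cell array; the strings themselves belong to the caller. */
void grid_free(struct word_grid *g)
{
    free(g->cells);
    g->cells = NULL;
}
```

A caller would write *grid_at(&g, line, column) = word; to store a word. Because every access goes through grid_at, the column count can be chosen at run time and out-of-range indexes are caught in one place.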

This looks wrong: buffer = (char*) malloc(strlen(line) * sizeof(char));

Firstly, there is no need to cast the result of malloc in C. If your code doesn't compile without the cast, there are two possible reasons:

  1. There is no prototype for malloc. Obviously this can cause problems, because without a prototype the compiler either assumes the function returns int or reports an error. This can cause your program to misbehave. To avoid this, #include <stdlib.h>.
  2. You're using a C++ compiler. Stop. Either program in C++ (and stop using malloc) or use a C compiler. If you want to use this code in a C++ project, compile your C code with a C compiler and link to it from your C++ build.

Secondly, sizeof(char) is always 1. There is no need to multiply by it.

Thirdly, a string is a sequence of characters ending at the first '\0'. This means a string always occupies at least 1 character, even if it is an empty string. What does strlen("") return? What is sizeof("")? You need to add 1 to make room for the '\0': buffer = malloc(strlen(line) + 1);

This looks slightly wrong: (*argv) = (char**) malloc(MAX_TOKS * sizeof(char**));
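For reference, strlen("") is 0 and sizeof("") is 1, because the array behind the literal still holds the terminating '\0'. A minimal sketch of the corrected copy, using a hypothetical helper name copy_line:

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a heap copy of `line`, or NULL if the
 * allocation fails. The caller frees the result. */
char *copy_line(const char *line)
{
    size_t len = strlen(line);       /* characters before the '\0' */
    char *buffer = malloc(len + 1);  /* + 1 for the '\0' itself */

    if (buffer == NULL)
        return NULL;
    memcpy(buffer, line, len + 1);   /* copies the terminator too */
    return buffer;
}
```

Without the + 1, strcpy writes one byte past the end of the allocation, which is undefined behavior even when the program appears to work.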

malloc returns a pointer to an object. *argv is a char **, which means it points to a char *. However, in this case the allocation is sized for char ** objects. The two pointer types aren't required to have the same size or representation. To avoid portability issues associated with this, follow the pattern variable = malloc(n * sizeof *variable); in this case, *argv = malloc(MAX_TOKS * sizeof **argv);

It gets more gritty as it goes. Forget everything you think you know about your code; pretend you're going to come back to this in 24 months. What are you going to think of this?
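As a short illustration of that pattern, here is a sketch with a hypothetical helper, alloc_tokens, that allocates the token array through the same char *** parameter that parseString uses:

```c
#include <stdlib.h>

/* Hypothetical helper: allocates room for n token pointers.
 * sizeof **argv is sizeof(char *), the type that *argv points to,
 * so the element size stays right even if the declaration changes. */
int alloc_tokens(char ***argv, size_t n)
{
    *argv = malloc(n * sizeof **argv);
    return *argv != NULL ? 0 : -1;   /* 0 on success, -1 on failure */
}
```

The benefit of sizeof *variable is that the element type is written in exactly one place, the declaration.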

argc = 0;  
(*argv)[argc++] = strtok(buffer, DELIMITERS);
while ((((*argv)[argc] = strtok(NULL, DELIMITERS)) != NULL) &&
   (argc < MAX_TOKS)) ++argc;

There's actually an off-by-one here, too. Assuming argc == MAX_TOKS, your loop would attempt to assign to (*argv)[MAX_TOKS], because the bounds check only happens after the assignment. The solution is to express your intent more clearly rather than attempting to cram as much code into one line as possible. How would you rewrite this? Here's what I'd do, in this situation:

char *arg;
size_t argc = 0;

arg = strtok(buffer, DELIMITERS);
while (arg != NULL && argc < MAX_TOKS - 1) {
    (*argv)[argc] = arg;
    argc++;
    arg = strtok(NULL, DELIMITERS);
}
(*argv)[argc] = NULL;   /* terminator; not counted in argc */

Note that your parsing loop doesn't increment argc when strtok returns NULL, so the function returns the number of tokens, not the position of the terminator. Supposing you had two tokens, your parsing function would return 2, and your display loop, for(j=0;j<n;j++), visits both of them. Don't change it to for (j = 0; j <= n; j++): that would pass the NULL terminator to %s. The rewrite above returns the same count; what it fixes is the off-by-one at MAX_TOKS.
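Two further things are worth checking in the question's code. They are offered as likely causes, not as a confirmed diagnosis. First, fgets keeps the trailing newline and DELIMITERS contains only a space, so the last token of each line would be "line1word3\n", which breaks the two-column printf output for exactly that word. Second, matrix[i][l] = A[j] assigns every word of a line to the same cell. The sketch below uses a hypothetical function parse_line and a hypothetical macro WORD_DELIMITERS; it treats any whitespace as a separator and gives each word its own allocation.

```c
#include <stdlib.h>
#include <string.h>

/* Assumption: any whitespace separates words, including the '\n'
 * that fgets leaves at the end of the line. */
#define WORD_DELIMITERS " \t\r\n"

/* Splits `line` into at most max_words words. Each word is copied into
 * its own malloc'd string and stored in words[0..count-1]; the caller
 * frees them. Returns the number of words stored, or -1 if the working
 * copy of the line cannot be allocated. If a word allocation fails,
 * parsing stops and the words stored so far are returned. */
int parse_line(const char *line, char **words, int max_words)
{
    int count = 0;
    size_t len = strlen(line);
    char *copy = malloc(len + 1);   /* strtok modifies its input */
    char *tok;

    if (copy == NULL)
        return -1;
    memcpy(copy, line, len + 1);

    tok = strtok(copy, WORD_DELIMITERS);
    while (tok != NULL && count < max_words) {
        size_t n = strlen(tok) + 1;

        words[count] = malloc(n);
        if (words[count] == NULL)
            break;
        memcpy(words[count], tok, n);
        count++;
        tok = strtok(NULL, WORD_DELIMITERS);
    }
    free(copy);
    return count;
}
```

With a per-file table such as char *rows[MAX_LINES][MAX_TOKS] (MAX_LINES being a made-up bound), the read loop could call n = parse_line(S, rows[l], MAX_TOKS); so that word j of line l lands in rows[l][j].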

Out of curiosity, which book are you reading?

Notice: the technical posts on this site follow the CC BY-SA 4.0 license. If you repost this content, please cite this site's URL or the original address.

 