
C read file(s) line-by-line into array of Strings and sort

I want to create a basic C application, mysort, that takes a list of files, reads each of them line by line into a buffer, and sorts the lines alphabetically. The code looks something like this (plus parameter parsing, etc.):

//How do I initialize an array of 1024byte-Strings with an unknown amount of fields?
char** lines; 
int lineNum = 0;

for(int num_files = j; num_files < argc; num_files++){ //iterate through all files
  FILE * filepointer ;
  char * line = NULL;
  size_t len = 0;
  ssize_t read;

  filepointer = fopen(argv[num_files], "r");    
  if (filepointer == NULL)
    exit(EXIT_FAILURE);

  //TODO: write each line into a new spot of the array, this try doesn't work!

  while ((read = getline(&line, &len, filepointer)) != -1) { 
    //the lines may be assumed to be a max of 1024 bytes
    lines[lineNum] = malloc(1024 * sizeof(char)); 
    //lines[lineNum] = line;
    strcpy(lines[lineNum], line);
    lineNum++;
  }

  fclose(filepointer);
  if (line)
    free(line);

  //These values might be wrong, but that isn't the issue I'm addressing
  //just for illustration
  qsort(lines , argc - 1, sizeof(char *), cmpstringp);

  //do something with the sorted lines
}

Since I have to use qsort(3), I need to produce a char** holding all the lines at some point.

What's a good way to accomplish such a task? Do I need my own data structure in order to dynamically store several identical objects?

The char** array lines isn't initialized here, so the program doesn't work. But since the number of lines is completely unknown at the start of the program, its size can't be stated up front (unless you know a clever function to figure this out).

The only ways I have figured out so far are defining my own dynamic data structure (e.g. a linked list) or parsing all files twice in order to determine the number of lines that will be produced.

Both seem very inelegant to me, but maybe I'm just not accustomed to C code.

I see two ways of solving the problem:

1) Go through the file, counting the number of newline characters (and saving it into nl_count); then you can allocate lines like this:

int nl_count = 0;
int c;

while ((c = fgetc(fp)) != EOF)
   if (c == '\n')
      nl_count++;
...
lines = malloc(nl_count * sizeof(char *));


This way you will have to cover some special cases in your cmpstringp function, because you may get some lines which contain only '\n'.
(Edit 1: Actually, in either case you will have to check for this special case.)
(Edit 2: You can get an off-by-one bug, because the last line doesn't have to end with '\n'.)
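The counting pass, including the off-by-one case from edit 2, could be sketched roughly as follows. The helper name count_lines is made up for illustration, and the sketch assumes the stream is seekable (a regular file, not a pipe), since it rewinds for the second pass.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: count the lines in an open stream, then rewind it
 * so the caller can read the lines in a second pass.
 * A final line with no trailing '\n' is still counted as a line. */
size_t count_lines(FILE *fp)
{
    size_t count = 0;
    int c;
    int last = '\n';            /* so that an empty file yields zero lines */

    assert(fp != NULL);
    while ((c = fgetc(fp)) != EOF) {
        if (c == '\n')
            count++;
        last = c;
    }
    if (last != '\n')           /* unterminated final line */
        count++;
    rewind(fp);                 /* assumption: fp is seekable */
    return count;
}
```

With several input files you would sum the counts of all files before allocating lines once.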

2) Set some base size for lines and reallocate for more space when the actual number of lines read reaches this base size.

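#define BASE_SIZE 32
#define GROW_STEP 2

size_t size = BASE_SIZE;      /* current capacity of lines */
size_t lines_read = 0;        /* number of slots actually used */

lines = malloc(size * sizeof(char *));

while ((read = getline(&line, &len, fp)) != -1) {
   if (lines_read == size) {  /* array is full: grow it before storing */
       size *= GROW_STEP;
       /* in real code, assign to a temporary and check for NULL */
       lines = realloc(lines, size * sizeof(char *));
   }
   lines[lines_read] = strdup(line);
   lines_read++;
}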

Notice that in the worst case you will allocate twice as much space as is really needed.
Also, since strdup() allocates memory, you should free each string once you are done with it.

...
for (i = 0; i < lines_read; i++)
    free(lines[i]);

//How do I initialize an array of 1024byte-Strings with an unknown amount of fields?

Obviously, you don't. If you initialize something, then at that point you know all the details of that thing.

I suppose you're asking how to reserve memory for an unknown number of string pointers, but again, you don't. Moreover, note that the 1024-byte restriction is unnecessary for an array of char * such as you propose; it would be relevant only if you intended to structure the data as a 2D array of char. After you have read a string, you know how much space it requires, so for example, I observe that this code ...

  //the lines may be assumed to be a max of 1024 bytes
  lines[lineNum] = malloc(1024 * sizeof(char));
  //lines[lineNum] = line;
  strcpy(lines[lineNum], line);

... would be both simpler and without inherent size limit if it were written as:

    lines[lineNum] = strdup(line);

In fact, that would use less space, too, in the event that your lines average fewer than 1023 characters.
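For illustration, here is roughly what an exact-size copy looks like when written out in ISO C. The helper name copy_line is hypothetical, and stripping the trailing '\n' that getline() leaves in the buffer is an extra step I am assuming you want; strdup() itself copies the newline too.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: return a heap copy of `line` sized to its content,
 * without a trailing '\n' if one is present.
 * Returns NULL if malloc fails; the caller frees the result. */
char *copy_line(const char *line)
{
    size_t n;
    char *copy;

    assert(line != NULL);
    n = strlen(line);
    if (n > 0 && line[n - 1] == '\n')
        n--;                        /* assumption: drop the newline */
    copy = malloc(n + 1);           /* exactly the space this line needs */
    if (copy != NULL) {
        memcpy(copy, line, n);
        copy[n] = '\0';
    }
    return copy;
}
```

In the reading loop this would stand in for the strdup() call, e.g. lines[lineNum] = copy_line(line);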

With respect to space for the overall array, what you can do is reserve memory in increments as you go. That could mean initially malloc()ing space for several strings, and realloc()ing to get more space when needed. It could also mean initially reading the strings into a linked list of either individual strings or fixed-size arrays of strings, and then building your monolithic array after you know how many strings there are.

The linked-list alternative transiently requires twice as much storage for the string pointers, but that's not too bad because the string contents do not need to be duplicated. It has the advantage of relatively low memory allocation cost compared with some naive implementations of the malloc() / realloc() approach.
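A minimal sketch of the linked-list variant, using one node per string, might look like this. The names struct node, list_push and list_to_array are made up for illustration; only the string pointers are moved, never the string contents.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical node type: one string pointer per node. */
struct node {
    char *text;
    struct node *next;
};

/* Prepend a string pointer to the list.
 * Returns the new head, or NULL if malloc fails (the old list is untouched). */
struct node *list_push(struct node *head, char *text)
{
    struct node *n = malloc(sizeof *n);
    if (n == NULL)
        return NULL;
    n->text = text;
    n->next = head;
    return n;
}

/* Move the `count` string pointers of the list into one array, freeing the
 * nodes along the way. The list is in reverse read order, so the array is
 * filled from the back to restore the original order.
 * On allocation failure returns NULL and leaves the list intact. */
char **list_to_array(struct node *head, size_t count)
{
    char **lines = malloc((count ? count : 1) * sizeof *lines);
    size_t i = count;

    if (lines == NULL)
        return NULL;
    while (head != NULL) {
        struct node *next = head->next;
        assert(i > 0);              /* count must match the list length */
        lines[--i] = head->text;
        free(head);
        head = next;
    }
    return lines;
}
```

The caller keeps a running count while pushing, then hands the resulting array to qsort().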

Because reallocation usually requires copying all the data (in this case, the pointers) from the one block to a new, larger one, you generally want to limit the number of reallocations. The usual strategy for this in a case such as yours is to step up the allocation sizes geometrically instead of linearly. That is, each time you find you need more space, you allocate new space sufficient for, say, twice as many strings as you already have. The total cost of that scales linearly with the number of data items. Although it may seem wasteful in the event that it turns out you needed only a little more space, it still doesn't require any more space than a linked list plus conversion to a dynamic array would require.
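As a rough sketch of the doubling strategy: the struct line_array type, the line_array_append name and the starting capacity of 32 are all my own choices for illustration. The comparison function takes the name cmpstringp from your code, and its body follows the usual pattern for sorting an array of char * with qsort(); your actual cmpstringp may differ.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable array of string pointers. */
struct line_array {
    char **items;
    size_t used;        /* slots holding a string */
    size_t capacity;    /* slots allocated */
};

/* Append one string pointer, doubling the capacity when the array is full.
 * Returns 0 on success, -1 if realloc fails (the array stays valid). */
int line_array_append(struct line_array *a, char *s)
{
    assert(a != NULL);
    if (a->used == a->capacity) {
        size_t new_cap = a->capacity ? a->capacity * 2 : 32;
        char **tmp = realloc(a->items, new_cap * sizeof *tmp);
        if (tmp == NULL)
            return -1;
        a->items = tmp;
        a->capacity = new_cap;
    }
    a->items[a->used++] = s;
    return 0;
}

/* Comparison for qsort(): the elements are char *, so each argument
 * is really a pointer to a char *. */
int cmpstringp(const void *p1, const void *p2)
{
    return strcmp(*(char *const *)p1, *(char *const *)p2);
}
```

Starting from a zero-initialized struct line_array a, you would append each copied line inside the getline() loop and sort once at the end with qsort(a.items, a.used, sizeof *a.items, cmpstringp). Using a temporary for the realloc() result keeps the old block reachable if the call fails.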
