
Reading .dat files with many variables in C: dynamic build of format string?

I am trying to convert .dat files from the Core Wave data of the Survey of Income and Program Participation. There are 15 waves, each with roughly 300,000-400,000 observations. Every wave has an identical layout with 1023 variables. The .dat file structure contains no delimiters, so each observation is a string of characters (mostly digits, but some "-" signs). I have parsed the data dictionary with Python to capture the variable names, start positions, and sizes. I also wrote a script that converts these files to DataFrames for use with pandas. The problem is that my script has been running for over 36 hours, and I need to speed this up dramatically.

Enter C. I am quite new to it, but all I want to do at this point is convert these .dat files to .csv. Yes, I could do a better job optimizing the Python using a couple of packages that Anaconda has developed, but this seemed like a nice, contained task to introduce myself to C. I have written a few small scripts to test this conversion, and here is the most relevant one:

/*This script tests a conversion from .dat to .csv*/
#include <stdio.h>

int main(void)
{
  /*Declare variables*/
  int var1;
  int var2;
  int var3;
  int i;

  /*Initialize pointers for source (test) and destination (test_out) files*/
  FILE *test; /* test = test.dat file pointer */
  FILE *test_out; /* test_out = test.csv file pointer */

  /*Attempt to open the source file, and if it doesn't work, tell me about it */
  if ((test=fopen("O:\\Analyst\\Marvin\\scrap\\test.dat","r"))==NULL)
    printf("Source file could not be opened\n");
  /*Attempt to open the destination file, and if it doesn't work, tell me about it */
  else if ((test_out=fopen("O:\\Analyst\\Marvin\\scrap\\test.csv","w"))==NULL)
    printf("Destination file could not be opened\n");
  /*If it does open, initiate read*/
  else{
    /*Write the headings to disk in the destination file*/
    fprintf(test_out,"%10s%10s%10s\n","Var 1,","Var 2,","Var 3");
    /*Initialize variables with the first row of data in the source file*/
    fscanf(test,"%3d%4d%3d",&var1,&var2,&var3);
    /*Initialize line counter*/
    i=1;
    /*For the remaining data lines in the source file...*/
    while (!feof(test)) {
      /*...write the last line's values to destination file...*/
      fprintf(test_out,"%9d%1s%9d%1s%10d\n",var1,",",var2,",",var3);
      /*...load the current line's values into the variable addresses...*/
      fscanf(test,"%3d%4d%3d",&var1,&var2,&var3);
      /*...and print (stdout) then iterate the counter*/
      printf("%20s %d\n","Writing line #",i++);
    }
    /*Once the EOF is reached, close the source and destination files*/
    fclose(test);
    fclose(test_out);
  }
  /*Return 0 if everything has gone smoothly*/
  return 0;
}

The source file contains the following values:

0123456789
0123456789
0123456789
0123456789
0123456789

It outputs the following to the destination:

Var 1,    Var 2,     Var 3
   12,     3456,       789
   12,     3456,       789
   12,     3456,       789
   12,     3456,       789
   12,     3456,       789

So, this is all well and good, but I am dealing with over 1000 variables. Not only is the idea of writing out a format string that long vomit-inducing, but it also strikes me as bad practice: way too many keystrokes. Given that I have the layout in an easily parseable file, I figure there has got to be some programmatic solution to this issue on both the input and output sides.

There is an unruly number of C-related input questions on SO. The sample I have reviewed never seems to get at this question of parsing such a large number of variables with a known layout. Someone please enlighten me.

To read 1023 integers of various widths, I suggest creating an array that gives the fixed width of each integer. Use that width as an index into a table of format strings. Use long long to handle fields up to 18 digits wide (a 19-digit value can exceed LLONG_MAX).

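#include <stdio.h>

#define MAX_WIDTH 4
// Read one record of n fixed-width integers; width[i] is the width of field i.
// Returns 0 on success, non-zero on failure.
int ReadLine(FILE *inf, const unsigned char *width, long long *data, size_t n) {
  // format[w] reads an integer of at most w characters
  static const char *format[MAX_WIDTH+1] = { 
      "%lld", "%1lld", "%2lld", "%3lld", "%4lld" };
  size_t i = 0;
  for (i = 0; i < n; i++) {
    if (width[i] > MAX_WIDTH) return 3;  // FAIL: width not in the format table
    if (fscanf(inf, format[width[i]], &data[i]) != 1) {
      return 1;  // FAIL
    }
  }
  // After the last field, expect end of line or end of file
  int eol = fgetc(inf);
  if (eol != '\n' && eol != EOF)
    return 2;  // FAIL
  return 0;
}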

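// Sample use - error handling kept minimal.
#define N (3)
int FormWidths(const char *fname, unsigned char *width, size_t n);  // defined below

int main(void) {
  FILE *inf = fopen("test.dat","r");
  if (!inf) return 1;
  long long data[N];

  unsigned char width[N];
  if (FormWidths("input.csv", width, N) != 0) { fclose(inf); return 1; }
  int rc = ReadLine(inf, width, data, N);
  fclose(inf);
  return rc;
}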
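
The fixed table of format strings in ReadLine has to be extended by hand whenever MAX_WIDTH grows. A possible alternative, sketched here rather than taken from the answer, is to build each format string at run time with snprintf. ReadField is a hypothetical helper name, and the 18-character cap is an assumption chosen so that any all-digit field fits in a long long. Note that the scanf family has no `*` width argument the way printf does (in scanf, `*` suppresses assignment), so the width has to be written into the string itself.

```c
#include <stdio.h>

/* Hypothetical helper: read one integer of at most `width` characters.
   Returns 1 on success, 0 on failure (bad width or failed conversion). */
static int ReadField(FILE *inf, int width, long long *out) {
  char fmt[16];

  /* Assumed cap: 18 characters always fit in a long long. */
  if (width < 1 || width > 18) return 0;

  /* "%%" emits a literal '%', so width 3 produces the string "%3lld". */
  snprintf(fmt, sizeof fmt, "%%%dlld", width);

  return fscanf(inf, fmt, out) == 1;
}
```

ReadLine could call ReadField(inf, width[i], &data[i]) in place of the table lookup. Building a format for every field costs one snprintf call each time, so caching one string per distinct width may be worthwhile with 1023 fields and hundreds of thousands of rows.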

OP explained that a .csv file exists from which the widths can be read, in the form "VarN,width".

int FormWidths(const char *fname, unsigned char *width, size_t n) {
  FILE *inf = fopen(fname, "r");
  if (!inf) return 1;  // error
  char buf[100];
  // Skip the table header; drop this block if the file has none
  if (fgets(buf, sizeof buf, inf) == NULL) {
    fclose(inf);
    return 1;
  }
  size_t i;
  for (i=0; i<n; i++) {
    if (fgets(buf, sizeof buf, inf) == NULL) break;
    int index, w;
    if (sscanf(buf, "Var%d ,%d", &index, &w) != 2) break;
    // Rows must appear in order, numbered from 0
    if (index < 0 || (size_t)index != i) break;
    // To allow wider fields, raise MAX_WIDTH and extend the format table to match
    if (w < 1 || w > MAX_WIDTH) break; 
    width[i] = (unsigned char)w;
  }
  fclose(inf);
  if (i != n) return 1; // Did not read .csv as expected
  return 0;
}
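
Since the data dictionary already gives start positions and sizes, another option is to skip fscanf entirely: read each record as a line, slice every field out by position, and write the values separated by commas. The following is a minimal sketch under several assumptions, not the answer's method: the Field struct, RecordToCsv, and ConvertFile are hypothetical names, start positions are 0-based, every field is an integer of at most 31 characters, and a record fits in 64 KiB.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical layout entry: 0-based start column and width of one variable. */
typedef struct {
  size_t start;
  size_t width;
} Field;

/* Convert one fixed-width record (no trailing newline) into a CSV line.
   Returns 0 on success, 1 if the record is too short, a field is too wide,
   or `out` is too small. */
static int RecordToCsv(const char *rec, const Field *layout, size_t n,
                       char *out, size_t outsize) {
  size_t reclen = strlen(rec);
  size_t used = 0;

  if (outsize == 0) return 1;
  out[0] = '\0';

  for (size_t i = 0; i < n; i++) {
    char tmp[32];
    if (layout[i].width >= sizeof tmp) return 1;
    if (layout[i].start + layout[i].width > reclen) return 1;

    /* Copy the field's characters and terminate them. */
    memcpy(tmp, rec + layout[i].start, layout[i].width);
    tmp[layout[i].width] = '\0';

    /* strtoll accepts leading blanks and a '-' sign; a blank field yields 0. */
    long long v = strtoll(tmp, NULL, 10);

    /* Append the value, preceded by a comma for every field but the first. */
    int k = snprintf(out + used, outsize - used, "%s%lld", i ? "," : "", v);
    if (k < 0 || (size_t)k >= outsize - used) return 1;
    used += (size_t)k;
  }
  return 0;
}

/* Convert every non-empty line of `in` to a CSV line on `out`.
   Returns the number of records written, or -1 on the first bad record. */
static long ConvertFile(FILE *in, FILE *out, const Field *layout, size_t n) {
  static char rec[65536], csv[65536];  /* assumed upper bounds */
  long count = 0;

  while (fgets(rec, sizeof rec, in) != NULL) {
    rec[strcspn(rec, "\r\n")] = '\0';  /* strip the line ending */
    if (rec[0] == '\0') continue;      /* skip blank lines */
    if (RecordToCsv(rec, layout, n, csv, sizeof csv) != 0) return -1;
    fprintf(out, "%s\n", csv);
    count++;
  }
  return count;
}
```

With the layout {0,3}, {3,4}, {7,3}, the sample row 0123456789 becomes 12,3456,789. Slicing by position keeps a field from running into the next line, which width-limited fscanf conversions do not guarantee because they skip leading whitespace, newlines included.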
