SAS Proc Import CSV and missing data

I'm trying to import some datasets in SAS and join them. The only problem is that I get an error after joining them:

    proc import datafile='filepath/datasetA.csv'
    out = dataA
    dbms= csv
    replace;
    run;


    proc import datafile='filepath\datasetB.csv'
    out = dataB
    dbms= csv
    replace;
    run;



    /* combine them all into one dataset*/


    data DataC;
    set dataA dataB;

    run;



    ERROR: Variable column_k has been defined as both character and numeric

The column in question looks something like this in both of the data sets that I'm trying to join:

+----------+
| column_k |
+----------+
| 0        |
| 1        |
| 5        |
| 4        |
| NA       |
| NA       |
| 4        |
| 3        |
| NA       |
+----------+

Basically, I would like to import the NA values in that column as missing, if that's possible. I need the entire column to remain numeric, as I'm planning to do some calculations with the data in that column further down the line.

Thanks for your help!

If you wish to continue using Proc IMPORT then you will need to ensure the columns are like-typed. In your case you know column_k should be numeric, so a DATA step can convert the character values to numeric using the INPUT function.

proc import … out = dataA;
proc import … out = dataB;

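data dataA;
  set dataA;
  /* ?? suppresses the invalid-data notes for values such as NA, which become missing (.) */
  _num = input(column_k, ?? best12.);
  drop column_k;
  rename _num = column_k;
run;

/* If column_k was already imported as numeric in a data set, its step is unnecessary: */
/* SAS would convert the value to character and back, leaving a conversion note in the log. */
data dataB;
  set dataB;
  _num = input(column_k, ?? best12.);
  drop column_k;
  rename _num = column_k;
run;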

data want;
  set dataA dataB;
run;
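
Before or after the conversion, it can help to confirm how each import typed the column. A minimal sketch, assuming both data sets live in the WORK library as in the code above; it queries the DICTIONARY.COLUMNS metadata view, where the type is reported as char or num:

```sas
/* List the type PROC IMPORT assigned to column_k in each data set. */
/* Assumes dataA and dataB are in WORK; libname and memname are stored in upper case. */
proc sql;
  select memname, name, type, length
    from dictionary.columns
    where libname = 'WORK'
      and memname in ('DATAA', 'DATAB')
      and upcase(name) = 'COLUMN_K';
quit;
```

If the two rows show different types, the SET statement will fail with the "defined as both character and numeric" error until one side is converted.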

More broadly, mismatched data types for the same column name often show up when dealing with multi-year imports.

Suppose the older data can't be re-read and the newer data has a different column type.

If numeric values are wanted, one approach is a macro that writes source code to convert the specified variables from character to numeric when necessary.

Example:

%enforce_num (perm.loans2015, age amount remaining, out=work.loans2015)
%enforce_num (perm.loans2016, age amount remaining, out=work.loans2016)
%enforce_num (perm.loans2017, age amount remaining, out=work.loans2017)

data loans_3yrs; 
  set work.loans2015-loans2017;
run;

Going back to your simpler case:

proc import … out = dataA;
proc import … out = dataB;

%enforce_num(dataA, column_k)
%enforce_num(dataB, column_k)

data want;
  set dataA dataB;
run;

What would the macro enforce_num look like? It would have to:

  • scan the input data set metadata
  • determine whether a variable is one of those specified and is of character type
    • write source code to convert the variable to numeric
    • maintain the original variable order

%macro enforce_num(data, vars, out=&data);

  /*
   * Arguments:
   *   data - name of input data set
   *   vars - space separated list of variables that must be numeric, convert type if necessary
   *   out  - name of output data set, default same as input data set
   *
   * Output:
   *   - Unchanged data set if data and out are the same and no conversion needed
   *   - Changed data set if some columns in data need conversion to numeric
   *     - replaces data if out is same as data
   *     - replaces out if out is different than data
   *     - the column order of the changed data set will be the same as the original data set
   */

  %local dsid index index2 vars varname vartype varnames debug;

  %let index2 = 0;  %* number of variables determined to be requiring conversion;
  %let debug = 0;

  %if &debug %then %put NOTE: &SYSMACRONAME: data=%superq(data);

  %let dsid = %sysfunc(open(&data));
  %if &dsid %then %do;
    %do index = 1 %to %sysfunc(attrn(&dsid, nvars));
      %let varname = %sysfunc(varname(&dsid, &index));

      %let varnames = &varnames &varname;

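      %* INDEXW(source, excerpt): look for the variable name as a whole word in the list, ignoring case;
      %if %sysfunc(indexw(%upcase(&vars), %upcase(&varname))) %then %do;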
        %if C = %sysfunc(vartype(&dsid, &index)) %then %do;
          %* Data contains character variable requiring enforcement;
          %let index2 = %eval(&index2+1);
          %local convert&index2;
          %let convert&index2 = &varname;

          %let varnames = &varnames ___&index2 ;   %* Variables that will be converted will be named ___<#> during conversion;
        %end;
      %end;
    %end;
    %let dsid = %sysfunc(close(&dsid));
  %end;
  %else %do;
    %* The data set could not be opened: report why and stop;
    %put %sysfunc(sysmsg());
    %return;
  %end;

  %*put NOTE: &=vars;
  %*put NOTE: &=varnames;

  %if &index2 = 0 %then %do;
    %* No columns need to be converted to numeric, copy to out if necessary;
    %if &data ne &out %then %do;
      data &out;
        set &data;
      run;
    %end;
    %return;
  %end;

  %* Some columns need to be converted to numeric;
  %* Ensure the converted column is at the same position (varnum) as in the original data set;

  data &out;
    retain &varnames;

    set &data;

    %do index = 1 %to &index2;
      ___&index = input(&&convert&index,?? best12.);
    %end;

    drop
      %do index = 1 %to &index2;
        &&convert&index
      %end;
    ;

    rename
      %do index = 1 %to &index2;
        ___&index = &&convert&index
      %end;
    ;
  run;

  %put NOTE: ------------------------------------------------;
  %put NOTE: &data has been subjected to numeric enforcement.;
  %put NOTE: ------------------------------------------------;
%mend enforce_num;

proc import is a guessing procedure: it infers each column's type by examining only the first rows of data (20 by default for delimited files, controlled by GUESSINGROWS=). This is a problem because cells in a CSV or Excel file do not enforce one data type per column. A column can have text, date, datetime and numeric values in different cells.

So it is better to use an infile statement with explicitly specified variable types:

filename input 'filepath/datasetA.csv';

data dataA;
   /* dsd: fields are comma-delimited; firstobs=2: start at the second line, skipping the header */
   infile input dsd truncover firstobs=2;
   /* List every column of the file here, in order. A numeric variable reads NA as missing (.), */
   /* with an "Invalid data" note in the log. To read column_k as character instead, */
   /* use: input column_k :$100.; */
   input column_k;
run;

filename input clear;

Input (CSV file):

+----------+
| column_k |
+----------+
| 0        |
| 1        |
| 5        |
| 4        |
| NA       |
| NA       |
| 4        |
| 3        |
| NA       |
+----------+

Output (SAS data set dataA):

+----------+
| column_k |
+----------+
|        0 |
|        1 |
|        5 |
|        4 |
|        . |
|        . |
|        4 |
|        3 |
|        . |
+----------+
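
Reading NA through a plain numeric variable gives the right result but fills the log with invalid-data notes. One possible way around that is a user-defined informat that maps the text NA to a missing value. The sketch below is an illustration rather than part of the original answer: the informat name na_num is made up, and the file is assumed to hold only column_k, as in the example above.

```sas
/* na_num is a hypothetical informat name, not part of the original answer. */
proc format;
  invalue na_num (upcase default=32)
    'NA'  = .        /* the text NA (any case) becomes a numeric missing value */
    other = _same_;  /* anything else is read as an ordinary number */
run;

data dataA;
  infile 'filepath/datasetA.csv' dsd truncover firstobs=2;
  /* The colon modifier applies the informat to each delimited field. */
  input column_k : na_num.;
run;
```

Applying the same informat when reading datasetB.csv keeps column_k numeric in both data sets, so the later SET statement should no longer see conflicting types.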
