Reading raw data in SAS or R

Question

For our analysis we need to read raw data from a csv (or xls) file and convert it into a SAS dataset before we can analyse it.

The problem is that this raw data generally has two issues:
1. The ordering of the columns sometimes changes. If in an earlier period the columns arrived in the order A, then B, then C, etc., in a later period they might arrive as B, then C, then A.
2. There are foreign elements such as "#", ".", or stray letters.

At present we have to clean the raw data first, before reading it into SAS, and this takes a considerable amount of time. Is there any way to clean the data within the SAS system itself while reading it? If we can rectify the data with SAS code, it will save quite a lot of time.

Here's an example:

Period 1: I got the data in Data1.csv in this format. Column B, which is numeric, contains "#" and ".", and column C, which is also numeric, contains "g". If I import Data1.csv using either PROC IMPORT or an INFILE statement, these foreign elements in columns B and C remain. The question is how to remove them. I can use an IF statement, but there are too many possible foreign elements (e.g. instead of "#", "." and "g", I might get "$", "h", etc.). Is there code that detects and removes foreign elements without my having to specify each one in an IF statement every time I import the raw data into SAS?

   A    B   C
Name1   1   5
Name2   2   6
Name3   3   4
Name4   #   g
Name5   5   3
Name6   .   6

Period 2: In this period I got DATA2.csv, which is shown below. When I use the INFILE statement, I specify that A is read first under a specific name, then B under a specific name, and then C. In the second period B comes first in the file, so SAS reads the values of B where it expects A. I therefore have to check the variable ordering against the previous period's data every time and correct it before reading the data with the INFILE statement. Since the number of variables is very large, verifying the column ordering in this fashion is very time consuming (and at times frustrating). Is there SAS code with which SAS will automatically read A, then B, then C, even when the file is not in that order?

B   A   C
1   Name1   5
2   Name2   6
3   Name3   4
#   Name4   g
5   Name5   3
.   Name6   6

I mainly use SAS for my analysis, but I can use R to clean the data and then read the result into SAS for further analysis. So R code would also be helpful.

Thanks.

Answer 1

In R you can increase the speed of file reading by specifying the class of each column. With the example provided (3 columns, with the middle one being "character"), you might use this code:

 dat <- read.csv( filename, colClasses=c("numeric", "character", "numeric"), comment.char="")

Setting comment.char="" makes sure that "#" is not treated as a comment character (this is already the default for read.csv(), although read.table() defaults to "#"). Be aware that a stray "#" or "." in a column declared "numeric" is not converted silently: read.csv() stops with a scan() error unless the string is listed in na.strings. To have the "#" and "." entries read as NA (NA_character_ in character columns), you could use this code:

dat <- read.csv( filename, 
                 colClasses=c("numeric", "character", "numeric"),
                 comment.char="",
                 na.strings=c("NA", ".", "#") )
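
The call above can be tried on a small inline sample. This is only a sketch: it uses the text argument of read.csv() in place of a file, and sample_csv is an illustrative stand-in for DATA2.csv. The "g" entry is left out because it is not listed in na.strings.

```r
# Illustrative stand-in for DATA2.csv (assumption: comma-separated, header row).
sample_csv <- "B,A,C
1,Name1,5
#,Name4,4
.,Name6,6"

dat <- read.csv(text = sample_csv,
                colClasses = c("numeric", "character", "numeric"),
                comment.char = "",
                na.strings = c("NA", ".", "#"))

is.na(dat$B)   # shows which rows of B were read as missing
str(dat)       # B and C are numeric, A is character
```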

By default read.csv() assumes header=TRUE, but if you used read.table() you would need to set header=TRUE explicitly for the two file structures you showed. There is further documentation, with worked examples, on reading Excel data. However, my advice is to do as you are planning and use CSV transfer. You will see the screwy things Excel does with dates and missing values more quickly that way. You would be well advised to change the date formats to a custom "yyyy-mm-dd" in agreement with the ISO 8601 standard, in which case you can also specify a "Date" classed column and skip the process of turning character columns in the default Excel formats (all of which are bad) into dates.
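
Both issues in the question (changing column order and arbitrary foreign elements) can be handled together in R without listing every bad string in advance. The following is a minimal sketch, not a definitive implementation: read_clean and its arguments char_cols and num_cols are hypothetical names, and it assumes the files always carry a header row with the names A, B and C.

```r
# Hypothetical helper: read a CSV whose column order may vary and whose
# numeric columns may contain stray entries such as "#", "." or "g".
read_clean <- function(file, char_cols = "A", num_cols = c("B", "C")) {
  # Read every column as character so stray entries cannot abort the import.
  raw <- read.csv(file, colClasses = "character", comment.char = "",
                  strip.white = TRUE)
  # Select columns by name; the order in the file no longer matters.
  # A missing column raises an "undefined columns selected" error.
  out <- raw[, c(char_cols, num_cols), drop = FALSE]
  # Anything that does not parse as a number becomes NA.
  out[num_cols] <- lapply(out[num_cols],
                          function(x) suppressWarnings(as.numeric(x)))
  out
}

# Usage on a temporary file laid out like DATA2.csv (columns B, A, C).
tmp <- tempfile(fileext = ".csv")
writeLines(c("B,A,C", "1,Name1,5", "#,Name4,g", ".,Name6,6"), tmp)
dat <- read_clean(tmp)
str(dat)
```

The cleaned data frame can then be written back out for SAS, for example with write.csv(dat, "clean.csv", row.names = FALSE, na = ""), where "clean.csv" is a placeholder name. Reading everything as character first trades some speed for robustness, which seems the right trade when the set of foreign elements is not known in advance.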

Answer 2

Yes, you can use SAS to do any kind of "data cleaning" you might imagine. The SAS DATA step language is full of features for things like this, but there is no magic bullet; you need to write the code yourself.

A csv file is just a plain text file (very different from an xls file). Normally the first row in a csv file contains column names and the data begins with the second row. If you use PROC IMPORT, SAS will use the first row to construct variable names and try to determine data types by scanning the first several rows of the file. For example:

proc import datafile='c:\temp\somefile.csv'
     out=SASdata
     dbms=csv replace;
run;

Alternatively, you can read the file with a data step. This requires that you know the file layout in advance. For example:

data SASdata;
   infile 'c:\temp\somefile.csv' dsd firstobs=2 lrecl=32767 truncover;
   informat A $50.; /* A character variable with max length 50 */
   informat B yymmdd10.; /* A date presented like 2012-08-25 */
   informat C dollar12.; /* A number containing dollar sign, commas, or decimals */

   input A B C;  /* The order of the variables in the file */

   if B = . then B = today(); /* A possible data cleaning statement */
run;

Note that the INPUT statement must list the variables in the order in which they appear in the file. The point is that the code you use must match the layout of each file you process.

These are just general comments. If you encounter problems, post back with a more specific question.

UPDATE FOR UPDATED QUESTION: The variables from the raw data file must be listed in the INPUT statement in the same order as they exist in each file. Also, you need to define the column types directly and establish whatever rules they need to follow. The INPUT statement will not work this out automatically; each file must be treated separately.

In this case, let's assume your variables are A, B, and C, where A is character and B and C are numbers. This program might process both files and add them to a history dataset (let's say ALLDATA):

data temp;
   infile 'c:\temp\data1.csv' dsd firstobs=2 lrecl=32767 truncover;
   /* Define dataset variables */
   informat A $50.;
   informat B 12.;
   informat C 12.;
   /* Add a KEEP statement to keep only the variables you want */
   keep A B C;

   input A B C;
run;
proc append base=ALLDATA data=temp;
run;
data temp;
   infile 'c:\temp\data2.csv' dsd firstobs=2 lrecl=32767 truncover;
   informat A $50.;
   informat B 12.;
   informat C 12.;

   input B A C;
run;
proc append base=ALLDATA data=temp;
run;

Notice that the "data definition" part of each data step is the same; the only difference is the order of variables listed in the INPUT statement. Notice also that because the variables B and C are defined as numeric, when invalid characters such as # and g are read, the values are stored as missing values.

In your case, I'd create a template SAS program that defines all the variables you want in the order you expect them to be. Then use that template to import each file, listing the variables in the order they have in that file. Setting up the template program might take a while, but to run it you would only need to modify the INPUT statement.
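
Since the question allows R for preparation, the INPUT statement for each file could be generated from the file's header row rather than edited by hand. This is a minimal sketch under stated assumptions: make_input_statement is a hypothetical helper, the header is assumed to be a comma-separated first line, and the column names are assumed to be valid SAS variable names already.

```r
# Hypothetical helper: build the SAS INPUT statement from a CSV header row.
make_input_statement <- function(file) {
  header <- readLines(file, n = 1)
  vars <- strsplit(header, ",", fixed = TRUE)[[1]]
  # Drop surrounding whitespace and optional double quotes around names.
  vars <- gsub('^"|"$', "", trimws(vars))
  paste0("input ", paste(vars, collapse = " "), ";")
}

# Usage on a temporary file with the DATA2.csv column order.
tmp2 <- tempfile(fileext = ".csv")
writeLines(c("B,A,C", "1,Name1,5"), tmp2)
stmt <- make_input_statement(tmp2)
cat(stmt, "\n")
```

The generated line can be pasted into the template program in place of the hand-written INPUT statement; the INFORMAT definitions stay the same for every file.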
