Reading multi-line text data into a single entry

Question

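I have seen questions very similar to mine, but I cannot figure out what is wrong in my case. I am trying to read a pipe-delimited text file in which some entries span two lines, as shown below. The variables are ID, OCC, DESCRIPTION, and V1-V19, where V1 to V19 are different variables. When I run the code without V1-V19 it works like a charm, but once I add them, even the "test" that checks whether the next line starts with a digit stops working.

Here is my code: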

data sample;

infile myfile  dlm='|' dsd  end=endin ;
length schedule $12 desc $300;
input @1 schedule $ occ $ desc $  v1-v15 $ v16 $ v17-v19;
input @@;
test= notdigit(substrn(left(_infile_),1,1));
If test then do;
  desc=catx(' ',desc,_infile_);
  input;
end;
run;

Here is the test data set:

9450007|23023|Reporter||33100|||||1|||||||||D||49|1
9451007|23023|Reporter||43086||||||1||||||||E||50|1
9462034|11021|Manager 
Oversee all operations, report to board of director|||||||||12|||||||F||1|12
9460034|43061|Office Assistant
Schedule client appointments, enter visit|||||10|||||||||||B||3|10
9451002|24011|Engineer, Market Exempt||86353||||||||1||||||G||28|1
9450002|12021|Market President/Chief Revenue
Officer||135000||||||||||1||||I||29|1 
9460027|11111|Mgr, Emergency Care||131248||||||||||1||||I||208|1

Answer 1

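The problem is not intentionally multi-line data. Rather, it looks like the value of one of the fields sometimes contains end-of-line characters. To fix this you need a way to remove (or replace) those characters. Since the extra end-of-line characters never seem to appear in the first or last field, it should be easy to fix. You can simply count the number of fields you have seen and write an end-of-line only when the count reaches the number of fields you expect per line.

So assume you have a fileref named TXT that points to the original file. Here is code that converts it into a new file with the extra end-of-line characters removed. Each line is expected to have 22 fields.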

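filename fix temp;
data _null_;
  infile txt ;
  file fix;
  input ;                             /* read one physical line into _infile_ */
  nwords=countw(_infile_,'|','mq');   /* fields on this line; 'm' also counts empty fields */
  putlog _n_= nwords= _infile_;
  put _infile_ @;                     /* write the line and hold the output line open */
  total+nwords;                       /* running field count for the current record */
  /* a field split across two lines is counted on both, hence >= rather than = */
  if total>=22 then do; put ; total=0; end;
  else put ' ' @;                     /* record incomplete: replace the line break with a space */
run;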

We can read the lines back to check that it worked.

492   data _null_;
493     infile fix;
494     input ;
495     nwords=countw(_infile_,'|','mq');
496     putlog _n_= nwords= _infile_;
497   run;

NOTE: The infile FIX is:
      Filename=...\#LN00081,
      RECFM=V,LRECL=32767,File Size (bytes)=523,
      Last Modified=04Apr2020:11:29:46,
      Create Time=04Apr2020:11:29:46

_N_=1 nwords=22 9450007|23023|Reporter||33100|||||1|||||||||D||49|1
_N_=2 nwords=22 9451007|23023|Reporter||43086||||||1||||||||E||50|1
_N_=3 nwords=22 9462034|11021|Manager Oversee all operations, report to board of director|||||||||12|||||||F||1|12
_N_=4 nwords=22 9460034|43061|Office Assistant Schedule client appointments, enter visit|||||10|||||||||||B||3|10
_N_=5 nwords=22 9451002|24011|Engineer, Market Exempt||86353||||||||1||||||G||28|1
_N_=6 nwords=22 9450002|12021|Market President/Chief Revenue Officer||135000||||||||||1||||I||29|1
_N_=7 nwords=22 9460027|11111|Mgr, Emergency Care||131248||||||||||1||||I||208|1
NOTE: 7 records were read from the infile FIX.
      The minimum record length was 51.
      The maximum record length was 98.

Now the file can be read normally:

data want;
  infile fix dsd dlm='|' truncover;
  input schedule :$12. occ :$8. desc :$300. (v1-v16) (:$8.) v17-v19;
run;
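
As a quick sanity check (a sketch, not part of the original answer), the repaired observations can be listed so that each description can be seen on a single row. It assumes the WANT data set created by the step above; the choice of variables to print is arbitrary.

```sas
/* Sketch: list the repaired records; assumes WANT was created by the step above */
proc print data=want;
  var schedule occ desc v16 v18 v19;   /* arbitrary subset, chosen for readability */
run;
```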

Answer 2

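For robust handling of delimited data that is split inside a data value or at a delimiter, you can concatenate the partial lines into a retained held line. When the held line contains the expected number of delimiters (21 | characters), it can be pushed back into the input buffer, and the input statement can then read from it.

Example: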
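
The delimiter test used here and the field count used in Answer 1 are two views of the same check: a complete record with 22 fields contains 21 | characters. A minimal sketch, using the first record of the test data as a literal, writes both counts to the log. The variable names line, ndelims, and nfields are made up for illustration.

```sas
/* Sketch: compare the delimiter count with the field count for one complete record */
data _null_;
  length line $200;
  line = '9450007|23023|Reporter||33100|||||1|||||||||D||49|1';
  ndelims = countc(line, '|');        /* number of | characters */
  nfields = countw(line, '|', 'mq');  /* number of fields, empty ones included */
  putlog ndelims= nfields=;
run;
```

For this record the log should show ndelims=21 and nfields=22, which agrees with the nwords=22 reported in the log of Answer 1.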

* create data file for example code;

filename the_data temp;

data _null_;
file the_data;
input;
put _infile_;
datalines;
9450007|23023|Reporter||33100|||||1|||||||||D||49|1
9451007|23023|Reporter||43086||||||1||||||||E||50|1
9462034|11021|Manager 
Oversee all operations, report to board of director|||||||||12|||||||F||1|12
9460034|43061|Office Assistant
Schedule client appointments, enter visit|||||10|||||||||||B||3|10
9451002|24011|Engineer, Market Exempt||86353||||||||1||||||G||28|1
9450002|12021|Market President/Chief Revenue
Officer||135000||||||||||1||||I||29|1 
9460027|11111|Mgr, Emergency Care||131248||||||||||1||||I||208|1
;

* manipulate _infile_ associated with external file in order to
* use input statement for data line, and field, split over multiple lines;

data want;
  length schedule $12 occ $8 desc $300;
  length v1-v16 $8 v17-v19 8;
  length heldline $256;

  infile the_data dlm='|' dsd missover;

  * read data file line into input buffer and _infile_;
  input @;

  if missing(heldline) then 
    /* first line or data from prior 'data' was output */
    /* copy the input buffer to a retained variable (in case the 'data' is split/incomplete) */
    heldline = _infile_;      
  else
    /* heal the split */
    heldline = catx (' ', heldline, _infile_);

  putlog // _N_ / '# ' _infile_ / '* ' heldline;

  retain heldline;

  if countc(heldline,'|') = 21 then do;
    _infile_ = heldline;                    /* proper number of delimiters, push heldline into input buffer */
    input @1 schedule occ desc v1-v19;      /* read 'data'; types and lengths come from the LENGTH statements */
    OUTPUT;
    heldline = '';                          /* clear heldline */
  end;

  drop heldline;
  format v1-v16 $4. v17-v19 4. v2 $6.;
run;

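Note:

The example will not work when the data are read from DATALINES. A DATA step that assigns a value to _infile_ for a DATALINES infile causes the input buffer to be truncated to 80 characters.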
