
Reading multiple lines of text data into a single entry

I have seen problems very similar to mine but could not figure out what is wrong in my case. I am trying to read a pipe-delimited text file in which some of the entries span two lines, as shown below. The variables are ID, OCC, DESCRIPTION, and V1-V19, where V1 through V19 are different variables. When I run the code without V1-V19 it works like a charm, but when I add them, even the 'test' that checks whether the next line is numeric stops working.

Here is my code:

data sample;

infile myfile  dlm='|' dsd  end=endin ;
length schedule $12 desc $300;
input @1 schedule $ occ $ desc $  v1-v15 $ v16 $ v17-v19;
input @@;
test= notdigit(substrn(left(_infile_),1,1));
If test then do;
  desc=catx(' ',desc,_infile_);
  input;
end;
run;

And here is the test dataset:

9450007|23023|Reporter||33100|||||1|||||||||D||49|1
9451007|23023|Reporter||43086||||||1||||||||E||50|1
9462034|11021|Manager 
Oversee all operations, report to board of director|||||||||12|||||||F||1|12
9460034|43061|Office Assistant
Schedule client appointments, enter visit|||||10|||||||||||B||3|10
9451002|24011|Engineer, Market Exempt||86353||||||||1||||||G||28|1
9450002|12021|Market President/Chief Revenue
Officer||135000||||||||||1||||I||29|1 
9460027|11111|Mgr, Emergency Care||131248||||||||||1||||I||208|1

The problem is not intentional multi-line data. Instead, it looks like the value of one of the fields sometimes contains end-of-line characters. To fix this you need a way to remove (or replace) those characters. Since the extra end-of-line characters never seem to appear in the first or last field, this should be easy to fix: count the number of fields you have seen, and write the end-of-line characters only when the count reaches the number of fields you expect per line.
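The field counting described above can be tried in isolation first. The following is a minimal sketch, not part of the original answer: the variable names line, n_complete and n_partial are made up for illustration, and the two strings are copied from the sample data in the question.

```sas
data _null_;
  length line $200;

  /* a complete record copied from the sample data */
  line = '9450007|23023|Reporter||33100|||||1|||||||||D||49|1';
  /* 'm' counts consecutive delimiters as empty fields, 'q' ignores delimiters inside quotes */
  n_complete = countw(line, '|', 'mq');

  /* the first physical line of a record that was split inside the description */
  line = '9462034|11021|Manager';
  n_partial = countw(line, '|', 'mq');

  putlog n_complete= n_partial=;
run;
```

The log should show n_complete=22 and n_partial=3. Note that a field split across two physical lines is counted once on each line, so the running total for a split record comes out above 22. A comparison with >= 22 rather than = 22 is therefore the safer test.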

So assume you have a fileref named TXT that points to the original file. Here is code to convert it into a new file with the extra end-of-line characters removed. It expects 22 fields per line.

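filename fix temp;
data _null_;
  infile txt ;
  file fix;
  input ;                              /* load the next physical line into _infile_ */
  nwords=countw(_infile_,'|','mq');    /* count fields; consecutive delimiters count as empty fields */
  putlog _n_= nwords= _infile_;
  put _infile_ @;                      /* write the line and hold the output line open */
  total+nwords;                        /* running field count for the current logical record */
  if total>=22 then do; put ; total=0; end;   /* record complete: end the output line */
  else put ' ' @;                      /* record incomplete: add a space and keep the line open */
run;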

We can read the lines back in to check that it worked.

492   data _null_;
493     infile fix;
494     input ;
495     nwords=countw(_infile_,'|','mq');
496     putlog _n_= nwords= _infile_;
497   run;

NOTE: The infile FIX is:
      Filename=...\#LN00081,
      RECFM=V,LRECL=32767,File Size (bytes)=523,
      Last Modified=04Apr2020:11:29:46,
      Create Time=04Apr2020:11:29:46

_N_=1 nwords=22 9450007|23023|Reporter||33100|||||1|||||||||D||49|1
_N_=2 nwords=22 9451007|23023|Reporter||43086||||||1||||||||E||50|1
_N_=3 nwords=22 9462034|11021|Manager Oversee all operations, report to board of director|||||||||12|||||||F||1|12
_N_=4 nwords=22 9460034|43061|Office Assistant Schedule client appointments, enter visit|||||10|||||||||||B||3|10
_N_=5 nwords=22 9451002|24011|Engineer, Market Exempt||86353||||||||1||||||G||28|1
_N_=6 nwords=22 9450002|12021|Market President/Chief Revenue Officer||135000||||||||||1||||I||29|1
_N_=7 nwords=22 9460027|11111|Mgr, Emergency Care||131248||||||||||1||||I||208|1
NOTE: 7 records were read from the infile FIX.
      The minimum record length was 51.
      The maximum record length was 98.

Now you can read the file normally:

data want;
  infile fix dsd dlm='|' truncover;
  input schedule :$12. occ :$8. desc :$300. (v1-v16) (:$8.) v17-v19;
run;

For robustly delimited data that is unfortunately split within a data value or delimiter, the partial lines can be concatenated into a retained held line. When the held line contains the expected number of delimiters (21 |), it can be pushed back into the input buffer, and an input statement can then read from it.

Example:
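The delimiter test can be checked on its own before running a full step. This is a minimal sketch, not taken from the answer: part1, part2, n_before and n_after are illustrative names, and the two strings are the halves of one split record from the sample data.

```sas
data _null_;
  length part1 part2 heldline $256;

  /* the two physical lines that make up one split record */
  part1 = '9462034|11021|Manager';
  part2 = 'Oversee all operations, report to board of director|||||||||12|||||||F||1|12';

  /* hold the first part and count its delimiters */
  heldline = part1;
  n_before = countc(heldline, '|');

  /* heal the split by appending the next line, then count again */
  heldline = catx(' ', heldline, part2);
  n_after = countc(heldline, '|');

  putlog n_before= n_after=;
run;
```

The first count should fall short of 21 and the second should equal 21, which is the point at which the held line is complete and can be handed to the input statement. Counting delimiters instead of fields avoids counting the split description twice.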

* create data file for example code;

filename the_data temp;

data _null_;
file the_data;
input;
put _infile_;
datalines;
9450007|23023|Reporter||33100|||||1|||||||||D||49|1
9451007|23023|Reporter||43086||||||1||||||||E||50|1
9462034|11021|Manager 
Oversee all operations, report to board of director|||||||||12|||||||F||1|12
9460034|43061|Office Assistant
Schedule client appointments, enter visit|||||10|||||||||||B||3|10
9451002|24011|Engineer, Market Exempt||86353||||||||1||||||G||28|1
9450002|12021|Market President/Chief Revenue
Officer||135000||||||||||1||||I||29|1 
9460027|11111|Mgr, Emergency Care||131248||||||||||1||||I||208|1
;

* manipulate _infile_ associated with external file in order to
* use input statement for data line, and field, split over multiple lines;

data want;
  length schedule $12 occ $8 desc $300;
  length v1-v16 $8 v17-v19 8;
  length heldline $256;

  infile the_data dlm='|' dsd missover;

  * read data file line into input buffer and _infile_;
  input @;

  if missing(heldline) then 
    /* first line or data from prior 'data' was output */
    /* copy the input buffer to a retained variable (in case the 'data' is split/incomplete) */
    heldline = _infile_;      
  else
    /* heal the split */
    heldline = catx (' ', heldline, _infile_);

  putlog // _N_ / '# ' _infile_ / '* ' heldline;

  retain heldline;

  if countc(heldline,'|') = 21 then do;
    _infile_ = heldline;                    /* proper number of delimiters, push heldline into input buffer */
    input @1 schedule: occ: desc: v1-v19;   /* read 'data' */
    OUTPUT;
    heldline = '';                          /* clear heldline */
  end;

  drop heldline;
  format v1-v16 $4. v17-v19 4. v2 $6.;
run;

Note:

The example will NOT work when the data is read from DATALINES. A DATA step that assigns a value to the DATALINES infile will cause the input buffer to be truncated to 80 characters.
