
Reading multiple lines of text data into a single entry

I have seen very similar problems to mine but could not figure out what is wrong in my case. I am trying to read a pipe-delimited text file in which some of the entries span two lines, as shown below. The variables are ID, OCC, DESCRIPTION, and V1-V19, where V1 through V19 are different variables. When I run the code without V1-V19 it works like a charm, but when I add them, even the 'test' that checks whether the next line starts with a digit stops working.

Here is my code:

data sample;

infile myfile  dlm='|' dsd  end=endin ;
length schedule $12 desc $300;
input @1 schedule $ occ $ desc $  v1-v15 $ v16 $ v17-v19;
input @@;
test= notdigit(substrn(left(_infile_),1,1));
If test then do;
  desc=catx(' ',desc,_infile_);
  input;
end;
run;

And here is the test dataset:

9450007|23023|Reporter||33100|||||1|||||||||D||49|1
9451007|23023|Reporter||43086||||||1||||||||E||50|1
9462034|11021|Manager 
Oversee all operations, report to board of director|||||||||12|||||||F||1|12
9460034|43061|Office Assistant
Schedule client appointments, enter visit|||||10|||||||||||B||3|10
9451002|24011|Engineer, Market Exempt||86353||||||||1||||||G||28|1
9450002|12021|Market President/Chief Revenue
Officer||135000||||||||||1||||I||29|1 
9460027|11111|Mgr, Emergency Care||131248||||||||||1||||I||208|1

This is not intentional multi-line data. Instead, it looks like the value of one of the fields sometimes contains end-of-line characters, so you need a way to remove (or replace) them. Since the extra end-of-line characters never appear to fall in the first or last field, this is easy to fix: count the fields you have seen so far and write the end-of-line only once the count reaches the number of fields you expect per record.

Assume you have a fileref named TXT that points to the original file. The following code converts it into a new file with the extra end-of-line characters removed. It expects 22 fields per record.

filename fix temp;
data _null_;
  infile txt ;
  file fix;
  input ;
  nwords=countw(_infile_,'|','mq');
  putlog _n_= nwords= _infile_;
  put _infile_ @;
  total+nwords;
  if total>=22 then do; put ; total=0; end;
  else put ' ' @;
run;
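
To see why the code above tests total>=22 rather than total=22, it may help to count the fields in the two physical lines of one split record from the sample data. This is only a small sketch; the variable names piece, n1, n2, and total are made up for illustration.

```sas
data _null_;
  length piece $200;

  /* first physical line of a split record: ends inside the description field */
  piece = '9462034|11021|Manager ';
  n1 = countw(piece, '|', 'mq');

  /* second physical line: starts with the rest of the description field */
  piece = 'Oversee all operations, report to board of director|||||||||12|||||||F||1|12';
  n2 = countw(piece, '|', 'mq');

  /* the split field is counted once in each piece */
  total = n1 + n2;
  putlog n1= n2= total=;
run;
```

The log should show n1=3 n2=20 total=23. The description field is counted in both pieces, so a record split once adds up to 23 rather than 22, and the >= comparison still ends the output line at the right place.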

We can read the lines back in to check that it worked.

492   data _null_;
493     infile fix;
494     input ;
495     nwords=countw(_infile_,'|','mq');
496     putlog _n_= nwords= _infile_;
497   run;

NOTE: The infile FIX is:
      Filename=...\#LN00081,
      RECFM=V,LRECL=32767,File Size (bytes)=523,
      Last Modified=04Apr2020:11:29:46,
      Create Time=04Apr2020:11:29:46

_N_=1 nwords=22 9450007|23023|Reporter||33100|||||1|||||||||D||49|1
_N_=2 nwords=22 9451007|23023|Reporter||43086||||||1||||||||E||50|1
_N_=3 nwords=22 9462034|11021|Manager Oversee all operations, report to board of director|||||||||12|||||||F||1|12
_N_=4 nwords=22 9460034|43061|Office Assistant Schedule client appointments, enter visit|||||10|||||||||||B||3|10
_N_=5 nwords=22 9451002|24011|Engineer, Market Exempt||86353||||||||1||||||G||28|1
_N_=6 nwords=22 9450002|12021|Market President/Chief Revenue Officer||135000||||||||||1||||I||29|1
_N_=7 nwords=22 9460027|11111|Mgr, Emergency Care||131248||||||||||1||||I||208|1
NOTE: 7 records were read from the infile FIX.
      The minimum record length was 51.
      The maximum record length was 98.

Now you can read the file normally:

data want;
  infile fix dsd dlm='|' truncover;
  input schedule :$12. occ :$8. desc :$300. (v1-v16) (:$8.) v17-v19;
run;

For data that is consistently delimited but is unfortunately split in the middle of a data value or at a delimiter, the partial lines can be concatenated into a retained "held" line. Once the held line contains the expected number of delimiters (21 | characters), it can be pushed back into the input buffer, and an INPUT statement can then read from it.

Example:

* create data file for example code;

filename the_data temp;

data _null_;
file the_data;
input;
put _infile_;
datalines;
9450007|23023|Reporter||33100|||||1|||||||||D||49|1
9451007|23023|Reporter||43086||||||1||||||||E||50|1
9462034|11021|Manager 
Oversee all operations, report to board of director|||||||||12|||||||F||1|12
9460034|43061|Office Assistant
Schedule client appointments, enter visit|||||10|||||||||||B||3|10
9451002|24011|Engineer, Market Exempt||86353||||||||1||||||G||28|1
9450002|12021|Market President/Chief Revenue
Officer||135000||||||||||1||||I||29|1 
9460027|11111|Mgr, Emergency Care||131248||||||||||1||||I||208|1
;

* manipulate _infile_ associated with external file in order to
* use input statement for data line, and field, split over multiple lines;

data want;
  length schedule $12 occ $8 desc $300;
  length v1-v16 $8 v17-v19 8;
  length heldline $256;

  infile the_data dlm='|' dsd missover;

  * read data file line into input buffer and _infile_;
  input @;

  if missing(heldline) then 
    /* first line or data from prior 'data' was output */
    /* copy the input buffer to a retained variable (in case the 'data' is split/incomplete) */
    heldline = _infile_;      
  else
    /* heal the split */
    heldline = catx (' ', heldline, _infile_);

  putlog // _N_ / '# ' _infile_ / '* ' heldline;

  retain heldline;

  if countc(heldline,'|') = 21 then do;
    _infile_ = heldline;                    /* proper number of delimiters, push heldline into input buffer */
    input @1 schedule: occ: desc: v1-v19;   /* read 'data' */
    OUTPUT;
    heldline = '';                          /* clear heldline */
  end;

  drop heldline;
  format v1-v16 $4. v17-v19 4. v2 $6.;
run;

Note:

The example will NOT work when the data is read directly from DATALINES, which is why it first writes the data to a temporary file. In a DATA step that assigns a value to _infile_ while reading from DATALINES, the input buffer is truncated to 80 characters.
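
The held-line approach only outputs a record when the delimiter count is exactly 21, so a record with too many delimiters, or a file that ends in the middle of a record, would be dropped silently. A minimal validation pass along these lines could flag such cases before the real read. It assumes the the_data fileref from the example above; the END= variable name done and the message wording are made up for illustration.

```sas
data _null_;
  infile the_data end=done;
  length heldline $256;
  retain heldline;

  input;

  /* catx skips a blank heldline, so the first piece is copied as-is */
  heldline = catx(' ', heldline, _infile_);

  if countc(heldline, '|') = 21 then
    heldline = '';                        /* complete record, start over */
  else if countc(heldline, '|') > 21 then do;
    putlog 'WARNING: too many delimiters at line ' _n_ ': ' heldline;
    heldline = '';
  end;

  /* anything still held at end of file never reached 21 delimiters */
  if done and not missing(heldline) then
    putlog 'WARNING: file ended with an incomplete record: ' heldline;
run;
```

With the sample data this pass should write no warnings, since every logical record reaches exactly 21 delimiters.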
