
data file in horizontal format containing hidden characters

I have been provided a data file in a format I have never seen. The data do not appear to be in columns, but rather in one long row. I can open the file in Notepad and see the data. So, the data do not appear to be encrypted.

When I open the data file in Notepad, the row of data wraps back to the left side of the Notepad window, I guess when the data reach the maximum number of characters that Notepad allows in a single row, and then the data continue in a new row.

There might be 10,000 rows of data when I open the file in Notepad. The data in one of these rows are not aligned with the data in the row above it or below it.

Here are some example data:

40001       1    5 GGGG  2998 HHHH SU111111       95     1.0 F1  4                1304    3        0               0
40001       1    5 GGGG  2998 HHHH SU111111       95     1.0 F1  4                0205             0     3         0
40001       1    5 GGGG  2998 HURG SU111111       95     1.0 F1  4                0805             0     2         0
40001       1    5 GGGG  2998 HHHH SU111111       95     1.0 F1  4                1205             0     2         0
40001       1    5 GGGG  2998 HHHH SU111111       95     1.0 F1  4                1505             0               0
40002       2    8 GGGG  2998 PPPP SK777777     -999     1.0 F3  4                2003             0               0
40002       2    8 GGGG  2998 PPPP SK777777     -999     1.0 F3  4                2303    2        0               0
40002       2    8 GGGG  2998 PPPP SK777777     -999     1.0 F3  4                2703    3        0               0
40002       2    8 GGGG  2998 PPPP SK777777     -999  

Notice that when I paste the example data here, representing one row in Notepad, the columns are 'magically' aligned.

I have found that I can open the data file in Excel and the data are also aligned. I do need to manually assign column boundaries in Excel however. And Excel does not allow me to assign a column boundary beyond more-or-less Character Space 123.

Below is SAS code to read the data file, although this SAS code does not work correctly. Rather I guess this SAS code skips some of the data rows. Notice that the variable TT covers character spaces 125-207, but that there are only 120 characters in most rows. There are more than 120 characters in some rows. This difference in the number of characters among rows I suspect is the reason SAS cannot read this data file correctly.

option linesize = 210 ;
option pagesize =  30 ;

FILENAME myinput  'C:/Users/markm/simple SAS programs/mydata.new' ;

DATA mydata ;

INFILE myinput ;

INPUT

AA       2-9
BB      12-17
CC      18-22
DD   $  24-27
EE      30-33
FF   $  35-38
GG   $  40-47
HH      53-56
II      59-64
JJ   $  66-68
KK   $  70-71
LL      72-78
MM      79-85
NN   $  87-90
OO      91-95
PP     97-104
QQ    105-110
RR    112-120
SS $  122-123
TT $  125-207 ;

If I move the cursor to the right one character at a time over the first row of data using the right-arrow key, I have to press the right-arrow key twice to move beyond character space 120 in Notepad.

All of this is telling me there are hidden characters in the data file used to identify the end of a line of data.

I opened the data file in Vim hoping to see these hidden characters, but did not see anything. Vim did align the columns correctly when I opened the file. So, Vim must be seeing these hidden end-of-line characters.

How can I see these end-of-line characters myself? I suspect there is an option in Vim to reveal the hidden characters.

How can I determine the application that created this data file?

How can I modify the above SAS code to read this data file correctly?

First off, double check your LRECL. You're missing basically half of your data, which makes me think you're reading in two lines for each line. You show 207 as your maximum line size, which should be under the default 256 LRECL, but seeing a number about 1/2 of the correct number makes me think you've made a mistake there.

Next, figure out whether you are seeing basically every other line, or the first 44k lines and then a sudden stop. If the latter, you have a DOS EOF character (1A) in the data, and you need to set the IGNOREDOSEOF option. If the former, then you have either an obvious LRECL problem as above, or a nonobvious LRECL problem caused by Unicode characters taking up multiple bytes (try LRECL=32767 and see if that fixes it; this would also cause your data to look funny at some point in each line), or a weird line terminator problem (though an inconsistent one).
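To tell those two cases apart it can help to look at the raw bytes outside SAS. Here is a minimal Python sketch, assuming the file fits in memory. The name `summarize_records` and the sample bytes are made up for illustration; they are not from the question.

```python
from collections import Counter


def summarize_records(data: bytes) -> dict:
    """Tally line lengths and locate the first DOS EOF (0x1A) byte."""
    eof_offset = data.find(b"\x1a")      # -1 when no 0x1A byte is present
    lines = data.split(b"\n")
    if lines and lines[-1] == b"":       # drop the empty piece after a final LF
        lines.pop()
    # Length of each line without a trailing CR, so CRLF and LF files compare equal.
    lengths = Counter(len(line.rstrip(b"\r")) for line in lines)
    return {
        "eof_offset": eof_offset,
        "line_count": len(lines),
        "lengths": dict(lengths),
    }


# Illustrative bytes only; for a real file use open(path, "rb").read().
sample = b"AAAA\r\nBBBBBB\r\nCCCC\r\n"
print(summarize_records(sample))
```

A non-negative `eof_offset` points at the "sudden stop" case. A `lengths` table with several distinct values suggests that some lines are shorter than the last column the INPUT statement asks for.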

Then, assuming there is a problem with EOL characters (or EOF?), the way you approach this is to see exactly what is in your datafile.

Read in a dummy character, and then put the _infile_ line with hex. format. For example:

data test;
    infile "d:\temp\utf8.txt" lrecl=256 RECFM=f;
    input @1 x $1. @;
    r = repeat('1234567890',8); *make this appropriate for your LS option in your log;
    put r;
    put _infile_;
    put _infile_ hex512.;
    stop; *we want to see just one line here;
run;

In this example I'm reading 256-byte records and using hex512., as the hex width needs to be exactly double the record length. You can leave the length off (hex.) but you'll get some really long lines with tons of blanks if you do that. In your case, with lrecl=207, you should use hex414. in theory (but you might want to make your lrecl 256 and use hex512. just in case). Since we're using RECFM=F, the idea is to have an LRECL longer than your real line length, so you can see a whole line in one run of this. (If one line doesn't tell you enough, use firstobs= to navigate to a later line, recognizing that if your LRECL is not exactly right for the data, you won't be skipping to the start of a true line, but skipping 256-byte chunks.)

That will give you two strings, one the 'visible' string, which may be helpful for seeing what SAS thinks is at what spot, one the hex codes behind the visible string. The hex codes are 2 values per character (as one byte = 2 hex values), assuming you're in an ASCII environment (not a DBCS or Unicode environment). See this page for a list of ASCII codes.
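If you would rather look at the bytes outside SAS, the same "visible string plus hex codes" view can be sketched in a few lines of Python. This is only an illustration: `hex_dump` is a made-up helper, and the sample bytes mimic the end of one line from the question rather than coming from the real file.

```python
def hex_dump(data: bytes, width: int = 16) -> list:
    """Return rows of 'offset  hex bytes  printable text' for the given bytes."""
    rows = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        # Two hex digits per byte, as in the SAS hex output.
        hex_part = " ".join(f"{b:02X}" for b in chunk)
        # Show printable ASCII as-is and everything else (CR, LF, 1A, ...) as a dot.
        text_part = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        rows.append(f"{offset:06d}  {hex_part:<{width * 3 - 1}}  {text_part}")
    return rows


# Illustrative bytes; for a real file use open(path, "rb").read(512) instead.
for row in hex_dump(b"F1  4\r\n40001", width=8):
    print(row)
```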

Hex codes to look for:

  • 1A = DOS EOF character.
  • 0A = LF
  • 0D = CR

If this is a Windows/DOS document, you should see CRLF consecutively at ends of lines, i.e., 0D0A in a row, somewhere around 207. If this is a Unix document, you will see just 0A there. If this is a classic Mac OS (pre-OS X) document, you will see just 0D. Why would anyone want to be consistent.
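A quick way to check which terminator the file uses, and whether it is consistent, is to count each style in the raw bytes. The following is a rough Python sketch; `count_line_endings` is a hypothetical helper name and the sample bytes are invented.

```python
import re


def count_line_endings(data: bytes) -> dict:
    """Count CRLF (0D0A), bare LF (0A) and bare CR (0D) terminators."""
    counts = {"CRLF": 0, "LF": 0, "CR": 0}
    # CRLF is listed first so a 0D0A pair is counted once, not as CR plus LF.
    for match in re.finditer(rb"\r\n|\n|\r", data):
        token = match.group()
        if token == b"\r\n":
            counts["CRLF"] += 1
        elif token == b"\n":
            counts["LF"] += 1
        else:
            counts["CR"] += 1
    return counts


# Illustrative bytes with mixed terminators; use open(path, "rb").read() for a real file.
print(count_line_endings(b"a\r\nb\nc\rd\r\n"))
```

More than one non-zero count would point at the inconsistent line terminator case.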

You probably will see something, since you're getting some number of lines. (If there was no line terminator, SAS would just give up after the first line.) You are more likely to have one of the following problems:

  • This is a DBCS file, so all characters really take up more than one byte. If you see a lot of 00 or 40 or 20 between characters (like, every single character has one), you have a DBCS (double byte character set) file - this is what, say, a Chinese or Japanese copy of Windows OS would likely produce. They use two bytes for every character in order to represent the full set of characters in their languages; but even when storing English documents, they still use the full set - just adding a filler byte basically to still have reasonable ASCII appearance for noncompatible programs (or programs not set up properly, like SAS would be in this case).
  • This is a UTF-8 file, where characters may take multiple bytes (but may not). In this case you probably see some 'junk' in the data when viewing it this way, and every so often you get a character that takes up two or three spaces - often entirely full of 'junk' characters. UTF-8 takes between 1 and 4 bytes per character, but will look 'normal' for ASCII characters (i.e., it takes ASCII and adds a lot, leaving the 00-7F range unchanged).
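The two cases above can be roughly separated by a few byte-level checks. What follows is a heuristic Python sketch, not a real encoding detector: the 30% threshold for 00 bytes and the name `guess_encoding` are my own assumptions.

```python
import codecs


def guess_encoding(data: bytes) -> str:
    """Very rough guess at how the bytes are encoded."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8 with BOM"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16 with BOM"
    # Assumed threshold: a filler 00 byte after most characters means two bytes per character.
    if data and data.count(b"\x00") / len(data) > 0.3:
        return "probably utf-16 or another two-byte encoding"
    if all(b < 0x80 for b in data):
        return "plain ascii"
    try:
        data.decode("utf-8")
        return "valid utf-8 with multi-byte characters"
    except UnicodeDecodeError:
        return "single-byte code page or DBCS (not valid utf-8)"


# Illustrative bytes; use open(path, "rb").read() for a real file.
print(guess_encoding(b"40001  GGGG\r\n"))
```

If this reports plain ASCII, neither multi-byte explanation applies, and the line terminators or line lengths are the more likely culprit.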

My gut is that you have a DBCS file, given you're skipping every other line roughly (though not exactly - and you are skipping MORE than that - which makes this a bit odd to me).

Here is how to see the hidden end-of-line characters in gVim 7.4 :

  1. Open gVim 7.4

  2. Open the data file in gVim 7.4

  3. Press the escape key a few times to access the line editor. Note that pressing the escape key will result in no visible result on the gVim 7.4 window.

  4. Type :set list at the bottom of the gVim 7.4 window

  5. Press the enter key

Once I did the above I saw a blue $ at the end of every line, which I assume is an end-of-line hidden character.

Maybe if I am able to remove these blue $ symbols and save the result under a new name SAS might be able to read that new data file. If I figure this out I will post an update.

EDIT

I tried to modify the instructions posted here by John Black to remove the $, but so far have had no luck: Read csv file with hidden or invisible character ^M

I typed :%s/$//g which replaced the blue $ with yellow $. Then I saved the file under a new name and opened the new file with gVim. But when I typed :set list the blue $ were still present in the new file.
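As far as I can tell, the $ that :set list shows is only Vim's marker for the end of each line, not a character stored in the file, which would explain why it cannot be substituted away. An alternative is to rewrite the file so that every line has the same terminator and the same length. Below is a minimal Python sketch, assuming that padding short lines with blanks out to column 207 (the last column in the INPUT statement) is acceptable. The name `pad_records` is made up.

```python
def pad_records(data: bytes, width: int = 207) -> bytes:
    """Pad every line with blanks to `width` bytes and end each one with CRLF."""
    # bytes.splitlines() splits on CRLF, LF and CR, and drops the terminators.
    lines = data.splitlines()
    # Lines already longer than `width` are left at their original length.
    return b"".join(line.ljust(width) + b"\r\n" for line in lines)


# Illustrative bytes; for a real file read with open(path, "rb") and write with open(new_path, "wb").
print(pad_records(b"AB\nCDE\r\n", width=4))
```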
