I have been provided a data file in a format I have never seen. The data do not appear to be in columns, but rather in one long row. I can open the file in Notepad and see the data, so the data do not appear to be encrypted.
When I open the data file in Notepad, the row of data wraps back to the left side of the Notepad window, I guess when the data reach the maximum number of characters that Notepad allows in a single row, and then the data continue in a new row. There might be 10,000 rows of data when I open the file in Notepad. The data in one of these rows are not aligned with the data in the row above it or below it.
Here are some example data:
40001 1 5 GGGG 2998 HHHH SU111111 95 1.0 F1 4 1304 3 0 0
40001 1 5 GGGG 2998 HHHH SU111111 95 1.0 F1 4 0205 0 3 0
40001 1 5 GGGG 2998 HURG SU111111 95 1.0 F1 4 0805 0 2 0
40001 1 5 GGGG 2998 HHHH SU111111 95 1.0 F1 4 1205 0 2 0
40001 1 5 GGGG 2998 HHHH SU111111 95 1.0 F1 4 1505 0 0
40002 2 8 GGGG 2998 PPPP SK777777 -999 1.0 F3 4 2003 0 0
40002 2 8 GGGG 2998 PPPP SK777777 -999 1.0 F3 4 2303 2 0 0
40002 2 8 GGGG 2998 PPPP SK777777 -999 1.0 F3 4 2703 3 0 0
40002 2 8 GGGG 2998 PPPP SK777777 -999
Notice that when I paste the example data here, representing one row in Notepad, the columns are 'magically' aligned.
I have found that I can open the data file in Excel and the data are also aligned. I do need to manually assign column boundaries in Excel, however, and Excel does not allow me to assign a column boundary beyond more-or-less character space 123.
Below is SAS code to read the data file, although this SAS code does not work correctly; I guess that it skips some of the data rows. Notice that the variable TT covers character spaces 125-207, but that there are only 120 characters in most rows. There are more than 120 characters in some rows. I suspect this difference in the number of characters among rows is the reason SAS cannot read this data file correctly.
option linesize = 210 ;
option pagesize = 30 ;
FILENAME myinput 'C:/Users/markm/simple SAS programs/mydata.new' ;
DATA mydata ;
INFILE myinput ;
INPUT
AA 2-9
BB 12-17
CC 18-22
DD $ 24-27
EE 30-33
FF $ 35-38
GG $ 40-47
HH 53-56
II 59-64
JJ $ 66-68
KK $ 70-71
LL 72-78
MM 79-85
NN $ 87-90
OO 91-95
PP 97-104
QQ 105-110
RR 112-120
SS $ 122-123
TT $ 125-207 ;
If I move the cursor to the right one character at a time over the first row of data using the right-arrow key, I have to press the right-arrow key twice to move beyond character space 120 in Notepad.
All of this tells me there are hidden characters in the data file that identify the end of a line of data.
I opened the data file in Vim hoping to see these hidden characters, but did not see anything. Vim did align the columns correctly when I opened the file, so Vim must be seeing these hidden end-of-line characters.
How can I see these end-of-line characters myself? I suspect there is an option in Vim to reveal the hidden characters.
How can I determine the application that created this data file?
How can I modify the above SAS code to read this data file correctly?
First off, double check your LRECL. You're missing basically half of your data, which makes me think you're reading in two lines for each line. You show 207 as your maximum line size, which should be under the default 256 LRECL, but seeing a number about 1/2 of the correct number makes me think you've made a mistake there.
Next, figure out whether you are seeing basically every other line, or the first 44k lines and then a sudden stop. If the latter, you have a DOS EOF character (1A) in the data, and you need to set the IGNOREDOSEOF option. If the former, then you have either an obvious LRECL problem as above, or a nonobvious LRECL problem caused by Unicode characters taking up multiple bytes (try LRECL=32767 and see if that fixes it; this would also cause your data to look funny at some point in each line), or a weird line terminator problem (though an inconsistent one).
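One way to tell these cases apart outside SAS is to measure the file directly. The sketch below is Python rather than SAS, and the function name `scan_records` and its result keys are made up for illustration. It assumes the file fits in memory and that records end in LF, with or without a preceding CR. It reports the longest record in bytes, which is what LRECL has to cover, and the offset of the first 1A byte, if there is one.

```python
def scan_records(data: bytes) -> dict:
    """Summarise record lengths and look for a DOS EOF byte (0x1A)."""
    # Split on LF and drop trailing CRs so CRLF and LF files measure alike.
    records = [line.rstrip(b"\r") for line in data.split(b"\n")]
    # A file that ends with a terminator leaves one empty trailing piece.
    if records and records[-1] == b"":
        records.pop()
    lengths = [len(r) for r in records]
    return {
        "records": len(records),
        "min_len": min(lengths) if lengths else 0,
        "max_len": max(lengths) if lengths else 0,
        # -1 means no DOS EOF byte was found anywhere in the data.
        "dos_eof_offset": data.find(b"\x1a"),
    }


if __name__ == "__main__":
    # Tiny made-up sample: two CRLF records, then a stray 0x1A byte.
    sample = b"40001 1 5\r\n40002 2 8 extra\r\n\x1a40003\r\n"
    print(scan_records(sample))
```

For a real file, pass the bytes read in binary mode, for example `scan_records(open(path, "rb").read())`, where `path` is your own file. A `max_len` far above 207 would point toward a missing or unexpected line terminator rather than a plain LRECL problem.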
Then, assuming there is a problem with EOL characters (or EOF?), the way you approach this is to see exactly what is in your datafile.
Read in a dummy character, and then put the _infile_ line with the hex. format. For example:
data test;
infile "d:\temp\utf8.txt" lrecl=256 RECFM=f;
input @1 x $1. @;
r = repeat('1234567890',8); *make this appropriate for your LS option in your log;
put r;
put _infile_;
put _infile_ hex512.;
stop; *we want to see just one line here;
run;
The hex width needs to be exactly double the line length: lines 20 long would use hex40., and the example above pairs lrecl=256 with hex512. You can leave the length off (hex.), but you'll get some really long lines with tons of blanks if you do that. In your case, with lrecl=207, you should use hex414. in theory (but you might want to make your lrecl 256 and use hex512. just in case). Since we're using RECFM=F, the idea is to have an LRECL longer than your real line length, so you can see a whole line in one run of this. (If one line doesn't tell you enough, use firstobs= to navigate to a later line, recognizing that if your LRECL is not exactly right for the data, you won't be skipping to the start of a true line, but skipping in 256-byte chunks.)
That will give you two strings, one the 'visible' string, which may be helpful for seeing what SAS thinks is at what spot, one the hex codes behind the visible string. The hex codes are 2 values per character (as one byte = 2 hex values), assuming you're in an ASCII environment (not a DBCS or Unicode environment). See this page for a list of ASCII codes.
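If SAS is not handy, the same paired view of visible text and hex codes can be produced with a few lines of Python. This is only a sketch: `hex_view` is a hypothetical helper name, and it assumes single-byte ASCII data, printing a dot for anything that is not a printable ASCII character.

```python
def hex_view(data: bytes, width: int = 16) -> list:
    """Return lines showing offset, hex codes, and printable text side by side."""
    pad = width * 3 - 1  # room for `width` two-digit codes separated by spaces
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{b:02X}" for b in chunk)
        # Show printable ASCII as-is; replace everything else with a dot.
        text_part = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append(f"{offset:06d}  {hex_part:<{pad}}  {text_part}")
    return lines


if __name__ == "__main__":
    # Made-up sample record ending in CRLF.
    for line in hex_view(b"40001 1 5 GGGG\r\n40002"):
        print(line)
```

Reading only the first few hundred bytes of the file in binary mode and passing them to `hex_view` is usually enough to see what sits at the end of the first record.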
Hex codes to look for:
If this is a Windows/DOS document, you should see CRLF consecutively at the ends of lines, i.e., 0D0A in a row, somewhere around 207. If this is a Unix document, you will see just 0A there. If this is a classic Mac OS document, you will see just CR, or 0D; in rare cases a file may even have LFCR, or 0A0D. Why would anyone want to be consistent.
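Rather than hunting for these codes by eye, they can be tallied over the whole file. The following Python sketch uses a made-up function name, `count_line_endings`, and simply counts CRLF pairs, lone LF bytes, and lone CR bytes; a file that mixes more than one kind is a likely candidate for the inconsistent-terminator problem mentioned earlier.

```python
def count_line_endings(data: bytes) -> dict:
    """Count CRLF pairs, lone LF bytes, and lone CR bytes in raw file data."""
    crlf = data.count(b"\r\n")
    # Every CRLF pair contributes one CR and one LF, so subtract the pairs
    # to get the bytes that stand alone.
    lone_cr = data.count(b"\r") - crlf
    lone_lf = data.count(b"\n") - crlf
    return {"crlf": crlf, "lf": lone_lf, "cr": lone_cr}


if __name__ == "__main__":
    # Made-up sample mixing all three conventions.
    print(count_line_endings(b"a\r\nb\nc\rd\r\n"))
```

Note that an LFCR sequence shows up here as one lone LF plus one lone CR, so equal nonzero `lf` and `cr` counts with no `crlf` are worth a closer look in a hex view.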
You probably will see something, since you're getting some number of lines. (If there was no line terminator, SAS would just give up after the first line.) You are more likely to have one of the following problems:
If you see 00 or 40 or 20 between characters (like, every single character has one), you have a DBCS (double-byte character set) file. This is what, say, a Chinese or Japanese copy of Windows would likely produce. They use two bytes for every character in order to represent the full set of characters in their languages, and even when storing English documents they still use two bytes, basically adding a filler byte so the text keeps a reasonable ASCII appearance in incompatible programs (or programs not set up properly, as SAS would be in this case). My gut is that you have a DBCS file, given that you're skipping roughly every other line (though not exactly, and you are skipping MORE than that, which makes this a bit odd to me).
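A quick check for the "filler byte after every character" pattern is to compare how often the filler occurs at even and at odd byte offsets. The Python sketch below is an illustration only; `filler_byte_ratio` is a hypothetical name, and the default filler of 00 is an assumption (pass 0x20 or 0x40 to test the other candidates). A ratio near 1.0 at one parity and near 0.0 at the other suggests a two-bytes-per-character file.

```python
def filler_byte_ratio(data: bytes, filler: int = 0x00) -> tuple:
    """Return the fraction of filler bytes at even and at odd offsets."""
    even = data[0::2]  # bytes at offsets 0, 2, 4, ...
    odd = data[1::2]   # bytes at offsets 1, 3, 5, ...
    even_ratio = even.count(filler) / len(even) if even else 0.0
    odd_ratio = odd.count(filler) / len(odd) if odd else 0.0
    return even_ratio, odd_ratio


if __name__ == "__main__":
    plain = b"40001 1 5 GGGG"
    # UTF-16-LE is used here only to manufacture a two-byte sample.
    wide = "40001 1 5 GGGG".encode("utf-16-le")
    print(filler_byte_ratio(plain))
    print(filler_byte_ratio(wide))
```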
Here is how to see the hidden end-of-line characters in gVim 7.4:
Open gVim 7.4.
Open the data file in gVim 7.4.
Press the Escape key a few times to make sure you are in Normal mode. Note that pressing the Escape key produces no visible change in the gVim 7.4 window.
Type :set list (it appears at the bottom of the gVim 7.4 window).
Press the Enter key.
Once I did the above, I saw a blue $ at the end of every line, which I assume is a hidden end-of-line character.
Maybe if I am able to remove these blue $ symbols and save the result under a new name, SAS might be able to read that new data file. If I figure this out I will post an update.
EDIT
I tried to modify the instructions posted by John Black in "Read csv file with hidden or invisible character ^M" to remove the $, but so far have had no luck.
I typed :%s/$//g, which replaced the blue $ with yellow $. Then I saved the file under a new name and opened the new file with gVim. But when I typed :set list, the blue $ were still present in the new file.
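The $ that :set list displays marks where each line ends; it is not a character stored in the file, which is why substituting it away changes nothing. Deleting the real terminators would also merge every record into one line. A gentler alternative, sketched below in Python, is to rewrite all terminators to one consistent form. The function name `normalize_line_endings` is made up for this example, and choosing CRLF as the default target is an assumption based on the file being read on Windows.

```python
def normalize_line_endings(data: bytes, newline: bytes = b"\r\n") -> bytes:
    """Rewrite CRLF, lone CR, and lone LF terminators to a single form."""
    # Collapse CRLF first so its CR is not converted a second time,
    # then turn any remaining lone CR into LF.
    unified = data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
    # Finally expand every LF to the requested terminator.
    return unified.replace(b"\n", newline)


if __name__ == "__main__":
    # Made-up sample mixing three terminator styles.
    print(normalize_line_endings(b"a\r\nb\nc\rd"))
```

The result should be written to a new file opened in binary mode, so that the original stays untouched for comparison.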