Data file in horizontal format containing hidden characters

I have been provided a data file in a format I have never seen. The data do not appear to be in columns, but rather in one long row. I can open the file in Notepad and see the data, so the data do not appear to be encrypted.

When I open the data file in Notepad, the row of data wraps back to the left side of the Notepad window, I guess when the data reach the maximum number of characters that Notepad allows in a single row, and then the data continue in a new row.

There might be 10,000 rows of data when I open the file in Notepad. The data in one of these rows are not aligned with the data in the row above it or below it.

Here are some example data:

40001       1    5 GGGG  2998 HHHH SU111111       95     1.0 F1  4                1304    3        0               0
40001       1    5 GGGG  2998 HHHH SU111111       95     1.0 F1  4                0205             0     3         0
40001       1    5 GGGG  2998 HURG SU111111       95     1.0 F1  4                0805             0     2         0
40001       1    5 GGGG  2998 HHHH SU111111       95     1.0 F1  4                1205             0     2         0
40001       1    5 GGGG  2998 HHHH SU111111       95     1.0 F1  4                1505             0               0
40002       2    8 GGGG  2998 PPPP SK777777     -999     1.0 F3  4                2003             0               0
40002       2    8 GGGG  2998 PPPP SK777777     -999     1.0 F3  4                2303    2        0               0
40002       2    8 GGGG  2998 PPPP SK777777     -999     1.0 F3  4                2703    3        0               0
40002       2    8 GGGG  2998 PPPP SK777777     -999  

Notice that when I paste the example data here, representing one row in Notepad, the columns are 'magically' aligned.

I have found that I can open the data file in Excel and the data are also aligned. I do need to manually assign column boundaries in Excel, however. And Excel does not allow me to assign a column boundary beyond more-or-less character space 123.

Below is SAS code to read the data file, although this SAS code does not work correctly. Rather, I guess this SAS code skips some of the data rows. Notice that the variable TT covers character spaces 125-207, but that there are only 120 characters in most rows. There are more than 120 characters in some rows. I suspect this difference in the number of characters among rows is the reason SAS cannot read this data file correctly.

option linesize = 210 ;
option pagesize =  30 ;

FILENAME myinput  'C:/Users/markm/simple SAS programs/mydata.new' ;

DATA mydata ;

INFILE myinput ;

INPUT

AA       2-9
BB      12-17
CC      18-22
DD   $  24-27
EE      30-33
FF   $  35-38
GG   $  40-47
HH      53-56
II      59-64
JJ   $  66-68
KK   $  70-71
LL      72-78
MM      79-85
NN   $  87-90
OO      91-95
PP     97-104
QQ    105-110
RR    112-120
SS $  122-123
TT $  125-207 ;

If I move the cursor to the right one character at a time over the first row of data using the right-arrow key, I have to press the right-arrow key twice to move beyond character space 120 in Notepad.

All of this is telling me there are hidden characters in the data file used to identify the end of a line of data.

I opened the data file in Vim hoping to see these hidden characters, but did not see anything. Vim did align the columns correctly when I opened the file. So, Vim must be seeing these hidden end-of-line characters.

How can I see these end-of-line characters myself? I suspect there is an option in Vim to reveal the hidden characters.

How can I determine the application that created this data file?

How can I modify the above SAS code to read this data file correctly?

First off, double-check your LRECL. You're missing basically half of your data, which makes me think you're reading in two lines for each line. You show 207 as your maximum line size, which should be under the default LRECL of 256, but seeing a number about 1/2 of the correct number makes me think you've made a mistake there.
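One way to test the LRECL theory outside SAS is to measure the records directly. The sketch below is a minimal Python example and not part of the original answer; the function name `record_lengths` and the sample bytes are made up for illustration. It tallies the byte length of every record so the longest one can be compared against the LRECL in use.

```python
from collections import Counter


def record_lengths(data: bytes) -> Counter:
    """Count how many records have each byte length (terminators excluded)."""
    # bytes.splitlines() splits on LF, CR, and CRLF and drops the terminators.
    return Counter(len(line) for line in data.splitlines())


if __name__ == "__main__":
    # Hypothetical sample; for a real file use: data = open(path, "rb").read()
    sample = b"40001  1\r\n40002  2 extra\r\n40003  3\r\n"
    lengths = record_lengths(sample)
    print("longest record:", max(lengths), "bytes")
    for length, count in sorted(lengths.items()):
        print(f"{length:>6} bytes: {count} record(s)")
```

If the longest record is well under the LRECL, the record length setting is probably not what is dropping rows, and the spread of lengths shows how many records are shorter than the last column the INPUT statement asks for.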

Next, figure out whether you are seeing basically every other line, or the first 44k lines and then a sudden stop. If the latter, you have a DOS EOF character (1A) in the data, and you need to set the IGNOREDOSEOF option. If the former, then you have either an obvious LRECL problem as above, or a nonobvious LRECL problem caused by Unicode characters taking up multiple bytes (try LRECL=32767 and see if that fixes it; this would also cause your data to look funny at some point in each line), or a weird line terminator problem (though an inconsistent one).

Then, assuming there is a problem with EOL characters (or EOF?), the way to approach this is to see exactly what is in your data file.

Read in a dummy character, and then put the _infile_ line with the hex. format. For example:

data test;
    infile "d:\temp\utf8.txt" lrecl=256 RECFM=f;
    input @1 x $1. @;
    r = repeat('1234567890',8); *make this appropriate for your LS option in your log;
    put r;
    put _infile_;
    put _infile_ hex512.;
    stop; *we want to see just one line here;
run;

The hex width needs to be exactly double the line length, because each byte prints as two hex digits; reading 20-byte records, for instance, would call for hex40. You can leave the length off (hex.), but you'll get some really long lines with tons of blanks if you do that. In your case, with lrecl=207, you should use hex414. in theory (but you might want to use lrecl=256 and hex512., as above, just in case). Since we're using RECFM=F, the idea is to have an LRECL longer than your real line length, so you can see a whole line in one run of this. (If one line doesn't tell you enough, use firstobs= to navigate to a later line, recognizing that if your LRECL is not exactly right for the data, you won't be skipping to the start of a true line, but skipping in 256-byte chunks.)

That will give you two strings: the 'visible' string, which may be helpful for seeing what SAS thinks is at what spot, and the hex codes behind the visible string. The hex codes are two digits per character (one byte = two hex digits), assuming you're in an ASCII environment (not a DBCS or Unicode environment). See this page for a list of ASCII codes.
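The same ruler, visible string, and hex string can be produced without SAS. This is a minimal Python sketch of the idea; `hex_view` and the sample bytes are hypothetical names and data, not something from the original post. Non-printable bytes are shown as `.` in the visible line so that control characters stand out.

```python
def hex_view(data: bytes, width: int = 40) -> list[str]:
    """Return a column ruler, the printable text, and the hex codes for the first bytes."""
    chunk = data[:width]
    ruler = ("1234567890" * (width // 10 + 1))[:len(chunk)]
    # Show printable ASCII as-is and everything else (CR, LF, 1A, ...) as '.'.
    visible = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
    # Two hex digits per byte, so this line is twice as long as the other two.
    hex_codes = chunk.hex().upper()
    return [ruler, visible, hex_codes]


if __name__ == "__main__":
    # Hypothetical sample; for a real file use: data = open(path, "rb").read()
    sample = b"40001       1    5 GGGG  2998 HHHH\r\n40001"
    for line in hex_view(sample, width=40):
        print(line)
```

Because the hex line holds two digits per byte, the byte at ruler position n sits at hex positions 2n-1 and 2n.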

Hex codes to look for:

  • 1A = DOS EOF character
  • 0A = LF
  • 0D = CR

If this is a Windows/DOS document, you should see CRLF consecutively at the ends of lines, i.e., 0D0A in a row, somewhere around 207. If this is a Unix document, you will see just 0A there. If this is a classic Mac OS document (before OS X), you will see just 0D. Why would anyone want to be consistent?
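Rather than scanning hex output by eye, the terminator bytes can be counted over the whole file. The following is a rough Python sketch under the assumption that the file fits in memory; `count_terminators` and its dictionary keys are illustrative names only.

```python
def count_terminators(data: bytes) -> dict[str, int]:
    """Count CRLF pairs, lone LF, lone CR, and DOS EOF (1A) bytes."""
    crlf = data.count(b"\r\n")
    return {
        "crlf": crlf,
        # Every CRLF pair contains one LF and one CR, so subtract the pairs.
        "lf": data.count(b"\n") - crlf,
        "cr": data.count(b"\r") - crlf,
        "dos_eof": data.count(b"\x1a"),
    }


if __name__ == "__main__":
    # Hypothetical sample; for a real file use: data = open(path, "rb").read()
    sample = b"row one\r\nrow two\r\nrow three\n\x1a"
    print(count_terminators(sample))
```

A file with one consistent convention has a nonzero count in only one of the first three entries; a mix of them points to the inconsistent terminator problem, and a nonzero `dos_eof` count points to the 1A case.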

You probably will see something, since you're getting some number of lines. (If there were no line terminator, SAS would just give up after the first line.) You are more likely to have one of the following problems:

  • This is a DBCS file, so all characters really take up more than one byte. If you see a lot of 00 or 40 or 20 between characters (like, every single character has one), you have a DBCS (double-byte character set) file. This is what, say, a Chinese or Japanese copy of Windows would likely produce. They use two bytes for every character in order to represent the full set of characters in their languages, but even when storing English documents they still use the full set, basically just adding a filler byte so the text still has a reasonable ASCII appearance in noncompatible programs (or programs not set up properly, as SAS would be in this case).
  • This is a UTF-8 file, where characters may take multiple bytes (but may not). In this case you probably see some 'junk' in the data when viewing it this way, and every so often you get a character that takes up two or three spaces, often entirely full of 'junk' characters. UTF-8 takes between 1 and 4 bytes per character, but will look 'normal' for ASCII characters (i.e., it takes ASCII and adds a lot, leaving the 00-7F range unchanged).
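The two cases above leave different fingerprints in the raw bytes, which a short script can summarize. This Python sketch is only a heuristic, and `encoding_hints` and its keys are names invented here. It reports how many 00 bytes and non-ASCII bytes the data contains, whether the data decodes as UTF-8, and whether it starts with a byte-order mark.

```python
def encoding_hints(data: bytes) -> dict:
    """Collect simple byte-level hints about how the text might be encoded."""
    boms = {
        b"\xef\xbb\xbf": "UTF-8",
        b"\xff\xfe": "UTF-16 LE",
        b"\xfe\xff": "UTF-16 BE",
    }
    bom = next((name for mark, name in boms.items() if data.startswith(mark)), None)
    try:
        data.decode("utf-8")
        valid_utf8 = True
    except UnicodeDecodeError:
        valid_utf8 = False
    return {
        "nul_bytes": data.count(b"\x00"),
        "non_ascii_bytes": sum(1 for b in data if b >= 0x80),
        "valid_utf8": valid_utf8,
        "bom": bom,
    }


if __name__ == "__main__":
    # Hypothetical samples; for a real file use: data = open(path, "rb").read()
    print(encoding_hints(b"40001       1    5 GGGG\r\n"))
    print(encoding_hints("40001 GGGG".encode("utf-16-le")))
```

Roughly one 00 byte per character suggests a two-byte encoding like the filler-byte case described above. Non-ASCII bytes in data that still decodes as UTF-8 suggest multi-byte UTF-8 characters. Zeros in both counts mean the file is plain ASCII and neither explanation applies.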

My gut is that you have a DBCS file, given you're skipping roughly every other line (though not exactly, and you are skipping MORE than that, which makes this a bit odd to me).

Here is how to see the hidden end-of-line characters in gVim 7.4:

  1. Open gVim 7.4.

  2. Open the data file in gVim 7.4.

  3. Press the escape key a few times to access the line editor. Note that pressing the escape key produces no visible change in the gVim 7.4 window.

  4. Type :set list at the bottom of the gVim 7.4 window.

  5. Press the enter key.

Once I did the above I saw a blue $ at the end of every line, which I assume is an end-of-line hidden character.

Maybe if I am able to remove these blue $ symbols and save the result under a new name, SAS might be able to read that new data file. If I figure this out I will post an update.

EDIT

I tried to modify the instructions posted here by John Black to remove the $, but so far have had no luck: Read csv file with hidden or invisible character ^M

I typed :%s/$//g, which replaced the blue $ with yellow $. Then I saved the file under a new name and opened the new file with gVim. But when I typed :set list, the blue $ were still present in the new file.
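The $ that :set list displays marks where each line ends; it is not a literal dollar character stored in the file, so a substitution on $ has nothing to delete and the markers return whenever the file is reopened. If the skipped rows come from records shorter than the last column the INPUT statement requests (an assumption; nothing in this thread confirms it), one workaround is to rewrite the file so that every record is padded to the same width with a single terminator. SAS also has INFILE options such as TRUNCOVER and PAD for short records; the minimal Python sketch below works on the file instead. The function name `pad_records` is hypothetical, and the width of 207 is taken from the last column of variable TT.

```python
def pad_records(data: bytes, width: int, newline: bytes = b"\r\n") -> bytes:
    """Pad every record with spaces to at least `width` bytes and unify terminators."""
    # splitlines() accepts LF, CR, and CRLF; longer records are left unshortened.
    return b"".join(line.ljust(width) + newline for line in data.splitlines())


if __name__ == "__main__":
    # Hypothetical sample; for a real file read the bytes with open(path, "rb")
    # and write the result to a new file opened with "wb".
    sample = b"40001       1    5 GGGG\n40002       2    8 GGGG  2998 PPPP\r\n"
    padded = pad_records(sample, width=207)
    print([len(line) for line in padded.splitlines()])
```

Every record in the output is at least 207 bytes long, so a column range such as 125-207 always falls inside the current record.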
