Reading HTM file: mysterious white space around every character

I have an HTM file. When I open it directly in Notepad, it looks like this:

<HTML>
<BODY BGCOLOR=#FFFFFF BGPROPERTIES=FIXED>
<FONT 000000 FACE=ARIAL SIZE=3>
<HEAD>

When I read the file in Perl like this:

open (my $fh, '<', $filename) or die "Error opening file! $!";
chomp(my @lines = <$fh>);
close $fh;

Each line in the Perl array now has these extra spaces and looks like this:

< H T M L >    
< B O D Y   B G C O L O R = # F F F F F F   B G P R O P E R T I E S = F I X E D >    
< F O N T   0 0 0 0 0 0   F A C E = A R I A L   S I Z E = 3 >    
< H E A D >   

Any ideas on where the problem is?

CLARIFICATION: These are not my HTM files, so I have no control over them or their creation. I receive the file and must process the contents. Various attacks like s/ (?= |\w)//g don't seem to affect this mysterious whitespace.

The output is being generated this way:

foreach (@lines) {
    $line .= "$_\n";
}

open( $fh, '>', 'output-file.txt' ) or die "Could not open file $!";
print $fh $line;
close $fh;

There is no text but encoded text. Every file is written with one specific character encoding and must be read with that same encoding.

HTML files are formatted text. They have a document encoding: the one the file is written with. The document "value" is a sequence of Unicode characters. If the file doesn't use a Unicode encoding, characters can be represented as numeric character entities (e.g., &#x1f6b2; instead of 🚲). They also have a mechanism to indicate the document encoding internally (meta charset), but, apparently, that was not used.

When you receive a text file, you must also have knowledge of which encoding was used to write it. If you don't have that, it's a failed communication. (Web servers and browsers prevent that by telling each other which encoding they are using with the HTTP Content-Type header. Unfortunately, with programs dropping files into the filesystem of a single system, there has been too much reliance on defaults or "detection" [informed guessing].)

As others have said, it looks like your text renderer is coping with UTF-16 encoded text by emitting a space where it sees a zero byte. (I wonder how it would deal with 🚲.) People are asking for a hex dump of your bytes so they can improve the guess. If it is consistent with UTF-16, that would be a highly probable guess, even with such a small sample.
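A hex dump like the one being asked for can be produced in Perl itself. This is only a sketch: hex_head is a hypothetical helper name, and the 16-byte count is an arbitrary choice. If the file is UTF-16LE with a byte order mark, the dump would start with ff fe 3c 00 48 00 (the BOM, then "<" and "H" each followed by a zero byte); UTF-16BE would start with fe ff 00 3c 00 48.

```perl
use strict;
use warnings;

# Return the first $count bytes of a file as space-separated hex pairs.
# hex_head is a hypothetical helper name, not part of any module.
sub hex_head {
    my ($path, $count) = @_;
    # :raw disables any newline or encoding translation, so we see the real bytes
    open(my $fh, '<:raw', $path) or die "Cannot open $path: $!";
    my $got = read($fh, my $buf, $count);
    die "Cannot read $path: $!" unless defined $got;
    close $fh;
    # (H2)* unpacks each byte as a two-digit hex string
    return join ' ', unpack('(H2)*', $buf);
}

# Usage: perl hexhead.pl input.htm
print hex_head($ARGV[0], 16), "\n" if @ARGV;
```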

The solution is simple: Confirm with the sender that the encoding is UTF-16 and then read it as UTF-16LE or UTF-16BE depending on the byte ordering. The byte ordering is easy to detect, given the knowledge that the encoding is UTF-16. So, slurp the file as a byte string and decode the bytes into a text string using Encode::Unicode.
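A minimal sketch of that approach, assuming the file really is UTF-16. The helper names decode_utf16 and read_utf16_lines are hypothetical, and the rule used when there is no byte order mark (a leading zero byte means big-endian, otherwise little-endian) is an assumption that only holds when the file starts with an ASCII character such as "<". The Encode module loads Encode::Unicode for the UTF-16 names.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Decode a UTF-16 byte string into a Perl text string.
# Hypothetical helper; the BOM-less fallback is a guess, not a guarantee.
sub decode_utf16 {
    my ($bytes) = @_;
    my $check = Encode::FB_CROAK() | Encode::LEAVE_SRC();  # die on malformed input
    my $bom = substr($bytes, 0, 2);
    if ($bom eq "\xFF\xFE" || $bom eq "\xFE\xFF") {
        # 'UTF-16' reads the BOM, picks the byte order, and drops the BOM
        return decode('UTF-16', $bytes, $check);
    }
    # No BOM: see which half of the first code unit is the zero byte
    my $name = substr($bytes, 0, 1) eq "\x00" ? 'UTF-16BE' : 'UTF-16LE';
    return decode($name, $bytes, $check);
}

# Slurp the file as bytes, decode it, and split it into lines.
sub read_utf16_lines {
    my ($path) = @_;
    open(my $fh, '<:raw', $path) or die "Error opening file! $!";
    my $bytes = do { local $/; <$fh> };   # undef $/ reads the whole file
    close $fh;
    return split /\r?\n/, decode_utf16($bytes);
}
```

With this in place, my @lines = read_utf16_lines($filename); would stand in for the open/chomp/close sequence in the question. When writing the decoded text back out, the output handle needs an explicit encoding too, for example open(my $out, '>:encoding(UTF-8)', 'output-file.txt').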

I applied s/\x0//g, which apparently transformed many nulls into Chinese characters. I cleaned these out with s/[^[:ascii:]]+//g;. It isn't ideal but seems to work.
