
Reading HTM file: mysterious white space around every character

I have an HTM file. When I open it directly in Notepad, it looks like this:

<HTML>
<BODY BGCOLOR=#FFFFFF BGPROPERTIES=FIXED>
<FONT 000000 FACE=ARIAL SIZE=3>
<HEAD>

When I attempt to do this in Perl:

open (my $fh, '<', $filename) or die "Error opening file! $!";
chomp(my @lines = <$fh>);
close $fh;

Each line in the Perl array now has these extra spaces and looks like this:

< H T M L >    
< B O D Y   B G C O L O R = # F F F F F F   B G P R O P E R T I E S = F I X E D >    
< F O N T   0 0 0 0 0 0   F A C E = A R I A L   S I Z E = 3 >    
< H E A D >   

Any ideas on where the problem is?

CLARIFICATION: These are not my HTM files, so I have no control over them or their creation. I receive the file and must process its contents. Various attempts like s/ (?= |\w)//g don't seem to affect this mysterious whitespace.

The output is being generated this way:

my $line = '';
foreach (@lines) {
    $line .= "$_\n";
}

open( $fh, '>', 'output-file.txt' ) or die "Could not open file $!";
print $fh $line;
close $fh;

There is no text but encoded text. Every file is written with one specific character encoding and must be read with that same encoding.

HTML files are formatted text. They have a document encoding: the one the file is written with. The document's "value" is a sequence of Unicode characters. If the file doesn't use a Unicode encoding, characters can be represented as numeric character references (e.g., &#x1f6b2; instead of 🚲). HTML also has a mechanism to indicate the document encoding internally (the <meta charset> tag), but apparently that was not used here.

When you receive a text file, you must also know which encoding was used to write it. If you don't, the communication has failed. (Web servers and browsers avoid this by telling each other which encoding they are using via the HTTP Content-Type header. Unfortunately, when programs simply drop files into the filesystem of a single system, there has been too much reliance on defaults and "detection", i.e. informed guessing.)
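For illustration, a server declaring a UTF-16 document would send a response header like this (values are hypothetical):

```
HTTP/1.1 200 OK
Content-Type: text/html; charset=UTF-16
```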

As others have said, it looks like your text renderer is coping with UTF-16-encoded text by emitting a space wherever it sees a zero byte. (I wonder how it would deal with 🚲.) People are asking for a hex dump of your bytes so they can improve the guess. If the bytes are consistent with UTF-16, that would be a highly probable guess, even from such a small sample.
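You can produce that hex dump yourself. This sketch simulates the file's bytes in memory to show what UTF-16LE ASCII looks like (each letter followed by a 00 byte); in practice you would read the first bytes of your real file with open(my $fh, '<:raw', $filename) and read().

```perl
use strict;
use warnings;
use Encode qw(encode);

# Simulate "<HTML>" as UTF-16LE bytes; with a real file, slurp the
# first 16 bytes or so through a ':raw' filehandle instead.
my $buf = encode('UTF-16LE', '<HTML>');

# Dump each byte as two hex digits, space-separated.
my $hex = join ' ', unpack '(H2)*', $buf;
print "$hex\n";   # 3c 00 48 00 54 00 4d 00 4c 00 3e 00
```

The alternating 00 bytes are exactly what a naive renderer turns into "spaces between every character".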

The solution is simple: confirm with the sender that the encoding is UTF-16, then read it as UTF-16LE or UTF-16BE depending on the byte order. The byte order is easy to detect once you know the encoding is UTF-16. So slurp the file as a byte string and decode the bytes into a text string using Encode::Unicode.
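A minimal sketch of that slurp-and-decode approach. The BOM check and the leading-NUL fallback are assumptions, not something the question confirms; verify the real encoding with the sender. The file's bytes are simulated in memory here so the snippet is self-contained.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Simulated input: two lines as UTF-16LE with a byte-order mark.
# In practice: open my $fh, '<:raw', $filename or die $!;
#              my $bytes = do { local $/; <$fh> };
my $bytes = "\xFF\xFE" . encode('UTF-16LE', "<HTML>\n<HEAD>\n");

# Pick the byte order from the BOM; without one, a NUL as the very
# first byte of ASCII-heavy text suggests big-endian.
my $encoding =
      $bytes =~ /\A\xFF\xFE/         ? 'UTF-16LE'
    : $bytes =~ /\A\xFE\xFF/         ? 'UTF-16BE'
    : substr($bytes, 0, 1) eq "\x00" ? 'UTF-16BE'
    :                                  'UTF-16LE';

$bytes =~ s/\A(?:\xFF\xFE|\xFE\xFF)//;    # drop the BOM
my $text  = decode($encoding, $bytes);    # bytes -> Perl text string
my @lines = split /\n/, $text;

print "$lines[0]\n";   # <HTML>
```

Alternatively, once you know the file is UTF-16 with a BOM, an I/O layer can do the decoding for you: open my $fh, '<:encoding(UTF-16)', $filename.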

I applied s/\x00//g, which apparently turned many of the nulls into Chinese characters. I cleaned those out with s/[^[:ascii:]]+//g. It isn't ideal, but it seems to work.
