
Avoid non-printable characters in an HTML file written by Python

I'm trying to convert SPSS syntax files to readable HTML. It works almost perfectly, except that a single non-printable character is inserted into the HTML file. It doesn't seem to have an ASCII code and looks like a tiny dot, and it's causing trouble.

It occurs only in the second line of the HTML file, which always corresponds to the first line of the original file. That probably hints at which line(s) of Python cause the problem (please see the comments in the code).

The code that seems to cause this is:

    rfil = open(fil,"r") #rfil =  Read File, original syntax
    wfil = open(txtFil,"w") #wfil =  Write File, HTML output
    #Line below causes problem??
    wfil.write("<ol class='code'>\n<li>") 
    cnt = 0
    for line in rfil:
        if cnt == 0:
            #Line below causes problem??
            wfil.write(line.rstrip("\n").replace("'",'&#39;').replace('"','&#34;')) 
        elif len(line) > 1:
            wfil.write("</li>\n<li>" + line.strip("\n").replace("'",'&#39;').replace('"','&#34;'))
        else:
            wfil.write("<br /><br />")
        cnt += 1
    wfil.write("</li>\n</ol>")
    wfil.close()
    rfil.close()

[Screenshot of the result]

The input file seems to begin with a byte order mark (BOM), which indicates UTF-8 encoding. You can decode the file to Unicode strings by opening it with

    import codecs
    rfil = codecs.open(fil, "r", "utf_8_sig")

The utf_8_sig encoding skips the BOM at the beginning of the file.
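To see the difference between the two codecs, here is a minimal sketch that decodes the same bytes both ways. The sample SPSS line and the variable names are made up for illustration; only `codecs.BOM_UTF8` and the two encoding names come from the standard library.

```python
import codecs

# Hypothetical first line of an SPSS syntax file, saved as UTF-8 with a BOM.
raw = codecs.BOM_UTF8 + u"GET FILE='data.sav'.\n".encode("utf-8")

with_bom = raw.decode("utf_8")         # the BOM survives as the character U+FEFF
without_bom = raw.decode("utf_8_sig")  # the BOM is dropped during decoding

print(repr(with_bom[0]))      # the invisible character that ends up in the HTML
print(repr(without_bom[0]))   # the first real character of the line
```

If the file has no BOM, utf_8_sig decodes it like plain UTF-8, so it should be safe to use for both kinds of input.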

Some programs recognize the BOM, some don't. To write the file out without a BOM, use

    wfil = codecs.open(txtFil, "w", "utf_8")
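Putting both calls together, the loop from the question might look roughly like the sketch below. The function name `syntax_to_html` and its parameters are hypothetical, and the blank-line test is slightly simplified compared with the original `len(line) > 1` check, so treat it as a starting point rather than a drop-in replacement.

```python
import codecs

def syntax_to_html(src_path, dst_path):
    """Write the lines of src_path as an HTML ordered list, without a BOM."""
    # utf_8_sig drops a leading BOM on read; plain utf_8 writes none.
    with codecs.open(src_path, "r", "utf_8_sig") as rfil, \
         codecs.open(dst_path, "w", "utf_8") as wfil:
        wfil.write(u"<ol class='code'>\n<li>")
        for cnt, line in enumerate(rfil):
            # Strip the line ending and escape quotes, as in the question.
            text = line.rstrip(u"\r\n").replace(u"'", u"&#39;").replace(u'"', u"&#34;")
            if cnt == 0:
                wfil.write(text)
            elif text:
                wfil.write(u"</li>\n<li>" + text)
            else:
                # Empty source line: keep the vertical spacing.
                wfil.write(u"<br /><br />")
        wfil.write(u"</li>\n</ol>")
```

Calling `syntax_to_html(fil, txtFil)` with the question's two paths would then replace the manual `open`/`close` calls, and the `with` statement closes both files even if an error occurs.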

What you see is a byte-order mark, or BOM. The way you see it, \xef\xbb\xbf, says that the strings you work with are actually UTF-8; you can convert them into proper Unicode ( line.decode('utf-8') ) to make manipulation easier.

Then you can augment the logic for the first line so that it safely removes the BOM:

    for raw_line in rfil:
        line = raw_line.decode('utf-8') # now line is Unicode
        # u'' prefix: in a Python 2 byte string, '\ufeff' is not an escape
        if cnt == 0 and line.startswith(u'\ufeff'):
            line = line[1:] # cut the first character, which is a BOM
        ...
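As a self-contained variant of the same idea, the sketch below wraps the decoding and BOM check in a generator. The name `decoded_lines` and the sample data are assumptions for illustration; the input is expected to be an iterable of UTF-8 byte lines, such as a file opened in binary mode.

```python
import codecs

def decoded_lines(raw_lines):
    """Yield Unicode lines from UTF-8 byte lines, dropping a leading BOM."""
    for cnt, raw_line in enumerate(raw_lines):
        line = raw_line.decode("utf-8")
        # Only the very first line can carry the BOM (decoded as U+FEFF).
        if cnt == 0 and line.startswith(u"\ufeff"):
            line = line[1:]
        yield line

# Hypothetical input: two byte lines, the first one prefixed with a BOM.
sample = [codecs.BOM_UTF8 + b"GET FILE='a.sav'.\n", b"EXECUTE.\n"]
lines = list(decoded_lines(sample))
```

With a real file this would be used as `for line in decoded_lines(open(fil, "rb")):`, leaving the rest of the loop body unchanged.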
