How to properly decode and re-encode a file name from latin-1 in order to open it

Question

I am trying to open a local file in Selenium using:

driver.get( ('file://' + file ))

where file is the file name.

It seems the file name has latin-1 characters in it:

..\\\\PRODUCT NAME – Something Something.html

When I use file.decode('latin-1'), I get:

..\\\\PRODUCT NAME \\x96 Something Something.html

If I simply use driver.get( ('file://' + file )), I get:

UnicodeDecodeError: 'utf8' codec can't decode byte 0x96 in position 81: invalid start byte
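The error above can be reproduced outside Selenium. The following is a minimal sketch, assuming Python 3 and a shortened, hypothetical byte string in place of the real path (so the failing position is 13 here rather than 81):

```python
# Hypothetical, shortened stand-in for the real file name bytes.
raw = b"PRODUCT NAME \x96 Something Something.html"

try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    # 0x96 is a UTF-8 continuation byte, so it cannot begin a character.
    reason = exc.reason
    position = exc.start
    bad_byte = raw[exc.start]
    print(reason, position, hex(bad_byte))
```

The byte 0x96 on its own is never valid UTF-8, which suggests the name was stored in a single-byte encoding.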

If I use driver.get( ('file://' + file.decode('latin_1') )), I get an error stating that the file is not found:

...fileNotFound&u=file%3A///C%3A/PRODUCT%20NAME%20%C2%96%20Something%20Something.html.
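The %C2%96 in that URL is a hint about what went wrong. Here is a small sketch, assuming Python 3 and a hypothetical shortened file name, comparing how latin-1 and cp1252 interpret byte 0x96 and what each result looks like once percent-encoded as UTF-8:

```python
from urllib.parse import quote

# Hypothetical, shortened stand-in for the real file name bytes.
raw = b"PRODUCT NAME \x96 Something Something.html"

as_latin1 = raw.decode("latin-1")  # 0x96 -> U+0096, an invisible control character
as_cp1252 = raw.decode("cp1252")   # 0x96 -> U+2013, the en dash

url_latin1 = quote(as_latin1)      # quote() encodes as UTF-8 by default
url_cp1252 = quote(as_cp1252)

print(url_latin1)
print(url_cp1252)
```

Decoding as latin-1 never fails, because every byte maps to some code point, but here it yields a control character instead of the dash, so the resulting URL names a file that does not exist.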

I'm not 100% sure what encoding it's expecting, but I've tried re-encoding the file name as unicode and utf-8 with no luck (same error: the file isn't found).

Any idea how I could solve this problem? Renaming the file itself won't be an option, unfortunately. I want to properly decode it, then re-encode it (the encoding sandwich others have recommended).

Answer 1

Figured it out. The problem is that the HTML document declares charset='utf-8', which isn't strictly true. The title contains ' – ', which is cp1252 encoded.

I solved it by adding an exception handler and a character-correcting function:

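import sys
import traceback

from selenium import webdriver


def selenium_extractor(file):

    def charset_correct(filename):
        # Python 2: filename is a byte string; byte 0x96 is an en dash in cp1252.
        try:
            return filename.decode('cp1252')
        except UnicodeError:
            # EXTEND WITH FURTHER DECODINGS
            traceback.print_exc()
            sys.exit(1)

    driver = webdriver.Firefox()

    try:
        try:
            driver.get('file://' + file)
        except UnicodeError:
            # Retry with the file name decoded from cp1252.
            driver.get('file://' + charset_correct(file))
    finally:
        # Always close the browser, including when the retry fails.
        driver.quit()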
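For readers on Python 3, where str no longer has a decode method, the same idea can be sketched as below. This is an illustration, not the code from the answer: decode_filename and to_file_url are hypothetical helpers, the candidate encodings are an assumption, and the URL builder assumes a Windows drive path like the one in the question.

```python
from urllib.parse import quote


def decode_filename(raw, encodings=("utf-8", "cp1252")):
    """Decode file name bytes, trying each candidate encoding in order."""
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue  # Extend the encodings tuple with further candidates.
    raise ValueError("could not decode file name: %r" % (raw,))


def to_file_url(path_text):
    """Build a file:// URL from a decoded path (assumes a form like C:\\dir\\name)."""
    posix_style = path_text.replace("\\", "/").lstrip("/")
    # Keep '/' and the drive colon literal; everything else is percent-encoded as UTF-8.
    return "file:///" + quote(posix_style, safe="/:")


raw_name = b"C:\\PRODUCT NAME \x96 Something Something.html"
url = to_file_url(decode_filename(raw_name))
print(url)
```

UTF-8 is tried first because it rejects invalid input, while cp1252 accepts almost any byte sequence and would otherwise mask a correctly encoded name. The resulting string could then be passed to driver.get(); whether the browser finds the file still depends on how the name is actually stored on disk.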

Solution 1
0 2018-10-30 15:40:12
