How to properly decode and re-encode a file name from latin-1 in order to open it
I am trying to open a local file in Selenium using
driver.get( ('file://' + file ))
where file is the file name.
It seems the file name has latin-1 characters in it:
..\\\\PRODUCT NAME – Something Something.html
When I use file.decode('latin-1'), I get:
..\\\\PRODUCT NAME \\x96 Something Something.html
If I simply use driver.get( ('file://' + file )), I get:
UnicodeDecodeError: 'utf8' codec can't decode byte 0x96 in position 81: invalid start byte
If I use driver.get( ('file://' + file.decode('latin_1') )), I get an error stating that the file is not found:
...fileNotFound&u=file%3A///C%3A/PRODUCT%20NAME%20%C2%96%20Something%20Something.html.
I'm not 100% sure what encoding it's expecting, but I've tried re-encoding the file name as unicode and UTF-8 with no luck (same error: the file isn't found).
Any idea how I could solve this problem? Renaming the file itself won't be an option, unfortunately. I want to properly decode it, then re-encode it (the encoding sandwich others have recommended).
Figured it out. The problem is that the HTML document declares charset='utf-8', and that isn't strictly true. The title contains ' – ', which is cp1252 encoded.
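To illustrate why the latin-1 attempt produced %C2%96 in the URL while cp1252 recovers the dash, here is a small sketch of how the single byte 0x96 behaves under each codec. It is written for Python 3 (where bytes and str are separate types), which is an assumption, since the code in this post is Python 2; the variable names are only for illustration.

```python
from urllib.parse import quote

raw = b"\x96"  # the byte reported in the UnicodeDecodeError

# latin-1 maps every byte to the code point with the same number,
# so 0x96 becomes U+0096, an invisible C1 control character.
as_latin1 = raw.decode("latin-1")

# cp1252 assigns 0x96 to U+2013, the en dash seen in the file name.
as_cp1252 = raw.decode("cp1252")

# quote() percent-encodes the UTF-8 form of each string.
latin1_quoted = quote(as_latin1)   # the sequence seen in the failing URL
cp1252_quoted = quote(as_cp1252)   # the sequence for a real en dash

# On its own, 0x96 is a continuation byte, so it is not valid UTF-8.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

print(repr(as_latin1), repr(as_cp1252), latin1_quoted, cp1252_quoted, utf8_ok)
```

Decoding as latin-1 never fails, but here it yields a control character instead of the dash, which would explain why the resulting path does not match the file on disk.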
I've solved it by adding an exception handler and a character-correcting function:
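import sys
import traceback

from selenium import webdriver


def selenium_extractor(file):
    # Python 2: `file` is a byte string (str), so .decode() is available.
    def charset_correct(filename):
        try:
            # Decode the cp1252 bytes (e.g. 0x96, an en dash) to unicode.
            return filename.decode('cp1252')
        except UnicodeError:
            # EXTEND WITH FURTHER DECODINGS
            traceback.print_exc()
            sys.exit(1)

    driver = webdriver.Firefox()
    try:
        driver.get('file://' + file)
    except UnicodeError:
        # Fall back to the cp1252-decoded name when the raw bytes are rejected.
        driver.get('file://' + charset_correct(file))
    finally:
        # Close the browser whichever path was taken.
        driver.quit()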
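The "EXTEND WITH FURTHER DECODINGS" placeholder could be filled in with a list of candidate encodings tried in order. The following is only a rough sketch under stated assumptions: it targets Python 3, where the file name arrives as bytes, and decode_filename together with its default encoding order are hypothetical choices, not part of the solution above.

```python
def decode_filename(raw, encodings=("utf-8", "cp1252", "latin-1")):
    """Return raw decoded with the first candidate encoding that accepts it."""
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            # This codec rejects the bytes, so try the next candidate.
            continue
    raise ValueError("no candidate encoding could decode %r" % (raw,))


# The byte 0x96 is rejected by utf-8 and accepted by cp1252 as an en dash.
name = decode_filename(b"PRODUCT NAME \x96 Something Something.html")
print(repr(name))
```

Since latin-1 accepts every byte, it only makes sense as the last entry; with it in the list the ValueError is never reached, so drop it if an undecodable name should be reported instead.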