
In Python, how do you deal with other encodings in domain names?

I'm trying to parse the domain name from the Message-ID field of an email that has been loaded from a file, and compare it to the domain of the From field to see how closely they match. I measure the difference with nltk.edit_distance().

I'm using:

re.search('@[\\[\\]\\w+\\.]+',mail['Message-ID']).group()[1:]

but one spam message has the following:

mail2['Message-ID']
'<2011315123.04C6DACE618A7C2763810@\x82\xb1\x82\xea\x82\xa9\x82\xe7\x8c\xa9\x82\xa6\x82\xe9\x82\xbe\x82\xeb\x82\xa4>'

So when I try to match that, the regex finds nothing, and group() has no match to return.

I can decode it as Shift_JIS, but I don't know what to do with it from there:

<2011315123.04C6DACE618A7C2763810@これから見えるだろう>

I don't want to try and check for every possible character encoding.

Any ideas on what I should do with it?

You can try the chardet project, which guesses the character encoding of a byte string:

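import chardet

# Python 2: a plain string literal is a byte string, which is what chardet expects.
text = '<2011315123.04C6DACE618A7C2763810@\x82\xb1\x82\xea\x82\xa9\x82\xe7' + \
    '\x8c\xa9\x82\xa6\x82\xe9\x82\xbe\x82\xeb\x82\xa4>'
cset = chardet.detect(text)  # dict with the guessed encoding and a confidence score
print cset
encoding = cset['encoding']
print encoding, text.decode(encoding)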

Output:

{'confidence': 1, 'encoding': 'SHIFT_JIS'}
SHIFT_JIS <2011315123.04C6DACE618A7C2763810@これから見えるだろう>
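Once the encoding is known, the domain can be pulled out of the decoded text rather than the raw bytes. The following is a minimal sketch, not part of the original answer, and it assumes Python 3, where `\w` in a str pattern also matches non-ASCII letters. The helper name `extract_domain` and the pattern `DOMAIN_RE` are hypothetical, and the encoding is passed in by hand here; in practice it could come from `chardet.detect()` as shown above.

```python
import re

# Hypothetical pattern: like the one in the question, but applied to decoded
# text, so \w also covers non-ASCII letters such as kana and kanji.
DOMAIN_RE = re.compile(r'@([\[\]\w.\-]+)')

def extract_domain(raw_message_id, encoding='ascii'):
    """Decode raw Message-ID bytes and return the part after '@', or None."""
    # errors='replace' keeps decoding from raising on bad bytes; replaced
    # characters are not matched by the pattern, so the result may be cut short.
    text = raw_message_id.decode(encoding, errors='replace')
    match = DOMAIN_RE.search(text)
    return match.group(1) if match else None

raw = (b'<2011315123.04C6DACE618A7C2763810@\x82\xb1\x82\xea\x82\xa9\x82\xe7'
       b'\x8c\xa9\x82\xa6\x82\xe9\x82\xbe\x82\xeb\x82\xa4>')
domain = extract_domain(raw, 'shift_jis')
print(domain)  # これから見えるだろう
```

The returned value is an ordinary Unicode string, so it can be compared against the From domain with nltk.edit_distance() in the same way as an ASCII domain.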
