In Python, how do you deal with other encodings in domain names?

Question

I'm trying to parse the domain name from the Message-ID field of an email that's been loaded from a file and compare it to the domain in the From field to see how closely the two match. I measure the difference using nltk.edit_distance().

I'm using

re.search('@[\\[\\]\\w+\\.]+',mail['Message-ID']).group()[1:]

but one spam message has the following

mail2['Message-ID']
'<2011315123.04C6DACE618A7C2763810@\x82\xb1\x82\xea\x82\xa9\x82\xe7\x8c\xa9\x82\xa6\x82\xe9\x82\xbe\x82\xeb\x82\xa4>'

So when I try to match that, the regex finds no match, so there is nothing to call group() on.

I can decode it as Shift_JIS, but I don't know what to do with it from there:

<2011315123.04C6DACE618A7C2763810@これから見えるだろう>

I don't want to try and check for every possible character encoding.

Any ideas of what I should do with it?

Answer 1

You can try the chardet project, which guesses the character encoding of a byte string heuristically:

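import chardet

# Python 2: a plain str literal is a byte string, which is what chardet expects.
text = ('<2011315123.04C6DACE618A7C2763810@\x82\xb1\x82\xea\x82\xa9\x82\xe7'
        '\x8c\xa9\x82\xa6\x82\xe9\x82\xbe\x82\xeb\x82\xa4>')
cset = chardet.detect(text)  # dict holding the guessed encoding and a confidence score
print cset
encoding = cset['encoding']
print encoding, text.decode(encoding)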

Output:

{'confidence': 1, 'encoding': 'SHIFT_JIS'}
SHIFT_JIS <2011315123.04C6DACE618A7C2763810@これから見えるだろう>
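Once the encoding is known, one possible next step is to decode the header, pull out the domain with a Unicode-aware regex, and convert it to its IDNA (Punycode) ASCII form, so that every domain can be compared as plain ASCII. The following is only a minimal sketch under some assumptions: it is written for Python 3 (unlike the Python 2 snippet above), the helper names extract_domain and to_ascii_domain are made up for illustration, the encoding is passed in by hand rather than taken from chardet, and it relies on the standard library's built-in idna codec.

```python
import re

# In Python 3, \w on a str pattern is Unicode-aware, so it matches
# kana and kanji as well as ASCII letters and digits.
DOMAIN_RE = re.compile(r'@([\w.\[\]-]+)')


def extract_domain(raw_header, encoding):
    """Decode a raw Message-ID (bytes) and return the part after '@', or None."""
    text = raw_header.decode(encoding, errors='replace')
    match = DOMAIN_RE.search(text)
    return match.group(1) if match else None


def to_ascii_domain(domain):
    """Return the IDNA (Punycode) form of a domain as a lowercase ASCII string."""
    return domain.encode('idna').decode('ascii').lower()


raw = (b'<2011315123.04C6DACE618A7C2763810@'
       b'\x82\xb1\x82\xea\x82\xa9\x82\xe7'
       b'\x8c\xa9\x82\xa6\x82\xe9\x82\xbe\x82\xeb\x82\xa4>')

# The encoding is assumed to be known here, e.g. from a chardet guess.
domain = extract_domain(raw, 'shift_jis')
ascii_domain = to_ascii_domain(domain)
print(ascii_domain)
```

The ASCII form can then be passed to nltk.edit_distance() alongside the From domain, treated the same way. Checking for a None result from re.search before calling group() also avoids the failure on headers where nothing matches.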

solution1
1 ACCEPTED 2011-04-24 01:27:02
