How do I categorise non-English email using procmail and command line tools?
I am subscribed to a mailing list where some of the messages are not in English, and I cannot understand them.

How do I filter the non-English messages to /dev/null using procmail and/or command line tools?

I use procmail to filter my email, so ideally any alternative tool would also come with a procmail recipe.

I'd prefer not to have to train my own language models.
One way is to use the Perl TextCat package from Gertjan van Noord. The text_cat script outputs the most likely language for the mail. This recipe assumes text_cat has been installed under /usr/local/bin.

Here is a simple procmail recipe to call the text_cat script:
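:0
* ^Subject.*Jobs.*Board
{
  # Backticks run text_cat with the whole message on its stdin
  LANG_=`/usr/local/bin/text_cat`

  # Discard the message unless text_cat answered exactly "english"
  :0
  * ! LANG_ ?? ^english$
  /dev/null

  # Everything that reaches this point is saved in jobs/
  :0
  jobs/
}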
I've been running text_cat for a few years. In that time no non-English message has been classified as English, that is, there have been no false positives. I've not been rigorous about checking for false negatives.
A second way, as mentioned by tripleee in a comment, is to use the language categorisation provided by spamassassin, which also uses the text_cat script. Spamassassin will unwrap any MIME transfer encodings, which the vanilla text_cat version above won't.

Here is an incompletely tested procmail recipe for filtering on the spamassassin X-Spam-Languages header:
:0
* ^Subject.*Jobs.*Board
{
# File non-English emails in foreign/ using the spamassassin header
# Test for not X-Spam-Languages: en
:0
* !^X-Spam-Languages: en$
foreign/
# Save english language mails in folder
:0
jobs/
}
Warning: spamassassin will occasionally provide multiple language categorisations like so:
X-Spam-Languages: en da ro
which the above recipe does not account for.
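
One possible way to cope with several codes in the header is to match `en` as a whole word anywhere in the list. The sketch below is an assumption rather than part of the tested recipe: `has_english` is a hypothetical helper that tries such a pattern on the command line with `grep -E`, using `-i` because procmail conditions are case-insensitive by default.

```shell
#!/bin/sh
# Hypothetical helper (not part of the original recipe): succeeds if "en"
# is one of the codes in an X-Spam-Languages header line read from stdin.
has_english() {
    grep -Eiq '^X-Spam-Languages:(.* )?en( |$)'
}

# Try it on the multi-language header shown above.
if printf '%s\n' 'X-Spam-Languages: en da ro' | has_english; then
    echo "contains en"
else
    echo "no en"
fi
```

If the pattern behaves as wanted, the untested equivalent condition in the recipe would be `* !^X-Spam-Languages:(.*[ ])?en([ ]|$)`. Whether a message tagged with both English and another language should be kept is a policy choice; this version keeps it.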
Spamassassin Language Categorisation Configuration

Edit /etc/spamassassin/v310.pre and uncomment the following line:
loadplugin Mail::SpamAssassin::Plugin::TextCat
Configure the plugin in /etc/spamassassin/local.cf:
ok_languages en # I understand english
inactive_languages '' # Enable all languages
add_header all Languages _LANGUAGES_
# score UNWANTED_LANGUAGE_BODY 5 # Increase score - not necessary and not recommended
This recipe was incompletely tested with spamassassin version 3.4.2.
To adapt these answers to keep a different language instead, substitute that language's name for english in the first case, and its 2 character language code for en in the second case.
Many modern email clients identify the character set of the email message, though not usually its language. If you want to discard Japanese, Chinese, Korean, and Russian messages, you could try something like
:0HB
* ^Content-type:[ ]*text/[^;]*;[ ]*charset="?(iso-2022|ks-c|gb|koi|cp-1251)
foreign
Because some clients forget to change the character set when they write in English, this is likely to produce some false positives, so I recommend saving to a folder and reviewing it periodically. The opposite problem is harder: many foreign languages use the same character set as English, and thus can't be identified like this with any reliability.
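
Before trusting a charset condition like this, it may help to try the pattern against a few header lines on the command line. The following is a minimal sketch, not the recipe itself: `foreign_charset` is a hypothetical helper, the media subtype is matched as `[^;]*`, and `grep -Ei` stands in for procmail's own case-insensitive matching.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if a Content-Type header line on stdin
# declares one of the character sets targeted by the charset recipe.
foreign_charset() {
    grep -Eiq '^Content-type:[[:space:]]*text/[^;]*;[[:space:]]*charset="?(iso-2022|ks-c|gb|koi|cp-1251)'
}

# Report how a sample header line would be classified.
for header in \
    'Content-Type: text/plain; charset="koi8-r"' \
    'Content-Type: text/plain; charset=utf-8'
do
    if printf '%s\n' "$header" | foreign_charset; then
        echo "foreign: $header"
    else
        echo "keep:    $header"
    fi
done
```

To check a saved message, something like `formail -c -X Content-Type: < message | foreign_charset` should work, assuming formail from the procmail package is installed. That only looks at the top-level header, whereas the `HB` flags in the recipe also scan the body.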