How do I categorise non-English email using procmail and command-line tools?

I am subscribed to a mailing list where some of the messages are in languages other than English, which I cannot understand.

How do I filter the non-English messages to /dev/null using procmail and/or command-line tools?

I use procmail to filter my email, so ideally any other tool would be one I can call from a procmail recipe.

I'd prefer not to have to train my own language models.

One way is to use the Perl TextCat package from Gertjan van Noord.

The text_cat script outputs the most likely language for the mail. This recipe assumes text_cat has been installed under /usr/local/bin.

Here is a simple procmail recipe to call the text_cat script:

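:0
* ^Subject.*Jobs.*Board
{
    # Feed the message to text_cat and capture its guess
    LANG_=`/usr/local/bin/text_cat`

    # Discard anything text_cat does not classify as english
    :0
    * ! LANG_ ?? ^english$
    /dev/null

    # Everything else goes to the jobs folder
    :0
    jobs/
}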

I've been running text_cat for a few years. No non-English messages have been classified as English in that time, that is, no false positives. I've not been rigorous about checking for false negatives (English messages classified as another language and therefore discarded).
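One way to spot-check classifications would be to run text_cat by hand on a saved message. The sketch below is only an illustration: classify_body is a hypothetical helper, the file name is whatever you pass in, and dropping the headers before classification is my own assumption (the procmail recipe above feeds the whole message to text_cat).

```shell
#!/bin/sh
# Path taken from the recipe above; set TEXT_CAT to override it.
TEXT_CAT=${TEXT_CAT:-/usr/local/bin/text_cat}

# Hypothetical helper: print text_cat's guess for the body of one saved message.
# Assumption: headers (everything up to the first blank line) are dropped first.
classify_body() {
    sed '1,/^$/d' "$1" | "$TEXT_CAT"
}

# Classify the file given as the first argument, if any.
if [ "$#" -gt 0 ]; then
    classify_body "$1"
fi
```

Saved as, say, classify.sh, it would be run as sh classify.sh followed by the path of a single message file.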


A second way, as mentioned by tripleee in a comment, is to use the language categorisation provided by spamassassin, which also uses the text_cat script. Spamassassin will unwrap any MIME transfer encodings, which the vanilla text_cat version above won't.

Here is an incompletely tested procmail recipe for filtering on the spamassassin X-Spam-Languages header:

:0
* ^Subject.*Jobs.*Board
{
    # File non-english emails in a separate folder using the spamassassin header
    # Test for not X-Spam-Languages: en
    :0
    * !^X-Spam-Languages: en$
    foreign/

    # Save english language mails in folder
    :0
    jobs/
}

Warning: spamassassin will occasionally provide multiple language categorisations, like so:

X-Spam-Languages: en da ro

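which the above recipe does not account for: its condition only recognises a header whose sole value is en, so a message tagged like this is filed under foreign/.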
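A possible way to handle the multi-language case is to accept the header whenever en appears anywhere in the list of codes. The shell sketch below only illustrates that matching logic with grep; has_english is a hypothetical helper, not part of procmail or spamassassin, and the sample headers are made up.

```shell
#!/bin/sh
# Hypothetical helper: succeeds when "en" is one of the space-separated
# codes in the X-Spam-Languages header line passed as the first argument.
has_english() {
    printf '%s\n' "$1" |
        grep -Eiq '^X-Spam-Languages:(.*[[:space:]])?en([[:space:]]|$)'
}

# Made-up sample headers to show which ones would be kept.
for header in 'X-Spam-Languages: en' \
              'X-Spam-Languages: en da ro' \
              'X-Spam-Languages: da ro'
do
    if has_english "$header"; then
        echo "keep:   $header"
    else
        echo "reject: $header"
    fi
done
```

As far as I know procmail's regular expressions do not support [[:space:]], so a procmail condition built on the same idea would need a literal space and tab inside the brackets. I have not tested that variant.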

Spamassassin Language Categorisation Configuration

Edit /etc/spamassassin/v310.pre and uncomment the following line:

loadplugin Mail::SpamAssassin::Plugin::TextCat

Configure the plugin in /etc/spamassassin/local.cf:

ok_languages en       # I understand english
inactive_languages '' # Enable all languages
add_header all Languages _LANGUAGES_
# score UNWANTED_LANGUAGE_BODY 5 # Increase score - not necessary and not recommended 

This recipe was incompletely tested with spamassassin version 3.4.2.


To adapt these answers to keep a different language instead of English, substitute that language's name for english in the first recipe and its two-character language code for en in the second.

Many modern email clients identify the character set of the email message, though not usually its language. If you want to discard Japanese, Chinese, Korean, and Russian messages, you could try something like

# The brackets in the condition hold a space and a tab.
:0HB
* ^Content-type:[ 	]*text/[^;]*;[ 	]*charset="?(iso-2022|ks-c|gb|koi|cp-1251)
foreign

Because some clients forget to change the character set when they write in English, this is likely to produce some false positives, so I recommend saving to a folder and reviewing it periodically. The opposite problem is harder; many foreign languages use the same character set as English, and thus can't be identified like this with any reliability.
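Before trusting a charset rule on live mail, it may help to try the pattern on a few headers from the command line. This sketch uses grep with what I take to be the intended pattern, where [^;]* skips the MIME subtype. has_foreign_charset is a hypothetical helper and the sample headers are made up; it will not catch a charset parameter folded onto a continuation line.

```shell
#!/bin/sh
# Hypothetical helper: reads a message (or just its headers) on standard
# input and succeeds if a text part declares one of the listed charsets.
has_foreign_charset() {
    grep -Eiq '^Content-type:[[:space:]]*text/[^;]*;[[:space:]]*charset="?(iso-2022|ks-c|gb|koi|cp-1251)'
}

# Made-up sample headers for illustration.
if printf 'Content-Type: text/plain; charset="koi8-r"\n' | has_foreign_charset; then
    echo "koi8-r: would go to foreign"
fi
if ! printf 'Content-Type: text/plain; charset=utf-8\n' | has_foreign_charset; then
    echo "utf-8: not matched"
fi
```

To check a saved message, redirect it into the function, for example has_foreign_charset < message.eml, where message.eml is a placeholder file name.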
