
Easiest Way to Check if a Java String Instance Might Hold Spam Data

I have a process that iterates over String instances. Each iteration performs a few operations on the String instance. At the end, the String instance is persisted.

Now I want to add a check to each iteration for whether the String instance might be spam. I only have to verify that the String instance is not "adult material" spam.

Any recommendations?

This is a very hard problem that the industry is constantly trying to solve. Your best option is to use an existing solution such as Classifier4J, along with a blacklist data source, to identify spam.

You need to apply some Bayesian logic, which is, among other things, what the Classifier4J library that Andrew mentioned does under the covers.

Paul Graham wrote a good article about this a few years back: http://www.paulgraham.com/spam.html
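To make the "Bayesian logic" a bit more concrete, here is a minimal sketch of a multinomial naive Bayes filter with add-one smoothing. It is not Classifier4J's implementation and not the exact formula from the article; the class name `TinyBayesFilter`, its methods, and the tokenizer are all made up for illustration, and a usable filter would need a large labeled training set.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration class; not part of any library.
class TinyBayesFilter {
    private final Map<String, Integer> spamCounts = new HashMap<>();
    private final Map<String, Integer> hamCounts = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int spamTokens = 0;
    private int hamTokens = 0;
    private int spamDocs = 0;
    private int hamDocs = 0;

    // Assumption: lower-cased runs of letters and digits are good enough tokens.
    private static String[] tokenize(String text) {
        return text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+");
    }

    // Record the tokens of one labeled example.
    public void train(String text, boolean isSpam) {
        if (isSpam) spamDocs++; else hamDocs++;
        for (String token : tokenize(text)) {
            if (token.isEmpty()) continue;
            vocabulary.add(token);
            if (isSpam) {
                spamCounts.merge(token, 1, Integer::sum);
                spamTokens++;
            } else {
                hamCounts.merge(token, 1, Integer::sum);
                hamTokens++;
            }
        }
    }

    // Estimated P(spam | text), computed in log space to avoid underflow.
    public double spamProbability(String text) {
        if (spamDocs == 0 || hamDocs == 0) {
            throw new IllegalStateException("train with both spam and non-spam first");
        }
        int v = Math.max(1, vocabulary.size());
        double total = spamDocs + hamDocs;
        double logSpam = Math.log(spamDocs / total);
        double logHam = Math.log(hamDocs / total);
        for (String token : tokenize(text)) {
            if (token.isEmpty()) continue;
            // Add-one (Laplace) smoothing so unseen tokens do not zero out a class.
            logSpam += Math.log((spamCounts.getOrDefault(token, 0) + 1.0) / (spamTokens + v));
            logHam += Math.log((hamCounts.getOrDefault(token, 0) + 1.0) / (hamTokens + v));
        }
        return 1.0 / (1.0 + Math.exp(logHam - logSpam));
    }
}
```

In the loop from the question, you would train the filter once up front and then compare `spamProbability(s)` against a threshold of your choosing before persisting each String. Where to put that threshold depends on how costly false positives are for you.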

You could try writing your own classifier, but if you have guaranteed network access, how about just using Akismet and its Java bindings? It's pretty good at finding spam.

You'll need to take the network connectivity and licensing into consideration.

The easiest way is simply to check against known spam words. The problem is that it's easy to get false positives with words that mean different things in different contexts. You either need to hand-pick the word list and include only words that have no legitimate use, or opt for a more heavyweight solution.
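A rough sketch of the word-list check might look like the following. The class name `WordListCheck` and the contents of `BLOCKED` are placeholders I made up; you would substitute your own hand-picked list. It matches whole tokens rather than substrings, which avoids one class of false positives (a blocked word buried inside a longer, innocent word) but does nothing about words whose meaning depends on context.

```java
import java.util.Locale;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical illustration class; not part of any library.
class WordListCheck {
    // Placeholder entries only; a real list must be hand-picked.
    private static final Set<String> BLOCKED = Set.of("porn", "xxx");

    // Split on anything that is not a letter or a digit.
    private static final Pattern SEPARATORS = Pattern.compile("[^\\p{L}\\p{Nd}]+");

    // Returns true if any whole token of the text is on the blocked list.
    public static boolean mightBeSpam(String text) {
        if (text == null || text.isEmpty()) {
            return false;
        }
        for (String token : SEPARATORS.split(text.toLowerCase(Locale.ROOT))) {
            if (BLOCKED.contains(token)) {
                return true;
            }
        }
        return false;
    }
}
```

Inside the iteration you would call `WordListCheck.mightBeSpam(s)` just before persisting and skip or flag the String when it returns true. Note that spammers routinely obfuscate words (extra punctuation, digit substitutions), which this simple check will not catch.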
