
PHP: Filter text for banned words

We run a C2C website and discourage the sale of branded products on it. We have built a database of brand words such as Nike and D&G, and an algorithm that scans product information for these words and disables any product that contains them.

Our current algorithm removes all whitespace and special characters from the provided text and then matches the result against each word in the database. The following cases must be caught, and the algorithm catches them efficiently:

  • i am nike world
  • i have n ikee shoes
  • i have nikeeshoes
  • i sell i-phone casings
  • i sell iphone-casings
  • you can have iphone

The problem is that it also catches the following:

  • rapiD Garment factory (for D&G)
  • rosNIK Electronics (for Nike)

What can be done to prevent such false matches while still catching the true cases efficiently?

EDIT

Here's the code, for those who find code easier to follow:

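$ipr_banned_keywords = array();

// Strip HTML tags, then anything that looks like an HTML entity (&...;)
$orignal_txt = preg_replace('/&.{0,}?;/', '', (strip_tags($orignal_txt)));
// Remove every non-word character, including all whitespace
$orignal_txt_nospace = preg_replace('/\W/', '', $orignal_txt);

$qry_kws = array("nike", "iphone", "d&g");
foreach ($qry_kws as $rs_kw)
{
    // The keyword gets the same normalization as the text ("d&g" becomes "dg")
    $no_space_db_kw = preg_replace('/\W/', '', $rs_kw);
    if (stristr($orignal_txt_nospace, $rs_kw))
    {
        $ipr_banned_keywords[] = strtolower($rs_kw);
    }
    else if (stristr($orignal_txt_nospace, $no_space_db_kw))
    {
        $ipr_banned_keywords[] = strtolower($rs_kw);
    }
}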

Just playing around... (not to be used in production)

$data = array(
        "i am nike world",
        "i have n ikee shoes",
        "i have nikeeshoes",
        "i sell i-phone casings",
        "i sell iphone-casings",
        "you can have iphone",
        "rapiD Garment factor",
        "rosNIK Electronics",
        "Buy you self N I K E",
        "B*U*Y I*P*H*O*N*E BABY",
        "My Phone Is not available");


$ban = array("nike","d&g","iphone");

Example 1:

$filter = new BrandFilterIterator($data);
$filter->parseBan($ban);
foreach ( $filter as $word ) {
    echo $word, PHP_EOL;
}

Output 1:

rapiD Garment factor
rosNIK Electronics
My Phone Is not available

Example 2:

$filter = new BrandFilterIterator($data,true); //reverse filter
$filter->parseBan($ban);
foreach ( $filter as $word ) {
    echo $word, " " , json_encode($word->getBan()) ,  PHP_EOL;
}

Output 2:

i am nike world ["nike"]
i have n ikee shoes ["nike"]
i have nikeeshoes ["nike"]
i sell i-phone casings ["iphone"]
i sell iphone-casings ["iphone"]
you can have iphone ["iphone"]
Buy you self N I K E ["nike"]
B*U*Y I*P*H*O*N*E BABY ["iphone"]

Class Used:

class BrandFilterIterator extends FilterIterator {
    private $words = array();
    private $reverse = false;

    function __construct(array $words, $reverse = false) {
        $this->reverse = $reverse;
        foreach ( $words as $word ) {
            $this->words[] = new Word($word);
        }
        parent::__construct(new ArrayIterator($this->words));
    }

    function parseBan(array $ban) {
        foreach ( $ban as $item ) {
            foreach ( $this->words as $word ) {
                $word->checkMetrix($item);
            }
        }
    }

    public function accept() {
        if ($this->reverse) {
            return $this->getInnerIterator()->current()->accept() ? false : true;
        }
        return $this->getInnerIterator()->current()->accept();
    }
}


class Word {
    private $ban = array();
    private $word;
    private $parts;
    private $accept = true;

    function __construct($word) {
        $this->word = $word;
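        // Drop empty parts (from repeated spaces) so checkMetrix() never divides by zero
        $this->parts = array_filter(explode(" ", $word), 'strlen');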
    }

    function __toString() {
        return $this->word;
    }

    function getTrim() {
        return preg_replace('/\W/', '', $this->word);
    }

    function accept() {
        return $this->accept;
    }

    function getBan() {
        return array_unique($this->ban);
    }

    function reject($ban = null) {
        $ban === null or $this->ban[] = $ban;
        $this->accept = false;
        return $this->accept;
    }

    function checkMetrix($ban) {
        foreach ( $this->parts as $part ) {
            $part = strtolower($part);
            $ban = strtolower($ban);
            $t = ceil(strlen(strtolower($ban)) / strlen($part) * 100);
            $s = similar_text($part, $ban, $p);
            $l = levenshtein($part, $part); // compares $part with itself, so $l is always 0
            if (ceil($p) >= $t || ($t == 100 && $p >= 75 && $l == 0)) {
                $this->reject($ban);
            }
        }
        // Detect Bad Use of space
        if (ceil(strlen($this->getTrim()) / strlen($this->word) * 100) < 75) {
            if (stripos($this->getTrim(), $ban) !== false) {
                $this->reject($ban);
            }
        }
        return $this->accept;
    }
}

Simple: do the brand match before you remove spaces and special characters. Then it won't match these weird edge cases.
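One way to read this suggestion, sketched under assumptions: match against the original text, require the brand's first letter to start a word, and tolerate separators between the remaining letters. The function names `brandPattern` and `findBrands` are hypothetical, and the sketch assumes PHP 7.4 or later. It is not a complete filter: a phrase such as "I phone my mom" would still match "iphone".

```php
<?php
// Hypothetical helper: build a regex for one brand that is applied to the
// ORIGINAL text, before any characters are stripped.
function brandPattern(string $brand): string
{
    // Keep only the brand's letters and digits ("d&g" becomes "dg").
    $chars = str_split(preg_replace('/[^a-z0-9]/i', '', $brand));
    $quoted = array_map(fn($c) => preg_quote($c, '/'), $chars);
    // (?<![a-z0-9]) : the match may not begin in the middle of a word.
    // [\W_]*        : any run of spaces or punctuation between two letters.
    return '/(?<![a-z0-9])' . implode('[\W_]*', $quoted) . '/i';
}

// Returns the lower-cased brands found in $text.
function findBrands(string $text, array $brands): array
{
    $hits = [];
    foreach ($brands as $brand) {
        if (preg_match(brandPattern($brand), $text) === 1) {
            $hits[] = strtolower($brand);
        }
    }
    return $hits;
}

$brands = ["nike", "d&g", "iphone"];
print_r(findBrands("i have n ikee shoes", $brands));   // contains "nike"
print_r(findBrands("rosNIK Electronics", $brands));    // empty array
```

Because the leading letter must sit at a word start, "rapiD Garment factory" and "rosNIK Electronics" no longer match, while spaced or punctuated spellings such as "i-phone" and "N I K E" still do.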

You already know this, but it's worth saying explicitly: your current algorithm is completely inadequate for the task. It can't deal with even simple cases, let alone cases where people deliberately try to get past your filter. There's only one thing you can do with your current filter, and that's throw it away completely; it can't be made to work.

While we aren't discussing an obscenity filter here, it is much the same concept, so you would be well advised to read up on some of the worst mistakes made by obscenity filters.

These articles mostly deal with false positives, i.e. cases where the filter matches something it shouldn't and thus blocks a legitimate entry. This sort of thing can be very damaging: it will upset your customers, and if it happens a lot it will drive people away from your site. The complexities of natural language make it almost inevitable.

You also need to be aware of false negatives, where your filter fails to pick up something that it should. Your problem here is that spammers have a massive arsenal of techniques for getting past filters. Your current filter would be trivial to get past, but even the most advanced filters can be defeated; check how much spam reaches your inbox for evidence of this. Spammers also change their techniques all the time, so a static algorithm simply isn't going to work in the long term.

A Bayesian filter would seem to be the best solution for you. These are filters that learn as they go. You need to keep an eye on them and train them to recognise what needs to be filtered, so it'll be a bit of work to set up, but I doubt you'll find a workable solution any other way.
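To make the idea of a filter that learns concrete, here is a minimal naive Bayes sketch. Everything in it is an assumption for illustration: the class name `NaiveBayesSketch`, the two labels `ok` and `banned`, the simple word tokenizer, and the tiny training set. It assumes PHP 7.4 or later. A real deployment would need far more training data and a tokenizer that copes with obfuscated spellings.

```php
<?php
// Illustrative naive Bayes text classifier with two labels: 'ok' and 'banned'.
class NaiveBayesSketch
{
    private array $tokenCounts = ['ok' => [], 'banned' => []];
    private array $totals = ['ok' => 0, 'banned' => 0];   // tokens seen per label
    private array $docs = ['ok' => 0, 'banned' => 0];     // listings seen per label
    private array $vocab = [];

    private function tokenize(string $text): array
    {
        return preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    }

    // Feed one listing that a human has already labelled.
    public function train(string $text, string $label): void
    {
        $this->docs[$label]++;
        foreach ($this->tokenize($text) as $token) {
            $this->tokenCounts[$label][$token] = ($this->tokenCounts[$label][$token] ?? 0) + 1;
            $this->totals[$label]++;
            $this->vocab[$token] = true;
        }
    }

    // Return the label with the higher log-probability for $text.
    public function classify(string $text): string
    {
        $best = 'ok';
        $bestScore = -INF;
        $docTotal = array_sum($this->docs);
        $vocabSize = count($this->vocab);
        foreach (['ok', 'banned'] as $label) {
            // Smoothed log prior for the label.
            $score = log(($this->docs[$label] + 1) / ($docTotal + 2));
            foreach ($this->tokenize($text) as $token) {
                $count = $this->tokenCounts[$label][$token] ?? 0;
                // Add-one smoothing, with one extra slot for unseen tokens.
                $score += log(($count + 1) / ($this->totals[$label] + $vocabSize + 1));
            }
            if ($score > $bestScore) {
                $bestScore = $score;
                $best = $label;
            }
        }
        return $best;
    }
}

$nb = new NaiveBayesSketch();
$nb->train("nike running shoes brand new", 'banned');
$nb->train("original iphone case", 'banned');
$nb->train("handmade cotton garment", 'ok');
$nb->train("rapid garment factory outlet", 'ok');
echo $nb->classify("cheap nike shoes"), PHP_EOL;
echo $nb->classify("garment factory"), PHP_EOL;
```

Each moderator decision can be passed back through `train()`, which is the "keep an eye on them and train them" part of the suggestion.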

Here's just an idea.

Why don't you do the matching first, and if a product hits the "branded" filter, put it in a review queue for you to accept or decline, with the matches highlighted for easy discovery?

Humans can tell almost immediately, and accurately, whether a brand is really being used. You could even turn this into machine learning, who knows :)
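A rough sketch of that review step follows. The function name `flagForReview`, the status strings, and the `**` highlight markers are all hypothetical. The match is deliberately a loose substring test, since a human makes the final call.

```php
<?php
// Hypothetical review step: instead of disabling a listing outright, mark the
// suspicious spans and hand the listing to a human moderator.
function flagForReview(string $text, array $brands): array
{
    $highlighted = $text;
    $matched = [];
    foreach ($brands as $brand) {
        $pattern = '/' . preg_quote($brand, '/') . '/i';
        if (preg_match($pattern, $text) === 1) {
            $matched[] = $brand;
            // Wrap every occurrence in ** so the reviewer spots it quickly.
            $highlighted = preg_replace($pattern, '**$0**', $highlighted);
        }
    }
    return [
        'status'      => $matched ? 'pending_review' : 'published',
        'matched'     => $matched,
        'highlighted' => $highlighted,
    ];
}

print_r(flagForReview("I sell Nike shoes", ["nike", "d&g"]));
```

Listings with no hit are published directly, so only the suspicious ones cost moderator time.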

Having said that, this is not a regex problem and can't be solved by nifty expressions; the system needs to be trained, to remember hits (increasing its confidence) and to learn from misses.
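One simple way to remember hits and learn from misses is to keep per-keyword counters fed by moderator decisions. The sketch below is an assumption, not a prescribed design: the class name `KeywordFeedback`, the smoothing formula, and the 0.9 threshold are all illustrative, and it assumes PHP 7.4 or later.

```php
<?php
// Hypothetical feedback store: moderators confirm or reject each hit, and the
// filter's trust in a keyword follows their decisions.
class KeywordFeedback
{
    private array $confirmed = [];  // hits a moderator judged to be real brand use
    private array $rejected = [];   // hits a moderator judged to be false matches

    public function record(string $keyword, bool $wasRealBrandUse): void
    {
        if ($wasRealBrandUse) {
            $this->confirmed[$keyword] = ($this->confirmed[$keyword] ?? 0) + 1;
        } else {
            $this->rejected[$keyword] = ($this->rejected[$keyword] ?? 0) + 1;
        }
    }

    // Smoothed share of confirmed hits; 0.5 while nothing is known yet.
    public function confidence(string $keyword): float
    {
        $hits = $this->confirmed[$keyword] ?? 0;
        $misses = $this->rejected[$keyword] ?? 0;
        return ($hits + 1) / ($hits + $misses + 2);
    }

    // Block automatically only once the keyword has earned enough trust.
    public function decide(string $keyword, float $autoBlockAt = 0.9): string
    {
        return $this->confidence($keyword) >= $autoBlockAt ? 'block' : 'review';
    }
}

$feedback = new KeywordFeedback();
$feedback->record("nike", true);
$feedback->record("dg", false);
echo $feedback->decide("nike"), PHP_EOL;  // still "review" after a single confirmation
```

A keyword that keeps producing false matches stays in the review path, while one that is confirmed repeatedly can eventually be blocked without human input.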
