
Bad-words filter with special characters

I am using https://www.npmjs.com/package/bad-words and I created a regex so that the filter handles special characters.

const Filter = require('bad-words');
const badWordsFilter = new Filter({replaceRegex:  /[A-Za-z0-9öÖÇçŞşĞğİıÜü_]/g});
badWordsFilter.addWords(['badword', 'şğ'])

If the word doesn't contain a Turkish character, it works. But if I write a Turkish character like ş or ğ, the word is not filtered.

Is my regex wrong?

I found this code in the documentation:

var filter = new Filter({ regex: /\*|\.|$/gi });
var filter = new Filter({ replaceRegex:  /[A-Za-z0-9가-힣_]/g }); 
//multilingual support for word filtering
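
To see why the character class matters, here is a minimal sketch in plain JavaScript that does not depend on bad-words. It assumes (this is an assumption about the library, not something confirmed above) that replaceRegex is applied through String.prototype.replace to swap each matching character of a flagged word for a placeholder such as '*'. The helper name mask is hypothetical.

```javascript
// Hypothetical helper: replace every character matched by `replaceRegex` with '*'.
// This only imitates what a replaceRegex option plausibly does; it is not bad-words code.
function mask(word, replaceRegex) {
  return word.replace(replaceRegex, '*');
}

const asciiOnly = /[A-Za-z0-9_]/g;
const withTurkish = /[A-Za-z0-9öÖÇçŞşĞğİıÜü_]/g;

console.log(mask('şğab', asciiOnly));   // şğ**
console.log(mask('şğab', withTurkish)); // ****
```

With the ASCII-only class the Turkish letters are left untouched, which is why the class has to list them explicitly.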

You obviously have an encoding problem, since your regex works outside your app; see here: https://regex101.com/r/VpItfH/3/ .

So I think encoding the characters of your regex inside your app may help:

See the encoded regex result here: https://regex101.com/r/VpItfH/4/


More details

Trying the following encoded regex in a PCRE regex engine works ( https://regex101.com/r/VpItfH/5 ):

/[A-Za-z0-9\x{f6}\x{d6}\x{c7}\x{e7}\x{15e}\x{15f}\x{11e}\x{11f}\x{130}\x{131}\x{dc}\x{fc}_]/g

But when you select a JavaScript regex engine, the braces { } break the escape, so you need to remove them, and if the character is still not recognized, replace \x with \u0 (JavaScript's \x takes exactly two hex digits and \u exactly four). E.g. \x{15e} becomes \u015e.

Then you get the same matches as with /[A-Za-z0-9öÖÇçŞşĞğİıÜü_]/g .
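
As a rough check of that claim, the sketch below writes the same character class twice, once with literal letters and once with \u escapes, and compares what each one matches. The sample string and variable names are made up for illustration, and the file is assumed to be saved as UTF-8.

```javascript
// The same character class, once with literal letters and once with \u escapes.
const literalClass = /[A-Za-z0-9öÖÇçŞşĞğİıÜü_]/g;
const escapedClass = /[A-Za-z0-9\u00f6\u00d6\u00c7\u00e7\u015e\u015f\u011e\u011f\u0130\u0131\u00dc\u00fc_]/g;

const sampleText = 'şğ İstanbul çay!';
const literalHits = sampleText.match(literalClass).join('');
const escapedHits = sampleText.match(escapedClass).join('');

console.log(literalHits);                 // şğİstanbulçay
console.log(literalHits === escapedHits); // true
```

The escaped form contains only ASCII characters, so it keeps working even if the source file is later read with the wrong encoding.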

Note: to get the Unicode form of a character, you can do "Ğ".charCodeAt(0).toString(16); and prefix the result with \x or \u0 .
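
The note above can be turned into a small helper. This is only a sketch: toUnicodeEscape and buildCharClass are hypothetical names, and charCodeAt(0) covers only characters in the Basic Multilingual Plane, which is enough for the Turkish letters here.

```javascript
// Hypothetical helper: turn one BMP character into its \uXXXX escape.
function toUnicodeEscape(ch) {
  return '\\u' + ch.charCodeAt(0).toString(16).padStart(4, '0');
}

// Hypothetical helper: build a character-class regex from a string of extra letters.
function buildCharClass(extraLetters) {
  const escapes = Array.from(extraLetters, (ch) => toUnicodeEscape(ch)).join('');
  return new RegExp('[A-Za-z0-9' + escapes + '_]', 'g');
}

console.log(toUnicodeEscape('Ğ')); // \u011e

const turkishClass = buildCharClass('öÖÇçŞşĞğİıÜü');
console.log('şğ'.replace(turkishClass, '*')); // **
```

Padding to four digits avoids having to decide by hand between the \x and \u0 prefixes.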

Hope this helps, and at least shows that you can encode characters inside a regex and still match the same text. :)

Can you please try with:

var filter = new Filter({ replaceRegex: /(\w+)/gi });

You definitely have to use the replaceRegex option.


The pattern matches every run of word characters, case-insensitively.

Here is what /(\w+)/gi does, step by step (thanks to regex101):

  1. 1st Capturing Group (\w+).
    1. \w+ matches any word character (equal to [a-zA-Z0-9_])
    2. + Quantifier: matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
  2. Global pattern flags
    1. i modifier: insensitive. Case-insensitive match (ignores case of [a-zA-Z])
    2. g modifier: global. All matches (don't return after first match)
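
A quick way to check the breakdown above is to run the pattern in plain JavaScript. The sketch below uses made-up sample strings. Since \w is equal to [a-zA-Z0-9_], it does not cover letters such as ş, ğ or İ.

```javascript
const wordPattern = /(\w+)/gi;

console.log('badword'.match(wordPattern));  // [ 'badword' ]
console.log('şğ'.match(wordPattern));       // null
console.log('İstanbul'.match(wordPattern)); // [ 'stanbul' ]
```

So a class built on \w alone would probably still need the Turkish letters added explicitly.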

You need to make that regular expression Unicode-aware by adding the u flag to it. More precisely, change /[A-Za-z0-9öÖÇçŞşĞğİıÜü_]/g into /[A-Za-z0-9öÖÇçŞşĞğİıÜü_]/gu (added a u at the end). This will work only in modern browsers (basically, all but Internet Explorer), though. There are other options as well that you may want to consider if you want to support older browsers.
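
A short sketch of what the u flag enables, assuming a runtime with ES2018 regex support (for example a recent Node.js). The \p{L} class is an alternative added here for illustration and is not part of the answer above. Variable names are made up.

```javascript
// With the u flag, \u{...} code point escapes and \p{...} property escapes are available.
const turkishSpecials = /[\u{15e}\u{15f}\u{11e}\u{11f}]/gu;
const anyLetterOrDigit = /[\p{L}\p{N}_]/gu;

console.log('şğ'.replace(turkishSpecials, '*'));    // **
console.log('şğ42'.replace(anyLetterOrDigit, '*')); // ****
```

The \u{15e} form is close to the PCRE \x{15e} notation, so with the u flag the braces can stay.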

Encode your JavaScript file as UTF-8 and update your meta tag to:

<meta http-equiv="content-type" content="text/html;charset=utf-8" />

Hope this helps.

