
Partial matching a string against a regex

Suppose that I have this regular expression: /abcd/. Suppose that I want to check the user input against that regex and disallow entering invalid characters in the input. When the user inputs "ab", it fails as a match for the regex, but I can't disallow entering "a" and then "b", since the user can't enter all 4 characters at once (except by copy/paste). So what I need here is a partial match, which checks whether an incomplete string can potentially become a match for a regex.

Java has something for this purpose: .hitEnd() (described here: http://glaforge.appspot.com/article/incomplete-string-regex-matching ). Python doesn't do it natively, but has this package that does the job: https://pypi.python.org/pypi/regex .

I didn't find any solution for it in JS. It was asked years ago: "Javascript RegEx partial match", and even before that: "Check if string is a prefix of a Javascript RegExp".

PS: The regex is custom; suppose that the user enters the regex herself and then tries to enter a text that matches that regex. The solution should be a general one that works for regexes entered at runtime.

Looks like you're lucky, I've already implemented that stuff in JS (which works for most patterns - maybe that'll be enough for you). See my answer here. You'll also find a working demo there.

There's no need to duplicate the full code here, I'll just state the overall process:

  • Parse the input regex, and perform some replacements. There's no need for error handling, as you can't have an invalid pattern in a RegExp object in JS.
  • Replace abc with (?:a|$)(?:b|$)(?:c|$)
  • Do the same for any "atoms". For instance, a character group [ac] would become (?:[ac]|$)
  • Keep anchors as-is
  • Keep negative lookaheads as-is
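
The steps above can be sketched roughly as follows. This is only a minimal illustration and not the code from the linked answer: the name toPartialRegex is made up here, and the sketch assumes the pattern is a flat sequence of atoms (literal characters, escapes such as \d, character classes), each optionally followed by *, + or ?. Anything else (groups, alternation, lookaheads, {n,m} quantifiers, backreferences, anchors inside the pattern) is simply rejected.

```javascript
// Hypothetical helper: rewrites every atom X into (?:X|$) so that the
// input may stop at any point. Only a flat sequence of atoms is supported.
function toPartialRegex(source) {
  // One atom: an escape (but not a digit escape such as \1),
  // a character class, or a single ordinary character.
  const atom = /\\[^0-9]|\[(?:\\.|[^\]\\])*\]|[^\\[\]()|{}^$*+?]/y;
  const quantifier = /[*+?]/y;
  let out = '';
  let i = 0;
  while (i < source.length) {
    atom.lastIndex = i;
    const a = atom.exec(source);
    if (!a) throw new SyntaxError(`unsupported construct at index ${i}`);
    i = atom.lastIndex;
    quantifier.lastIndex = i;
    const q = quantifier.exec(source);
    if (q) i = quantifier.lastIndex;
    out += `(?:${a[0]}${q ? q[0] : ''}|$)`;
  }
  // The whole input must be consumed, hence the surrounding anchors.
  return new RegExp(`^${out}$`);
}

const partial = toPartialRegex('abcd');
console.log(partial.test('ab'));   // true: "ab" can still become "abcd"
console.log(partial.test('abx'));  // false: "x" can never be accepted here
```

A complete match is still checked with the original regex; the transformed one only answers "can this input still become a match?".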

Had JavaScript had more advanced regex features, this transformation might not have been possible. But with its limited feature set, it can handle most input regexes. It will yield incorrect results on regexes with backreferences, though, if your input string ends in the middle of a backreference match (like matching ^(\w+)\s+\1$ against hello hel).

I think that you have to have 2 regexes: one for typing, /a?b?c?d?/, and one for testing at the end, on paste or when leaving the input, /abcd/.

This will test for a valid phone number:

 const input = document.getElementById('input')
 let oldVal = ''
 // While typing: keep only values that can still grow into a valid number.
 input.addEventListener('keyup', e => {
   if (/^\d{0,3}-?\d{0,3}-?\d{0,3}$/.test(e.target.value)) {
     oldVal = e.target.value
   } else {
     e.target.value = oldVal
   }
 })
 // On leaving the field: require the complete format.
 input.addEventListener('blur', e => {
   console.log(/^\d{3}-?\d{3}-?\d{3}-?$/.test(e.target.value) ? 'valid' : 'not valid')
 })

 <input id="input">

And this is the case for a first name and surname:

 const input = document.getElementById('input')
 let oldVal = ''
 // While typing: every part of the name is optional.
 input.addEventListener('keyup', e => {
   if (/^[A-Z]?[a-z]*\s*[A-Z]?[a-z]*$/.test(e.target.value)) {
     oldVal = e.target.value
   } else {
     e.target.value = oldVal
   }
 })
 // On leaving the field: require two capitalized words.
 input.addEventListener('blur', e => {
   console.log(/^[A-Z][a-z]+\s+[A-Z][a-z]+$/.test(e.target.value) ? 'valid' : 'not valid')
 })

 <input id="input">

As many have stated, there is no standard library for this. Fortunately, I have written a JavaScript implementation that does exactly what you require. With some minor limitations, it works for regular expressions supported by JavaScript. See: incr-regex-package.

Further, there is also a React component that uses this capability to provide some useful features:

  1. Check input as you type
  2. Auto-complete where possible
  3. Make suggestions for possible input values

Demo of the capabilities. Demo of use.

This is the hard solution for those who think there's no solution at all: implement the Python version ( https://bitbucket.org/mrabarnett/mrab-regex/src/4600a157989dc1671e4415ebe57aac53cfda2d8a/regex_3/regex/_regex.c?at=default&fileviewer=file-view-default ) in JS. So it is possible. If someone has a simpler answer, he'll win the bounty.

Example using the Python module (regular expression with a backreference):

$ pip install regex
$ python
>>> import regex
>>> regex.Regex(r'^(\w+)\s+\1$').fullmatch('abcd ab',partial=True)
<regex.Match object; span=(0, 7), match='abcd ab', partial=True>

You guys would probably find this page of interest:

( https://github.com/desertnet/pcre )

It was a valiant effort: make a WebAssembly implementation that would support PCRE. I'm still playing with it, but I suspect it's not practical. The WebAssembly binary weighs in at ~300K, and if your JS terminates unexpectedly, you can end up not destroying the module, and consequently leaking significant memory.

The bottom line is: this is clearly something the ECMAScript people should be formalizing, and browser manufacturers should be furnishing (kudos to the WebAssembly developer for possibly shaming them into getting on the stick...)

I recently tried using the "pattern" attribute of an input[type='text'] element. Like so many others, I found it a letdown that it would not validate until a form was submitted. So a person would waste their time typing (or pasting...) numerous characters and jumping on to other fields, only to find out after a form submit that they had entered that field wrong. Ideally, I wanted it to validate field input immediately, as the user types each key (or at the time of a paste...)

The trick to doing a partial regex match (until the ECMAScript people and browser makers get it together with PCRE...) is to specify not only a pattern regex, but also associated template value(s) as a data attribute. If your field input is shorter than the pattern (or input.maxLength...), it can use them as a suffix for validation purposes. YES - this will not be practical for regexes with complex case outcomes; but for fixed-position template pattern matching - which is USUALLY what is needed - it's fine (if you happen to need something more complex, you can build on the methods shown in my code...)

The example is for a bitcoin address [Do I have your attention now? - OK, not the people who don't believe in digital currency tech...] The key JS function that gets this done is validatePattern. The input element in the HTML markup would be specified like this:

<input id="forward_address"
       name="forward_address"
       type="text"
       maxlength="90"
       pattern="^(bc(0([ac-hj-np-z02-9]{39}|[ac-hj-np-z02-9]{59})|1[ac-hj-np-z02-9]{8,87})|[13][a-km-zA-HJ-NP-Z1-9]{25,34})$"
       data-entry-templates="['bc099999999999999999999999999999999999999999999999999999999999','bc1999999999999999999999999999999999999999999999999999999999999999999999999999999999999999','19999999999999999999999999999999999']"
       onkeydown="return validatePattern(event)"
       onpaste="return validatePattern(event)"
       required
/>

[Credit goes to this post: "RegEx to match Bitcoin addresses?" Note to old-school bitcoin zealots who will decry the use of a zero in the regex here - it's just an example for accomplishing PRELIMINARY validation; the server accepting the address passed along by the browser can do an RPC call after a form post, to validate it much more rigorously. Adjust your regex to suit.]

The exact choice of characters in the data-entry-templates was a bit arbitrary; but they had to be such that, if the input being typed or pasted by the user is still incomplete in length, the validator can use them as an optimistic stand-in and the input so far will still be considered valid. In the example there, the last of the data-entry-templates ('19999999999999999999999999999999999') is a "1" followed by a run of nines as long as the regex allows (the quantifier "{25,34}" in the pattern above dictates a maximum of 34 characters in the second character span/group...) Because there were two forms to expect - the "bc" prefix and the older "1"/"3" prefix - I furnished a few stand-in templates for the validator to try (if the input passes with just one of them, it validates...) In each template case, I furnished the longest possible pattern, so as to ensure the most permissive possibility in terms of length.
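
To make the stand-in idea concrete, here is a minimal sketch of just the padding-and-test step. It is not the validatePattern function from the linked repository: the name isViablePrefix and the way the templates are built with repeat() are assumptions made for this illustration, and the regex is the one from the markup above.

```javascript
// Pattern copied from the "pattern" attribute in the markup above.
const BTC_PATTERN =
  /^(bc(0([ac-hj-np-z02-9]{39}|[ac-hj-np-z02-9]{59})|1[ac-hj-np-z02-9]{8,87})|[13][a-km-zA-HJ-NP-Z1-9]{25,34})$/;

// Longest acceptable value for each form, filled with an arbitrary
// character ("9") that every character class in the pattern accepts.
const BTC_TEMPLATES = [
  'bc0' + '9'.repeat(59),
  'bc1' + '9'.repeat(87),
  '1' + '9'.repeat(34),
];

// Hypothetical helper: the input is viable if, once padded with the tail
// of at least one template, it matches the full pattern.
function isViablePrefix(input, pattern, templates) {
  return templates.some((template) => {
    if (input.length > template.length) return false;
    const candidate = input + template.slice(input.length);
    return pattern.test(candidate);
  });
}

console.log(isViablePrefix('bc1q', BTC_PATTERN, BTC_TEMPLATES)); // true
console.log(isViablePrefix('bc1b', BTC_PATTERN, BTC_TEMPLATES)); // false: "b" is not allowed after "bc1"
```

The pattern must not carry the g or y flag here, since test() would then keep state between calls.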

If you were generating this markup on a dynamic web content server, an example with template variables (a la django...) would be:

 <input id="forward_address"
        name="forward_address"
        type="text"
        maxlength="{{MAX_BTC_ADDRESS_LENGTH}}"
        pattern="{{BTC_ADDRESS_REGEX}}" {# base58... #}
        data-entry-templates="{{BTC_ADDRESS_TEMPLATES}}" {# base58... #}
        onkeydown="return validatePattern(event)"
        onpaste="return validatePattern(event)"
        required
/>

[Keep in mind: I went to the deeper end of the pool here. You could just as well use this for simpler patterns of validation.]

And if you prefer not to use event attributes, but to transparently hook the function to the element's events at document load - knock yourself out.

You will note that we need to specify validatePattern on more than one event (the markup above hooks these two):

  • The keydown, to intercept delete and backspace keys.

  • The paste (the clipboard is pasted into the field's value, and if it works, it is accepted as valid; if not, the paste does not transpire...)

Of course, I also took into account the case where text is partially selected in the field, which dictates that a key entry or pasted text will replace the selected text.
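
As a rough sketch of that selection handling (again, not the code from the linked repository; valueAfterEdit and the shape of its edit argument are made up for this illustration), the value the field would have after the pending key or paste can be computed first and then validated, so that the event is cancelled only if that prospective value fails:

```javascript
// Hypothetical helper: returns the value the field WOULD have after the edit.
// value: current field value; start/end: selection range, as given by
// input.selectionStart / input.selectionEnd; edit: what the user is doing.
function valueAfterEdit(value, start, end, edit) {
  if (edit.type === 'insert') {
    // Typed or pasted text replaces the selection (or is inserted at the caret).
    return value.slice(0, start) + edit.text + value.slice(end);
  }
  if (start !== end) {
    // Backspace/Delete with a selection just removes the selection.
    return value.slice(0, start) + value.slice(end);
  }
  if (edit.type === 'backspace') {
    return value.slice(0, Math.max(0, start - 1)) + value.slice(start);
  }
  if (edit.type === 'delete') {
    return value.slice(0, start) + value.slice(start + 1);
  }
  throw new TypeError(`unknown edit type: ${edit.type}`);
}

// "bc" is selected in "abcd" and the user pastes "X":
console.log(valueAfterEdit('abcd', 1, 3, { type: 'insert', text: 'X' })); // "aXd"
```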

And here's a link to the [dependency-free] code that does the magic:

https://gitlab.com/osfda/validatepattern.js

(If it happens to generate interest, I'll integrate constructive and practical suggestions and give it a better readme...)

PS: The incremental-regex package posted above by Lucas Trzesniewski:

  • Appears not to have been updated? (I saw signs that it was undergoing modification??)

  • Is not browserified (I tried doing that to it, to kick the tires on it - it was a module mess; anyone else here is welcome to post a browserified version for testing. If it works, I'll integrate it with my input validation hooks and offer it as an alternative solution...) If you succeed in getting it browserified, maybe sharing the exact steps that were needed would also edify everyone on this post. I tried using the esm package to fix version incompatibilities faced by browserify, but it was no go...

I strongly suspect (although I'm not 100% sure) that the general case of this problem has no solution, in the same way as Turing's famous "Halting problem" (see Undecidable problem). And even if there is a solution, it most probably will not be what users actually want, and thus, depending on your strictness, it will result in a bad-to-horrible UX.

Example:

Assume the "target RegEx" is [a,b]*c[a,b]* . Also assume that you produced a "test RegEx" that looks reasonable at first glance, [a,b]*c?[a,b]* (obviously two c in the string is invalid, yeah?), and assume that the current user input is aabcbb , but there is a typo, because what the user actually wanted is aacbbb . There are many possible ways to fix this typo:

  • remove c and add it before the first b - will work OK
  • remove the first b and add it after c - will work OK
  • add c before the first b and then remove the old one - Oops, we prohibit this input as invalid, and the user will go crazy because no normal human can understand such logic.

Note also that your hitEnd will have the same problem here, unless you prohibit the user from entering characters in the middle of the input box, which would be another way to make a horrible UI.

In real life there would be many, much more complicated examples that none of your smart heuristics would be able to account for properly, and that would thus upset users.

So what to do? I think the only thing you can do and still get a reasonable UX is the simplest thing you can do, i.e. just analyze your "target RegEx" for the set of allowed characters and make your "test RegEx" [set of allowed chars]* . And yes, if the "target RegEx" contains the . wildcard, you will not be able to do any reasonable filtering at all.
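
A minimal sketch of that last suggestion, assuming the set of allowed characters has already been worked out by hand (the helper name allowedCharsFilter is made up for this illustration, and extracting the set from an arbitrary regex automatically is not attempted here):

```javascript
// Hypothetical helper: builds the "test RegEx" ^[allowed chars]*$ from a
// string listing every character the "target RegEx" can ever accept.
function allowedCharsFilter(allowedChars) {
  // Escape the characters that are special inside a character class.
  const escaped = allowedChars.replace(/[\\\]^-]/g, '\\$&');
  return new RegExp(`^[${escaped}]*$`);
}

// For the target [a,b]*c[a,b]* the allowed characters are a , b and c.
const typingFilter = allowedCharsFilter('a,bc');
console.log(typingFilter.test('aacbcbb')); // true: a second c is tolerated while editing
console.log(typingFilter.test('aabd'));    // false: d can never appear
```

The strict "target RegEx" is then applied only when the user leaves the field or submits the form.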
