
Regex: Find a word that consists of certain characters

I have a list of dictionary words. I would like to find any word that consists of some or all of the characters of a source word, in any order:

For example:

Characters (source word) to look for: stainless

Found words: stainless, stain, net, ten, less, sail, sale, tale, tales, ants, etc.

Also, if a letter occurs only once in the source word, it can't be repeated in the found word.

Unacceptable words: tent (t is repeated), tall (l is repeated), etc.

Acceptable words: less (s is already repeated in the source word), etc.

You could take this approach:

  • Match any sequence of characters that occur in the search word, requiring that the match is a whole word (word boundaries).
  • Using a negative look-ahead, prohibit any character from occurring more often than it does in the search word. Do this for every character in the search word.

For the given example, the regular expression would be:

(?!(\S*s){4}|(\S*t){2}|(\S*a){2}|(\S*i){2}|(\S*n){2}|(\S*l){2}|(\S*e){2})\b[stainless]+\b
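
As a quick check, the pattern can be run against a few of the example words. This is only a sketch: it uses JavaScript's regex engine rather than Notepad++'s, the names `pattern`, `words` and `found` are illustrative, and it assumes every entry is a single word without spaces or punctuation.

```javascript
// The pattern from above, written as a JavaScript regex literal.
const pattern = /(?!(\S*s){4}|(\S*t){2}|(\S*a){2}|(\S*i){2}|(\S*n){2}|(\S*l){2}|(\S*e){2})\b[stainless]+\b/;

// Illustrative candidates: some acceptable, some not.
const words = ["stainless", "stain", "net", "tent", "tall", "less", "tales", "table"];

// Keep only the words the pattern accepts.
const found = words.filter((word) => pattern.test(word));

console.log(found);
// → [ 'stainless', 'stain', 'net', 'less', 'tales' ]
```

Here "tent" and "tall" are rejected by the look-ahead (repeated t and l), while "table" is rejected by the character class because "b" is not in the source word.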

Most of the pattern is the negative look-ahead. For example:

  • (\S*s){4} matches when an 's' occurs four times in a single word.
  • (?! | ) places these patterns as alternatives inside a negative look-ahead, so that none of them may match.
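
The two building blocks can also be tried in isolation. The following is a small sketch in JavaScript; the variable names are made up for illustration, and the combined look-ahead is anchored with `^` here only so that it is tested at the start of each word.

```javascript
// One alternative on its own: does a word contain "t" at least twice?
const twoTs = /(\S*t){2}/;
console.log(twoTs.test("tent")); // true
console.log(twoTs.test("ten"));  // false

// The "s" alternative needs four occurrences, so "stainless" (three) passes.
const fourSs = /(\S*s){4}/;
console.log(fourSs.test("stainless")); // false
console.log(fourSs.test("assess"));    // true

// Wrapped in a negative look-ahead, a word passes only if no alternative matches.
const noExcess = /^(?!(\S*t){2}|(\S*s){4})/;
console.log(noExcess.test("less")); // true
console.log(noExcess.test("tent")); // false
```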

Automation

It is clear that building such a regular expression for a given word takes some work, so this is where automation helps. Notepad++ cannot help with that, but a programming environment can. Here is a little JavaScript snippet that produces the regular expression for a given search word:

JavaScript:

    function regClassEscape(s) {
        // Escape "\", "]", "^" and "-" so they are literal inside a character class:
        return s.replace(/[\\\]^-]/g, "\\$&");
    }

    function buildRegex(searchWord) {
        // Get the frequency of each letter:
        let freq = {};
        for (let ch of searchWord) {
            ch = regClassEscape(ch);
            freq[ch] = (freq[ch] ?? 0) + 1;
        }
        // Produce the negative options (too many occurrences):
        const forbidden = Object.entries(freq).map(([ch, count]) =>
            "(\\S*[" + ch + "]){" + (count + 1) + "}"
        ).join("|");
        // Produce the character set:
        const allowed = Object.keys(freq).join("");
        return "(?!" + forbidden + ")\\b[" + allowed + "]+\\b";
    }

    // I/O management
    const [input, output] = document.querySelectorAll("input,div");
    input.addEventListener("input", refresh);

    function refresh() {
        if (/\s/.test(input.value)) {
            output.textContent = "Input should have no white space!";
        } else {
            output.textContent = buildRegex(input.value);
        }
    }
    refresh();

CSS:

    input { width: 100% }

HTML:

    Search word:<br>
    <input value="stainless">
    Regular expression:
    <div></div>
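
To cross-check what a generated regular expression accepts, the same rule can also be expressed without a regex by comparing letter counts. This is a different technique from the one in this answer, sketched here only for verification; `letterCounts`, `canBuild` and the candidate list are hypothetical names and data chosen for illustration.

```javascript
// Count how often each character occurs in a string.
function letterCounts(text) {
  const counts = new Map();
  for (const ch of text) {
    counts.set(ch, (counts.get(ch) ?? 0) + 1);
  }
  return counts;
}

// True if `word` is non-empty and uses only characters of `source`,
// each no more often than `source` contains it.
function canBuild(word, source) {
  if (word.length === 0) return false;
  const available = letterCounts(source);
  for (const [ch, count] of letterCounts(word)) {
    if (count > (available.get(ch) ?? 0)) return false;
  }
  return true;
}

// Illustrative candidates, checked against the source word "stainless".
const candidates = ["stainless", "stain", "net", "tent", "tall", "less", "tales", "table"];
const buildable = candidates.filter((w) => canBuild(w, "stainless"));

console.log(buildable);
// → [ 'stainless', 'stain', 'net', 'less', 'tales' ]
```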
