Find in string where words from array are next to each other

Say I have a sentence or two in a string, and I have an array of words. I need to find anywhere in the string where two or more words from the array are next to each other.

Example:

Words: ['cat','dog','and','the']

String: There is a dog and cat over there. The cat likes the dog.

Result: ['dog and cat','the dog','the cat']

The only way I've been able to do this is by manually specifying the possible combinations, but only for 3 words max, as it gets long fast.

You can use two pointers to iterate over the array, keeping track of the beginning and end of each sequence of words that are included in the words array. Here the string is first transformed into an array of lowercase words with punctuation removed (you would need to expand on the characters to remove).

const words = ['cat', 'dog', 'and', 'the'],
  string = 'There is a dog and cat over there. The cat likes the dog.';

// Lowercase, strip periods and commas, then split into words
let stringArray = string.toLowerCase().replace(/[.,]/g, '').split(' '),
  start = 0,
  end = 0,
  result = [];

while (start < stringArray.length) {
  if (words.includes(stringArray[start])) {
    // Advance end past every consecutive word found in the words array
    end = start + 1;
    while (words.includes(stringArray[end])) {
      end++
    }
    // Keep only runs of two or more words
    if (end - start >= 2) {
      result.push(stringArray.slice(start, end).join(' '));
    }
    start = end;
  }
  start++
}

console.log(result)
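The remark about expanding the set of removed characters could be handled with a Unicode property escape instead of an explicit list. The following is only a sketch of the same two-pointer scan under that assumption: `findWordRuns` is a hypothetical name, and treating every character that is not a letter, digit, or whitespace as punctuation is a choice made here, not something the answer prescribes.

```javascript
// Hypothetical helper: the same two-pointer scan, with broader punctuation stripping.
function findWordRuns(words, string) {
  const wordSet = new Set(words.map(w => w.toLowerCase()));
  // Assumption: anything that is not a letter, digit, or whitespace is punctuation.
  const tokens = string
    .toLowerCase()
    .replace(/[^\p{L}\p{N}\s]/gu, '')
    .split(/\s+/)
    .filter(Boolean); // drop empty tokens left by leading/trailing whitespace

  const result = [];
  let start = 0;
  while (start < tokens.length) {
    if (!wordSet.has(tokens[start])) {
      start++;
      continue;
    }
    // Advance end past every consecutive token that is in the word set
    let end = start + 1;
    while (end < tokens.length && wordSet.has(tokens[end])) end++;
    // Keep only runs of two or more words
    if (end - start >= 2) result.push(tokens.slice(start, end).join(' '));
    start = end;
  }
  return result;
}

console.log(findWordRuns(['cat', 'dog', 'and', 'the'], 'Is the dog, and the cat, "there"? The cat!'));
// logs an array containing 'the dog and the cat' and 'the cat'
```

Note that this also strips apostrophes (so "don't" becomes "dont") and, like the snippet above, it does not treat sentence boundaries as breaks between runs.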

This also works for the corner case where 2 consecutive words come between the ending of a sentence and the beginning of a new one. Something like "A cat. The watcher" will not match, because technically they are not consecutive words. There is a dot between them.

The code treats a dot like a "word": every dot in the text is replaced by a dot with a space on both sides, as in " . ". Thus, the dots act as "connection words" between sentences. This removes the need for special treatment of the corner case, because two words with a dot between them will never match as 2 consecutive words. Any extra spaces are then removed, and the text is split into words:

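const words = ['cat', 'dog', 'and', 'the']
const text = 'There is a dog and cat over there. A cat. The cat likes the dog.'
// Surround each dot with spaces so it becomes its own token, collapse repeated spaces, then split
const xs = text.toLowerCase().replace(/\./g, ' . ').replace(/ +(?= )/g, '').split(' ')

const result = []
let matched = []

xs.forEach(x => {
    if (words.includes(x))
        matched.push(x)
    else {
        // Any other token (including ".") ends the current run
        if (matched.length > 1)
            result.push(matched.join(' '))
        matched = []
    }
})

// Flush a run that reaches the end of the text (when the text does not end with a dot)
if (matched.length > 1)
    result.push(matched.join(' '))

console.log(result)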

Result: ['dog and cat', 'the cat', 'the dog']
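To see why a dot blocks a match, it can help to look at the token list itself. The sketch below is illustrative only: `tokenizeWithBoundaries` is a hypothetical name, and extending the idea from "." to "!" and "?" is an assumption beyond what the answer does.

```javascript
// Hypothetical helper: sentence-ending marks become standalone tokens.
function tokenizeWithBoundaries(text) {
  return text
    .toLowerCase()
    .replace(/[.!?]/g, ' $& ') // assumption: ".", "!" and "?" all end a sentence
    .trim()
    .split(/\s+/);             // splitting on runs of whitespace also collapses extra spaces
}

console.log(tokenizeWithBoundaries('A cat. The watcher'));
// tokens: a, cat, ., the, watcher
```

Because "." sits between "cat" and "the" in the token list, a scan for consecutive matching tokens stops at the sentence boundary.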

I'd do it with two reduces: one that groups successive words in the target set by accumulating them in arrays, and another that rejects the empty arrays (where runs end) and any single-word runs, then joins the remaining sets...

const words = ['cat','dog','and','the'];
const wordSet = new Set(words); // optional for O(1) lookup
const string = 'There is a dog and cat over there. The cat likes the dog.';
// split on spaces and periods, force lower case
const tokens = string.split(/[ .]+/).map(t => t.toLowerCase());

const result = tokens
  .reduce((acc, word) => {
    // extend the current run on a match, otherwise start a new (empty) run
    if (wordSet.has(word)) acc[acc.length-1].push(word);
    else acc.push([]);
    return acc;
  }, [[]])
  .reduce((acc, run) => {
    // keep only runs of two or more words, as the question asks
    if (run.length > 1) acc.push(run.join(' '));
    return acc;
  }, []);

console.log(result);

This problem could be approached by 'walking through' the sentence, beginning at each word and continuing each pass until the word in the sentence is no longer present in the array.

For example, the first iteration would start at the first word of the sentence and check whether it's in the array. If it is not in the array, begin again at the second word. If the word is present, check the next one, ending if it's not in the array, or continuing if it is.

Two while loops allow for this. Non-alphabet characters such as punctuation are removed for the presence test with a regex replace call, while capitals are changed to lower case for the comparison:

sentenceWordArray[position].toLowerCase().replace(/[^a-z]+/g, '')

A break statement is required in the inner while loop to prevent an out-of-bounds error should the position exceed the length of the sentence word array.

Working snippet:

const words = ['cat','dog','and','the'];
const sentence = "There is a dog and cat over there. The cat likes the dog.";

function matchWordRuns(sentence, dictionary) {
  const sentenceWordArray = sentence.split(" ");
  const results = [];
  let position = 0;
  const currentSearch = [];

  while (position < sentenceWordArray.length) {
    while (dictionary.indexOf(sentenceWordArray[position].toLowerCase().replace(/[^a-z]+/g, '')) > -1) {
      currentSearch.push(sentenceWordArray[position].toLowerCase().replace(/[^a-z]+/g, ''));
      position++;
      if (position >= sentenceWordArray.length) {
        break;
      }
    } // end while word matched

    // keep only runs of two or more words
    if (currentSearch.length > 1) {
      results.push(currentSearch.join(" "));
    } // end if

    position++;
    currentSearch.length = 0; // empty array
  } // end while, search over

  return results;
} // end function

console.log(matchWordRuns(sentence, words));

/* result: [ "dog and cat", "the cat", "the dog" ] */

Same idea as pilchard's, with several refinements:

  • Using a regular expression with Unicode character classes to know what "letters" are and where sentences end; consequently, we don't need to list punctuation explicitly, and it should work on any language (e.g. "日本語!", which has no "." and does not match [a-z])

  • The result is made from substrings of the original string, so it preserves case and intervening punctuation (which may or may not be what OP wants; pass it again through .toLowerCase and .replace, if necessary)

  • Set for efficiency (assuming string and words are long enough to make it worth it)

  • Generator function for more flexibility and just because I don't see them often :P

  • Processes sentences separately, so it does not detect "cat. The dog"

const words = ['cat','dog','and','the'];
const string = "There is a dog and cat over there. The cat likes the dog.";

function* findConsecutive(words, string) {
  const wordSet = new Set(words.map(word => word.toLowerCase()));
  // split into sentences on any run of sentence-ending punctuation
  const sentences = string.split(/\s*\p{Sentence_Terminal}+\s*/u);
  for (const sentence of sentences) {
    // count tracks the run length so that single words are not reported
    let start = null, end, count = 0, match;
    const re = /\p{Letter}+/gu;
    while ((match = re.exec(sentence)) !== null) {
      if (wordSet.has(match[0].toLowerCase())) {
        start ??= match.index;
        end = match.index + match[0].length;
        count++;
      } else {
        if (count > 1) {
          yield sentence.substring(start, end);
        }
        start = null;
        count = 0;
      }
    }
    // flush a run that reaches the end of the sentence
    if (count > 1) {
      yield sentence.substring(start, end);
    }
  }
}

console.log([...findConsecutive(words, string)]);
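If lowercase, punctuation-free output is wanted instead, the post-processing mentioned in the list above might look like the sketch below. `normalize` is a hypothetical helper, and the sample inputs are made up to show a run that kept its original case and one that kept an intervening comma.

```javascript
// Hypothetical post-processing step: lowercase a run and strip everything
// that is neither a letter nor whitespace, then tidy up the spacing.
const normalize = run =>
  run
    .toLowerCase()
    .replace(/[^\p{Letter}\s]+/gu, '')
    .replace(/\s+/g, ' ')
    .trim();

console.log(['The cat', 'dog, and cat'].map(normalize));
// logs an array containing 'the cat' and 'dog and cat'
```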
