Parsing strings: extracting words and phrases [JavaScript]
I need to support exact phrases (enclosed in quotes) in an otherwise space-separated list of terms. Splitting the string on the space character is therefore no longer sufficient.
Example:
input : 'foo bar "lorem ipsum" baz'
output: ['foo', 'bar', 'lorem ipsum', 'baz']
I wonder whether this could be achieved with a single RegEx, rather than performing complex parsing or split-and-rejoin operations.
Any help would be greatly appreciated!
var str = 'foo bar "lorem ipsum" baz';
var results = str.match(/("[^"]+"|[^"\s]+)/g);
... returns the array you're looking for.
Note, however:
Bounding quotes are included in the results; they can be removed by running replace(/^"([^"]+)"$/,"$1") on each result.
If there are three spaces between lorem and ipsum, they'll be in the result. You can fix this by running replace(/\s+/g," ") on each result.
If there's no closing " after ipsum (i.e. an incorrectly quoted phrase), you'll end up with: ['foo', 'bar', 'lorem', 'ipsum', 'baz']
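The first two caveats can be handled in a single pass over the matches. This is a minimal sketch, assuming the input is correctly quoted; the helper name parseTerms is made up for illustration and is not part of the answer above.

```javascript
// Hypothetical helper (not from the answer above): tokenize, then tidy each token.
function parseTerms(str) {
  // match() returns null when nothing matches, so fall back to an empty array.
  const tokens = str.match(/"[^"]+"|[^"\s]+/g) || [];
  return tokens.map(function (token) {
    return token
      .replace(/^"([^"]+)"$/, "$1") // strip the bounding quotes
      .replace(/\s+/g, " ");        // collapse runs of whitespace inside a phrase
  });
}

console.log(parseTerms('foo bar "lorem   ipsum" baz'));
// → [ 'foo', 'bar', 'lorem ipsum', 'baz' ]
```

The g flag on the whitespace pattern matters: without it, only the first run of spaces in a phrase would be collapsed.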
Try this:
var input = 'foo bar "lorem ipsum" baz';
var R = /(\w|\s)*\w(?=")|\w+/g;
var output = input.match(R);
output is ["foo", "bar", "lorem ipsum", "baz"]
Note that there are no extra double quotes around lorem ipsum.
It does assume, however, that the input has the double quotes in the right place:
var input2 = 'foo bar lorem ipsum" baz'; var output2 = input2.match(R);
var input3 = 'foo bar "lorem ipsum baz'; var output3 = input3.match(R);
output2 is ["foo bar lorem ipsum", "baz"]
output3 is ["foo", "bar", "lorem", "ipsum", "baz"]
And it won't handle escaped double quotes (is that a problem?):
var input4 = 'foo b\"ar bar\" \"bar "lorem ipsum" baz';
var output4 = input4.match(R);
output4 is ["foo b", "ar bar", "bar", "lorem ipsum", "baz"]
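If escaped quotes do matter, one option is to let the quoted alternative accept a backslash followed by any character. This is a sketch of a different pattern, not the regex above, and it assumes a backslash is only meaningful inside a quoted phrase; tokenize is an illustrative name.

```javascript
// Illustrative alternative: a quoted phrase may contain backslash-escaped characters.
function tokenize(input) {
  const pattern = /"((?:\\.|[^"\\])*)"|([^\s"]+)/g;
  const terms = [];
  let m;
  // With the g flag, each exec() call resumes after the previous match.
  while ((m = pattern.exec(input)) !== null) {
    if (m[1] !== undefined) {
      terms.push(m[1].replace(/\\(.)/g, "$1")); // quoted phrase: drop the escaping backslashes
    } else {
      terms.push(m[2]); // bare word, kept verbatim
    }
  }
  return terms;
}

// String.raw keeps the backslashes in the sample input.
console.log(tokenize(String.raw`foo "lorem \"ipsum\" dolor" baz`));
// → [ 'foo', 'lorem "ipsum" dolor', 'baz' ]
```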
A simple regular expression will do, but it leaves the quotation marks in place, e.g.:
'foo bar "lorem ipsum" baz'.match(/("[^"]*")|([^\s"]+)/g)
output: ['foo', 'bar', '"lorem ipsum"', 'baz']
Edit: beaten to it by shyamsundar, sorry for the double answer.
Thanks a lot for the quick responses!
Here's a summary of the options, for posterity:
var input = 'foo bar "lorem ipsum" baz';
output = input.match(/("[^"]+"|[^"\s]+)/g);
output = input.match(/"[^"]*"|\w+/g);
output = input.match(/("[^"]*")|([^\s"]+)/g)
output = /(".+?"|\w+)/g.exec(input);
output = /"(.+?)"|(\w+)/g.exec(input);
For the record, here's the abomination I had come up with:
var input = 'foo bar "lorem ipsum" "dolor sit amet" baz';
var terms = input.split(" ");
var items = [];
var buffer = [];
for(var i = 0; i < terms.length; i++) {
    if(terms[i].indexOf('"') != -1) { // outer phrase fragment -- N.B.: assumes quote is either first or last character
        if(buffer.length === 0) { // beginning of phrase
            //console.log("start:", terms[i]);
            buffer.push(terms[i].substr(1));
        } else { // end of phrase
            //console.log("end:", terms[i]);
            buffer.push(terms[i].substr(0, terms[i].length - 1));
            items.push(buffer.join(" "));
            buffer = [];
        }
    } else if(buffer.length != 0) { // inner phrase fragment
        //console.log("cont'd:", terms[i]);
        buffer.push(terms[i]);
    } else { // individual term
        //console.log("standalone:", terms[i]);
        items.push(terms[i]);
    }
    //console.log(items, "\n", buffer);
}
items = items.concat(buffer);
//console.log(items);
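ES6 solution supporting splitting on spaces outside quotes, removing the quotes themselves, and turning backslash-escaped quotes into literal quotes.
Code: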
input.match(/\\?.|^$/g).reduce((p, c) => {
    if(c === '"'){
        p.quote ^= 1;
    }else if(!p.quote && c === ' '){
        p.a.push('');
    }else{
        p.a[p.a.length-1] += c.replace(/\\(.)/,"$1");
    }
    return p;
}, {a: ['']}).a
Output:
[ 'foo', 'bar', 'lorem ipsum', 'baz' ]
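To see the escape handling, the same reducer can be wrapped in a function and fed an input containing backslash-escaped quotes. The name splitTerms is chosen here for illustration only.

```javascript
// Same reducer as above, wrapped in an illustrative function.
const splitTerms = (input) =>
  input.match(/\\?.|^$/g).reduce((p, c) => {
    if (c === '"') {
      p.quote ^= 1;                 // toggle the "inside quotes" state
    } else if (!p.quote && c === ' ') {
      p.a.push('');                 // an unquoted space starts a new term
    } else {
      p.a[p.a.length - 1] += c.replace(/\\(.)/, "$1"); // append, dropping an escaping backslash
    }
    return p;
  }, { a: [''] }).a;

// String.raw keeps the backslashes in the sample input.
console.log(splitTerms(String.raw`foo "lorem \"ipsum\"" baz`));
// → [ 'foo', 'lorem "ipsum"', 'baz' ]
```

The first pattern splits the input into single characters or backslash-plus-character pairs, which is why an escaped quote never reaches the branch that toggles the quote state.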
'foo bar "lorem ipsum" baz'.match(/"[^"]*"|\w+/g);
The bounding quotes get included, though.
How about:
output = /(".+?"|\w+)/g.exec(input)
Then do a pass on the output to lose the quotes.
Alternatively:
output = /"(.+?)"|(\w+)/g.exec(input)
Then do a pass on the output to lose the empty captures.
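Note that exec returns one match per call, even with the g flag, so collecting every term takes a loop or String.prototype.matchAll. The following is a sketch of the second variant, assuming an environment that provides matchAll (ES2020); the || picks whichever capture group is non-empty.

```javascript
const input = 'foo bar "lorem ipsum" baz';

// Group 1 holds a quoted phrase without its quotes, group 2 a bare word.
const output = Array.from(
  input.matchAll(/"(.+?)"|(\w+)/g),
  (m) => m[1] || m[2] // keep the capture that matched, drop the empty one
);

console.log(output);
// → [ 'foo', 'bar', 'lorem ipsum', 'baz' ]
```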
One that's easy to understand, and a general solution. It works for all delimiters and 'join' characters, and also supports 'joined' phrases that are more than two words long, i.e. lists like:
"hello my name is 'jon delaware smith fred' I have a 'long name'"
A bit like the answer by AC, but a bit neater:
function split(input, delimiter, joiner){
    var output = [];
    var joint = [];
    input.split(delimiter).forEach(function(element){
        if (joint.length > 0 && element.indexOf(joiner) === element.length - 1)
        {
            output.push(joint.join(delimiter) + delimiter + element);
            joint = [];
        }
        if (joint.length > 0 || element.indexOf(joiner) === 0)
        {
            joint.push(element);
        }
        if (joint.length === 0 && element.indexOf(joiner) !== element.length - 1)
        {
            output.push(element);
            joint = [];
        }
    });
    return output;
}
This might be a very late answer, but I was interested in answering:
([\w]+|\"[\w\s]+\")
http://regex101.com/r/dZ1vT6/72
Pure JavaScript example:
'The rain in "SPAIN stays" mainly in the plain'.match(/[\w]+|\"[\w\s]+\"/g)
Outputs:
["The", "rain", "in", "\"SPAIN stays\"", "mainly", "in", "the", "plain"]
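Expanding on the accepted answer, here's a search engine parser that matches words or quoted phrases, treats phrases as regular expressions, looks for each keyword in any of several fields of an item (e.g. item.title and item.body), and negates a word or phrase prefixed with -.
Treating phrases as regular expressions makes the UI simpler for my purposes.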
const matchOrIncludes = (str, search, useMatch = true) => {
  if (useMatch) {
    let result = false
    try {
      result = str.match(search)
    } catch (err) {
      // an invalid regular expression counts as "no match"
      return false
    }
    return result
  }
  return str.includes(search)
}

const itemMatches = (item, searchString, fields) => {
  // collapse repeated whitespace, lowercase, then split into words and (optionally negated) quoted phrases
  const keywords = searchString.toString().replace(/\s\s+/g, ' ').trim().toLocaleLowerCase().match(/(-?"[^"]+"|[^"\s]+)/g) || []
  for (let i = 0; i < keywords.length; i++) {
    const negateWord = keywords[i].startsWith('-') ? true : false
    let word = keywords[i].replace(/^-/,'')
    const isPhraseRegex = word.startsWith('"') ? true : false
    if (isPhraseRegex) {
      word = word.replace(/^"(.+)"$/,"$1")
    }
    let word_in_item = false
    for (const field of fields) {
      if (item[field] && matchOrIncludes(item[field].toLocaleLowerCase(), word, isPhraseRegex)) {
        word_in_item = true
        break
      }
    }
    if ((! negateWord && ! word_in_item) || (negateWord && word_in_item)) {
      return false
    }
  }
  return true
}

const item = {title: 'My title', body: 'Some text'}
console.log(itemMatches(item, 'text', ['title', 'body']))