
Parsing strings: extracting words and phrases [JavaScript]

I need to support exact phrases (enclosed in quotes) in an otherwise space-separated list of terms. Thus splitting the respective string by the space character is not sufficient anymore.

Example:

input : 'foo bar "lorem ipsum" baz'
output: ['foo', 'bar', 'lorem ipsum', 'baz']

I wonder whether this could be achieved with a single RegEx, rather than performing complex parsing or split-and-rejoin operations.

Any help would be greatly appreciated!

var str = 'foo bar "lorem ipsum" baz';  
var results = str.match(/("[^"]+"|[^"\s]+)/g);

... returns the array you're looking for.
Note, however:

  • Bounding quotes are included, so they can be removed with replace(/^"([^"]+)"$/,"$1") on each result.
  • Spaces between the quotes will stay intact. So, if there are three spaces between lorem and ipsum, they'll be in the result. You can fix this by running replace(/\s+/g," ") on each result.
  • If there's no closing " after ipsum (i.e. an incorrectly quoted phrase) you'll end up with: ['foo', 'bar', 'lorem', 'ipsum', 'baz']
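
The two clean-up steps from the notes above can be folded into a single pass over the matches. This is only a minimal sketch: the function name parseTerms is made up for illustration, and the fallback to an empty array for input without matches is an assumption about the desired behaviour.

```javascript
// Hypothetical helper (not part of the answer): split a query into
// words and quoted phrases, then clean each result.
function parseTerms(str) {
    // match() returns null when nothing matches, so fall back to an empty array
    var matches = str.match(/("[^"]+"|[^"\s]+)/g) || [];
    return matches.map(function (term) {
        return term
            .replace(/^"([^"]+)"$/, "$1") // strip the bounding quotes
            .replace(/\s+/g, " ");        // collapse runs of whitespace inside phrases
    });
}

console.log(parseTerms('foo bar "lorem   ipsum" baz'));
// → [ 'foo', 'bar', 'lorem ipsum', 'baz' ]
```

The g flag on the whitespace replacement matters: without it only the first run of whitespace in a phrase would be collapsed.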

Try this:

var input = 'foo bar "lorem ipsum" baz';
var R =  /(\w|\s)*\w(?=")|\w+/g;
var output = input.match(R);

output is ["foo", "bar", "lorem ipsum", "baz"]

Note there are no extra double quotes around lorem ipsum.

However, it assumes the input has the double quotes in the right place:

var input2 = 'foo bar lorem ipsum" baz'; var output2 = input2.match(R);
var input3 = 'foo bar "lorem ipsum baz'; var output3 = input3.match(R);

output2 is ["foo bar lorem ipsum", "baz"]
output3 is ["foo", "bar", "lorem", "ipsum", "baz"]

It also won't handle escaped double quotes (is that a problem?):

var input4 = 'foo b\"ar  bar\" \"bar "lorem ipsum" baz';
var output4 = input4.match(R);

output4 is ["foo b", "ar  bar", "bar", "lorem ipsum", "baz"]

A simple regular expression will do, but it leaves the quotation marks in place, e.g.

'foo bar "lorem ipsum" baz'.match(/("[^"]*")|([^\s"]+)/g)
output:   ['foo', 'bar', '"lorem ipsum"', 'baz']

edit: beaten to it by shyamsundar, sorry for the double answer

Thanks a lot for the quick responses!

Here's a summary of the options, for posterity:

var input = 'foo bar "lorem ipsum" baz';

output = input.match(/("[^"]+"|[^"\s]+)/g);
output = input.match(/"[^"]*"|\w+/g);
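output = input.match(/("[^"]*")|([^\s"]+)/g);
output = /(".+?"|\w+)/g.exec(input);   // note: exec() returns one match per call, so call it in a loop
output = /"(.+?)"|(\w+)/g.exec(input); // same here: loop until exec() returns null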

For the record, here's the abomination I had come up with:

var input = 'foo bar "lorem ipsum" "dolor sit amet" baz';
var terms = input.split(" ");

var items = [];
var buffer = [];
for(var i = 0; i < terms.length; i++) {
    if(terms[i].indexOf('"') != -1) { // outer phrase fragment -- N.B.: assumes quote is either first or last character
        if(buffer.length === 0) { // beginning of phrase
            //console.log("start:", terms[i]);
            buffer.push(terms[i].substr(1));
        } else { // end of phrase
            //console.log("end:", terms[i]);
            buffer.push(terms[i].substr(0, terms[i].length - 1));
            items.push(buffer.join(" "));
            buffer = [];
        }
    } else if(buffer.length != 0) { // inner phrase fragment
        //console.log("cont'd:", terms[i]);
        buffer.push(terms[i]);
    } else { // individual term
        //console.log("standalone:", terms[i]);
        items.push(terms[i]);
    }
    //console.log(items, "\n", buffer);
}
items = items.concat(buffer);

//console.log(items);

ES6 solution supporting:

  • Splitting by space, except inside quotes
  • Removing quotes, except backslash-escaped ones
  • Turning escaped quotes into plain quotes

Code:

input.match(/\\?.|^$/g).reduce((p, c) => {
        if(c === '"'){
            p.quote ^= 1;
        }else if(!p.quote && c === ' '){
            p.a.push('');
        }else{
            p.a[p.a.length-1] += c.replace(/\\(.)/,"$1");
        }
        return  p;
    }, {a: ['']}).a

Output:

[ 'foo', 'bar', 'lorem ipsum', 'baz' ]
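
To see the escaped-quote handling in action, the expression above can be wrapped in a function and fed an input containing backslash-escaped quotes. A minimal sketch; the name splitTerms and the second test string are illustrative assumptions, not part of the answer:

```javascript
// Hypothetical wrapper around the reduce-based parser shown above.
const splitTerms = (input) =>
    // Tokenize into single characters, keeping a backslash together
    // with the character it escapes (e.g. \" is one token).
    input.match(/\\?.|^$/g).reduce((p, c) => {
        if (c === '"') {
            p.quote ^= 1;                 // unescaped quote toggles the "inside quotes" state
        } else if (!p.quote && c === ' ') {
            p.a.push('');                 // a space outside quotes starts a new term
        } else {
            p.a[p.a.length - 1] += c.replace(/\\(.)/, "$1"); // append, dropping the escape backslash
        }
        return p;
    }, { a: [''] }).a;

console.log(splitTerms('foo bar "lorem ipsum" baz'));
// → [ 'foo', 'bar', 'lorem ipsum', 'baz' ]

// The doubled backslashes put literal \" sequences into the string.
console.log(splitTerms('say \\"hi\\" "to you"'));
// → [ 'say', '"hi"', 'to you' ]
```

Because every unquoted space starts a new term, consecutive spaces outside quotes produce empty strings in the result; filtering them out afterwards is left to the caller.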

'foo bar "lorem ipsum" baz'.match(/"[^"]*"|\w+/g);

The bounding quotes are included, though.

How about:

output = /(".+?"|\w+)/g.exec(input)

then do a pass on output to lose the quotes.

Alternatively:

output = /"(.+?)"|(\w+)/g.exec(input)

then do a pass on output to lose the empty captures.
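
One caveat applies to both variants: exec() returns a single match per call, even with the g flag, so it has to be called repeatedly to collect every term. A minimal sketch of the capture-group variant under that reading; the function name extractTerms is made up for illustration:

```javascript
// Hypothetical helper: collect every match of the capture-group regex.
function extractTerms(input) {
    var re = /"(.+?)"|(\w+)/g;
    var output = [];
    var m;
    // With the g flag, each exec() call resumes at re.lastIndex
    // and returns null once there are no more matches.
    while ((m = re.exec(input)) !== null) {
        // m[1] is set for a quoted phrase, m[2] for a bare word;
        // the other capture is undefined, so keep whichever matched.
        output.push(m[1] !== undefined ? m[1] : m[2]);
    }
    return output;
}

console.log(extractTerms('foo bar "lorem ipsum" baz'));
// → [ 'foo', 'bar', 'lorem ipsum', 'baz' ]
```

Picking the defined capture inside the loop removes the need for a separate pass to drop quotes or empty captures.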

One that's easy to understand and a general solution. Works for all delimiters and 'join' characters. Also supports 'joined' words that are more than two words in length, i.e. lists like

"hello my name is 'jon delaware smith fred' I have a 'long name'"

A bit like the answer by AC, but a bit neater:

function split(input, delimiter, joiner){
    var output = [];
    var joint = [];
    input.split(delimiter).forEach(function(element){
        if (joint.length > 0 && element.indexOf(joiner) === element.length - 1)
        {
            output.push(joint.join(delimiter) + delimiter + element);
            joint = [];
        }
        if (joint.length > 0 || element.indexOf(joiner) === 0)
        {
            joint.push(element);
        }
        if (joint.length === 0 && element.indexOf(joiner) !== element.length - 1)
        {
            output.push(element);
            joint = [];
        }
    });
    return output;
  }

This might be a very late answer, but I am interested in answering:

([\w]+|\"[\w\s]+\")

http://regex101.com/r/dZ1vT6/72

Pure JavaScript example:

 'The rain in "SPAIN stays" mainly in the plain'.match(/[\w]+|\"[\w\s]+\"/g)

Outputs:

["The", "rain", "in", "\"SPAIN stays\"", "mainly", "in", "the", "plain"]

Expanding on the accepted answer, here's a search engine parser that

  • can match phrases or words
  • treats phrases as regular expressions
  • does a boolean OR across multiple properties (e.g. item.title and item.body)
  • handles negation of words or phrases when they are prefixed with -

Treating phrases as regular expressions makes the UI simpler for my purposes.

const matchOrIncludes = (str, search, useMatch = true) => {
    if (useMatch) {
        let result = false
        try {
            // search is compiled as a regular expression; an invalid pattern throws
            result = str.match(search)
        } catch (err) {
            return false
        }
        return result
    }
    return str.includes(search)
}

const itemMatches = (item, searchString, fields) => {
    // collapse whitespace, lowercase, then split into words and (optionally negated) quoted phrases
    const keywords = searchString.toString().replace(/\s\s+/g, ' ').trim().toLocaleLowerCase().match(/(-?"[^"]+"|[^"\s]+)/g) || []
    for (let i = 0; i < keywords.length; i++) {
        const negateWord = keywords[i].startsWith('-')
        let word = keywords[i].replace(/^-/, '')
        const isPhraseRegex = word.startsWith('"')
        if (isPhraseRegex) {
            word = word.replace(/^"(.+)"$/, "$1")
        }
        let word_in_item = false
        // boolean OR across the given fields
        for (const field of fields) {
            if (item[field] && matchOrIncludes(item[field].toLocaleLowerCase(), word, isPhraseRegex)) {
                word_in_item = true
                break
            }
        }
        // every keyword must be present, every negated keyword absent
        if ((!negateWord && !word_in_item) || (negateWord && word_in_item)) {
            return false
        }
    }
    return true
}

const item = {title: 'My title', body: 'Some text'}
console.log(itemMatches(item, 'text', ['title', 'body']))

