Parsing book sources with Regex in JavaScript
I am currently building a parser that is supposed to extract different sources from an absolute mess :) I've been working on it for a couple of days and it's working just fine. However, I ran into a serious problem when trying to parse the last segments of a book entry. There is no character that really helps me separate the parts:
var str = 'John Doe, Max Mustermann, Taro Tanaka, My Mean Title: Some titles are just totally, absolutely, and unnecessarily mean';
As you can see, the string contains names separated by commas, followed by a title that itself contains commas but is not wrapped in quotes. My test data also contains similar versions that look like this:
var str = 'John Doe, Max Mustermann, Taro Tanaka: My Mean Title: Some titles are just totally, absolutely, and unnecessarily mean';
This doesn't make it any easier. What I want is to store the book's title in an object (which already contains date, publisher, ...) and afterwards remove the title from the source string. I'd be very happy if someone could help me out :)
Here's a fiddle to play around with: http://jsfiddle.net/TheFatalist/927645vz/1/ However, I'd recommend using this tool: http://leaverou.github.io/regexplained/
Thanks a lot in advance! I will update the fiddle as soon as I can figure something out.
Edit: To avoid confusion: I am looking for a regex that separates the title from the names, or for another workaround. I hope there is some way to identify the boundary, but I cannot figure it out.
As @nnnnnn states, it's hard to do this in a very reliable manner, but you may get somewhere if you try to match from the end of the string:
var str = 'John Doe, Max Mustermann, Taro Tanaka, My Mean Title: Some titles are just totally, absolutely, and unnecessarily mean';
var str2 = 'John Doe, Max Mustermann, Taro Tanaka: My Mean Title: Some titles are just totally, absolutely, and unnecessarily mean';

// Assume everything after the last colon belongs to the title, together with
// the word characters and whitespace directly before that colon.
// Everything before the title is assumed to be the authors.
var regex = /(.*?)((\w|\s)+:[^:]+)$/;

// Output uses jQuery, as in the fiddle.
var str_match = regex.exec(str);
$('body').append('<br>string: "' + str + '"<br>title: ' + str_match[2] + '<br>authors: ' + str_match[1]);
$('body').append('<br><br>');

var str2_match = regex.exec(str2);
$('body').append('<br>string: "' + str2 + '"<br>title: ' + str2_match[2] + '<br>authors: ' + str2_match[1]);
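For use outside the fiddle, the same match-from-the-end idea could be wrapped in a small function that returns the parts as an object instead of appending them to the page. This is only a sketch under assumptions: the name splitAuthorsAndTitle and the shape of the returned object are made up for illustration, and the character class [\w\s] stands in for the (\w|\s) group, which matches the same characters.

```javascript
// Hypothetical helper (not from the original answer): splits a source
// string into an author list and a title by matching from the end.
function splitAuthorsAndTitle(source) {
  // Group 1: everything before the title (assumed to be the authors).
  // Group 2: a run of word characters/whitespace, the last colon,
  //          and everything after it (assumed to be the title).
  var match = /^(.*?)([\w\s]+:[^:]+)$/.exec(source);
  if (match === null) {
    return null; // no colon-delimited title found
  }
  return {
    // Drop the separator (comma or colon) left at the end of the authors.
    authors: match[1].replace(/[\s,:]+$/, ''),
    title: match[2].trim()
  };
}

var commaVariant = splitAuthorsAndTitle(
  'John Doe, Max Mustermann, Taro Tanaka, My Mean Title: Some titles are just totally, absolutely, and unnecessarily mean'
);
console.log(commaVariant.authors); // John Doe, Max Mustermann, Taro Tanaka
console.log(commaVariant.title);   // My Mean Title: Some titles are just totally, absolutely, and unnecessarily mean
```

The colon-separated variant of the string yields the same two values. Keep in mind that \w only covers ASCII letters, digits and the underscore, so a title whose part before the last colon contains anything else (an apostrophe, a hyphen, an umlaut) would be cut short.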
^(.*?)(?:,(?=[^,]*:)|\s(?=\w+:))(.*)$
Try this and grab the matches. Match 2 contains the title detail.
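As a rough illustration of grabbing the matches in JavaScript, the expression can be passed to exec and the two capture groups copied into an object. The function name parseSource and the property names are assumptions for this sketch, not part of the answer.

```javascript
// Hypothetical wrapper around the regex above.
function parseSource(source) {
  // Group 1: text before the first comma that is followed by a colon
  //          with no other comma in between (or before a space that
  //          precedes "word:").
  // Group 2: everything after that separator.
  var re = /^(.*?)(?:,(?=[^,]*:)|\s(?=\w+:))(.*)$/;
  var m = re.exec(source);
  if (m === null) {
    return null; // the string contains no usable separator
  }
  return { authors: m[1].trim(), title: m[2].trim() };
}

var book = parseSource(
  'John Doe, Max Mustermann, Taro Tanaka, My Mean Title: Some titles are just totally, absolutely, and unnecessarily mean'
);
console.log(book.authors); // John Doe, Max Mustermann, Taro Tanaka
console.log(book.title);   // My Mean Title: Some titles are just totally, absolutely, and unnecessarily mean
```

One thing to watch: on the variant where a colon follows the last author, the comma after "Max Mustermann" already satisfies the first alternative, so "Taro Tanaka" ends up at the start of group 2. That variant probably needs separate handling.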
Or simply use a regex split with this expression to get your results.
See demo:
http://regex101.com/r/kM7rT8/5
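In JavaScript the split variant might look like the sketch below. String.prototype.split includes the capture groups of the separator in its result, and because this expression matches the whole string, the array also gets an empty string at each end, which the sketch filters out. The function name splitSource is an assumption for illustration.

```javascript
// Hypothetical helper: uses the same regex as a split separator.
function splitSource(source) {
  var re = /^(.*?)(?:,(?=[^,]*:)|\s(?=\w+:))(.*)$/;
  return source
    .split(re) // ['', group 1, group 2, ''] when the regex matches
    .filter(function (part) { return part !== ''; })
    .map(function (part) { return part.trim(); });
}

var parts = splitSource(
  'John Doe, Max Mustermann, Taro Tanaka, My Mean Title: Some titles are just totally, absolutely, and unnecessarily mean'
);
console.log(parts[0]); // John Doe, Max Mustermann, Taro Tanaka
console.log(parts[1]); // My Mean Title: Some titles are just totally, absolutely, and unnecessarily mean
```

If the expression does not match at all, split returns the input as a single-element array, so checking the length of the result is a cheap way to detect unparsed strings.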