
Advanced regex: Split string based on multiple variations of different names, retain delimiters in their own array item

I'm trying to build a JavaScript program that swaps multiple variations of names with each other.

For example, if I had a string:

let string = "This is Donald Trump and I am Donald J. Trump and I have replaced Barack Obama and Obama was before me."

I would want the output to be:

newString = "This is Barack Obama and I am Barack H. Obama and I have replaced Donald Trump and Trump was before me."

My strategy was to use

 let arr = string.split(regex)

in such a way that each chunk of text before and after a regex match is its own array item, and each regex match is its own item too. For example:

["This is ", "Donald Trump", " and I am ", "Donald J. Trump", " and I have replaced ", "Barack Obama", " and ", "Obama", " was before me."];

Then check each item of the array to see if it needs to be "switched." For example:

for (let i = 0; i < arr.length; i++) {
  // if arr[i] == Donald J. Trump, Donald Trump, or Trump, arr[i] = equivalent Obama variation
  // else if arr[i] == Barack H. Obama, Barack Obama, or Obama, arr[i] = equivalent Trump variation
  // else arr[i] = arr[i]
}
let newString = arr.join(""); // the chunks already carry their own spaces
htmlElement.innerHTML = newString; // innerHTML is a property, not a method

Here's my regex:

let regex = /\b(Barack\s)?(H\.\s)?Obama|\b(Donald\s)?(J\.\s)?Trump/;

The regex seems to correctly match all variations of the names.

However, when I write

arr = string.split(regex)

my arr looks like this:

["This is ", undefined, undefined, "Donald ", undefined, " and I am ", undefined, undefined, "Donald ", "J. ", " and I have replaced ", "Barack ", undefined, undefined, undefined, " and ", undefined, undefined, undefined, undefined, " was before me."];

Is there a way to split the string by the multiple variations of the delimiter, but also retain the delimiter in its own array item?

Code

I took a different approach to your problem. Instead of searching for specific names, I created a regex that captures full names (assuming each name begins with a capital letter and either has more than one character or is immediately followed by a dot). I then cross-reference each word of the full name (split on spaces) against a nameEquivalents object to find the proper replacement.

Yes, I am aware that the regex will not catch special cases such as names with two-letter abbreviations, apostrophes, hyphens, or a lowercase first letter, but the OP didn't specify that need (and frankly I'm not too worried about it, since my regex can capture more than the OP's original regex, which simply lists the names directly).

Also, note that the getKeyByValue function is taken from this answer on this question.

let string = "This is Donald Trump and I am Donald J. Trump and I have replaced Barack Obama and Obama was before me."
let regex = /(?: ?\b[A-Z](?:[a-zA-Z]+\b|\.))+/g
let nameEquivalents = {
  "Obama": "Trump",
  "Barack": "Donald",
  "H.": "J."
}

// Reverse lookup: find the key whose value equals the given word
function getKeyByValue(object, value) {
  return Object.keys(object).find(key => object[key] === value);
}

let newString = string.replace(regex, function(match) {
  // Split on spaces and keep the empty leading item, so that a space in
  // front of the match survives the join(" ") below
  return match.split(" ").map(function(m) {
    if (nameEquivalents.hasOwnProperty(m)) {
      return nameEquivalents[m]
    }
    // Try the reverse direction; otherwise leave the word unchanged
    return getKeyByValue(nameEquivalents, m) || m
  }).join(" ")
})

console.log(newString)


Explanation

  • (?: ?\b[A-Z](?:[a-zA-Z]+\b|\.))+ Match the following one or more times
    • ? Optionally match a space character (there is a literal space before the ?)
    • \b Assert position as a word boundary
    • [A-Z] Match an uppercase letter
    • (?:[a-zA-Z]+\b|\.) Match either of the following
      • [a-zA-Z]+\b Match any letter one or more times, ensuring it's followed by a word boundary
      • \. Match a literal dot
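To see which substrings the pattern actually picks up, it can be run with String.prototype.match on the question's sample string. This is only an illustrative sketch; the names sample, fullName, and found are made up for this example.

```javascript
const sample = "This is Donald Trump and I am Donald J. Trump and I have replaced Barack Obama and Obama was before me.";

// Same pattern as above: an optional space, a word boundary, an uppercase
// letter, then either more letters or a dot, repeated one or more times.
const fullName = /(?: ?\b[A-Z](?:[a-zA-Z]+\b|\.))+/g;

const found = sample.match(fullName);
console.log(found);
// → ["This", " Donald Trump", " Donald J. Trump", " Barack Obama", " Obama"]
```

Note that any capitalized word such as "This" is matched as well, and that every match after the first includes its leading space. The standalone "I" is skipped because a single uppercase letter must be followed by more letters or a dot.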

I think the parentheses in the regex are being interpreted as capture groups. split() adds one array item per capture group for every match, so any group that did not take part in a match shows up as undefined.
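The effect is easy to reproduce on a tiny input. The sketch below uses a made-up string and made-up patterns purely for illustration; it relies only on the standard behavior of String.prototype.split with capture groups.

```javascript
// Two capture groups: each match contributes two items, one per group.
const twoGroups = "a1b".split(/(\d)|(x)/);
console.log(twoGroups); // → ["a", "1", undefined, "b"]

// Non-capturing groups: the delimiter is dropped entirely.
const noGroups = "a1b".split(/(?:\d)|(?:x)/);
console.log(noGroups); // → ["a", "b"]

// One outer capture group: the delimiter is kept as a single item.
const oneGroup = "a1b".split(/(\d|x)/);
console.log(oneGroup); // → ["a", "1", "b"]
```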

Try making the inner groups non-capturing and wrapping the whole pattern in a single capture group.

 /\b((?:Barack\s)?(?:H\.\s)?Obama|(?:Donald\s)?(?:J\.\s)?Trump)/
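Putting the pieces together, here is a minimal sketch of the split-and-swap strategy from the question. It assumes the optional name parts are non-capturing and the whole alternation sits in one capture group; the swaps table and the names text, namePattern, parts, and swapped are hypothetical and chosen only for this example.

```javascript
const text = "This is Donald Trump and I am Donald J. Trump and I have replaced Barack Obama and Obama was before me.";

// One outer capture group keeps each matched name as a single array item;
// the optional first name and middle initial are non-capturing.
const namePattern = /\b((?:Barack\s)?(?:H\.\s)?Obama|(?:Donald\s)?(?:J\.\s)?Trump)/;

// Hypothetical lookup table of equivalent name variations.
const swaps = new Map([
  ["Donald J. Trump", "Barack H. Obama"],
  ["Donald Trump", "Barack Obama"],
  ["Trump", "Obama"],
]);
// Add the reverse direction so the table works both ways.
for (const [from, to] of [...swaps]) {
  swaps.set(to, from);
}

const parts = text.split(namePattern);
// Replace items found in the table, leave everything else unchanged, and
// join without a separator because the chunks already carry their spaces.
const swapped = parts.map(p => (swaps.has(p) ? swaps.get(p) : p)).join("");

console.log(parts);
console.log(swapped);
// → "This is Barack Obama and I am Barack H. Obama and I have replaced Donald Trump and Trump was before me."
```

A variation the pattern allows but the table does not list (for example "J. Trump") is left as it is; adding it to the table would cover that case.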
