Regex multi-line match after string

I'm trying to extract the PROCEDURE sections from the CLAIM, EOB and COB blocks of a text file and create an object like this:

claim : [{PROCEDURE1}, {PROCEDURE2}, {PROCEDURE3}],
eob : [{PROCEDURE1}, {PROCEDURE2}, {PROCEDURE3}],
cob: [{PROCEDURE1}, {PROCEDURE2}, {PROCEDURE3}]

let data = `    SEND CLAIM {
       PREFIX="9403 "
       PROCEDURE { /* #1 */
          PROCEDURE_LINE="1"
          PROCEDURE_CODE="01201"
       }
       PROCEDURE { /* #2 */
          PROCEDURE_LINE="2"
          PROCEDURE_CODE="02102"
       }
       PROCEDURE { /* #3 */
          PROCEDURE_LINE="3"
          PROCEDURE_CODE="21222"
       }
    }
    SEND EOB {
       PREFIX="9403 "
       OFFICE_SEQUENCE="000721"
       PROCEDURE { /* #1 */
          PROCEDURE_LINE="1"
          ELIGIBLE="002750"
       }
       PROCEDURE { /* #2 */
          PROCEDURE_LINE="2"
          ELIGIBLE="008725"
       }
       PROCEDURE { /* #3 */
          PROCEDURE_LINE="3"
          ELIGIBLE="010200"
       }
    }
    SEND COB {
       PREFIX="TEST4 "
       OFFICE_SEQUENCE="000721"
       PROCEDURE { /* #1 */
          PROCEDURE_LINE="1"
          PROCEDURE_CODE="01201"
       }
       PROCEDURE { /* #2 */
          PROCEDURE_LINE="2"
          PROCEDURE_CODE="02102"
       }
       PROCEDURE { /* #3 */
          PROCEDURE_LINE="3"
          PROCEDURE_CODE="21222"
          DATE="19990104"
       }
       PRIME_EOB=SEND EOB {
          PREFIX="9403 "
          OFFICE_SEQUENCE="000721"
          PROCEDURE { /* #1 */
             PROCEDURE_LINE="1"
             ELIGIBLE="002750"
          }
          PROCEDURE { /* #2 */
             PROCEDURE_LINE="2"
             ELIGIBLE="008725"
          }
          PROCEDURE { /* #3 */
             PROCEDURE_LINE="3"
             ELIGIBLE="010200"
          }
       }
    }`;

let re = /(^\s+PROCEDURE\s\{)([\S\s]*?)(?:})/gm;
console.log(data.match(re));

Here is what I have tried so far: (^\s+PROCEDURE\s\{)([\S\s]*?)(?:}), but I can't figure out how to match only the PROCEDURE blocks that follow a given key such as CLAIM or EOB.

For "claim", you could use the following regular expression.

/(?<=^ *SEND CLAIM +\{\r?\n(?:^(?! *SEND EOB *\{)(?! *SEND COB *\{).*\r?\n)*^ *PROCEDURE *)\{[^\}]*\}/gm

This matches the following strings, which I assume can easily be saved to an array with a little JavaScript code.

         { /* CLAIM #1  */  
   PROCEDURE_LINE="1"
   PROCEDURE_CODE="01201"
    
}

          { /* CLAIM #2  */
   PROCEDURE_LINE="2"
   PROCEDURE_CODE="02102"
  
}

          { /* CLAIM #3  */
   PROCEDURE_LINE="3"
   PROCEDURE_CODE="21222"
   
}

JavaScript's regex engine performs the following operations.

(?<=                 : begin positive lookbehind
  ^                  : match beginning of line
  \ *SEND CLAIM\ +   : match 'SEND CLAIM' preceded by 0+ and followed by 1+ spaces
  \{\r?\n            : match '{' then line terminators
  (?:                : begin non-capture group
    ^                : match beginning of line
    (?!              : begin negative lookahead
      \ *SEND EOB\ * : match 'SEND EOB' surrounded by 0+ spaces
      \{             : match '{'
    )                : end negative lookahead
    (?!              : begin negative lookahead
      \ *SEND COB\ * : match 'SEND COB' surrounded by 0+ spaces
      \{             : match '{'
    )                : end negative lookahead
    .*\r?\n          : match line including terminators
  )                  : end non-capture group
  *                  : execute non-capture group 0+ times
  ^                  : match beginning of line
  \ *PROCEDURE\ *    : match 'PROCEDURE' surrounded by 0+ spaces 
)                    : end positive lookbehind
\{[^\}]*\}           : match '{', 0+ characters other than '}', '}' 

I've escaped space characters above to improve readability.

For "eob", use this slightly modified regex:

/(?<=^ *SEND EOB +\{\r?\n(?:^(?! *SEND CLAIM *\{)(?! *SEND COB *\{).*\r?\n)*^ *PROCEDURE *)\{[^\}]*\}/gm

I've made no attempt to do the same for "cob", as that part has a different structure from "claim" and "eob" and it is not clear to me how it should be treated.
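As a rough sketch of the "little JavaScript code" mentioned earlier, the two patterns can be applied with the m flag (so that ^ matches at every line start) and the g flag (so that all matches are returned), and the matched strings stored per key. The shortened sample and the variable names are assumptions made for illustration.

```javascript
// Shortened sample laid out like the question's data (an assumption).
const sample = [
  "SEND CLAIM {",
  '   PREFIX="9403 "',
  "   PROCEDURE { /* #1 */",
  '      PROCEDURE_LINE="1"',
  '      PROCEDURE_CODE="01201"',
  "   }",
  "   PROCEDURE { /* #2 */",
  '      PROCEDURE_LINE="2"',
  '      PROCEDURE_CODE="02102"',
  "   }",
  "}",
  "SEND EOB {",
  '   PREFIX="9403 "',
  "   PROCEDURE { /* #1 */",
  '      PROCEDURE_LINE="1"',
  '      ELIGIBLE="002750"',
  "   }",
  "}",
].join("\n");

// The "claim" and "eob" patterns from this answer, with the g and m flags set.
const claimRe = /(?<=^ *SEND CLAIM +\{\r?\n(?:^(?! *SEND EOB *\{)(?! *SEND COB *\{).*\r?\n)*^ *PROCEDURE *)\{[^\}]*\}/gm;
const eobRe = /(?<=^ *SEND EOB +\{\r?\n(?:^(?! *SEND CLAIM *\{)(?! *SEND COB *\{).*\r?\n)*^ *PROCEDURE *)\{[^\}]*\}/gm;

// match() returns null when nothing matches, so fall back to an empty array.
const result = {
  claim: sample.match(claimRe) || [],
  eob: sample.match(eobRe) || [],
};

console.log(result.claim.length, result.eob.length);
```

Each array element is the raw "{ ... }" text of one PROCEDURE block; turning those strings into objects is a separate step.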

A final note, should it not be obvious: it would be far easier to extract the strings of interest using conventional code with loops and, possibly, simple regular expressions, but I hope some readers find my answer instructive about some elements of regular expressions.
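To make the loop-based alternative concrete, here is a minimal sketch of a line-by-line scan that tracks brace depth. It assumes every "SEND X {", "PROCEDURE {" and "}" sits on its own line, as in the question's sample, and it skips nested blocks such as PRIME_EOB; extractProcedures is a hypothetical helper name.

```javascript
// Collect the top-level PROCEDURE blocks of each SEND section.
function extractProcedures(text) {
  const result = {};
  let section = null; // current top-level key, e.g. "claim"
  let current = null; // PROCEDURE object being filled, if any
  let depth = 0;      // brace nesting depth

  for (const line of text.split(/\r?\n/)) {
    // Drop /* ... */ comments, then surrounding whitespace.
    const trimmed = line.replace(/\/\*.*?\*\//g, "").trim();
    const send = trimmed.match(/^SEND (\w+)\s*\{$/);

    if (depth === 0 && send) {
      section = send[1].toLowerCase();
      result[section] = [];
      depth = 1;
    } else if (depth === 1 && /^PROCEDURE\s*\{$/.test(trimmed)) {
      current = {};
      depth = 2;
    } else if (trimmed.endsWith("{")) {
      depth += 1; // nested block, e.g. PRIME_EOB=SEND EOB {
    } else if (trimmed === "}") {
      if (depth === 2 && current) {
        result[section].push(current);
        current = null;
      }
      depth -= 1;
    } else if (current) {
      const pair = trimmed.match(/^(\w+)="([^"]*)"/);
      if (pair) current[pair[1]] = pair[2];
    }
  }
  return result;
}

const sample = [
  "SEND CLAIM {",
  '   PREFIX="9403 "',
  "   PROCEDURE { /* #1 */",
  '      PROCEDURE_LINE="1"',
  '      PROCEDURE_CODE="01201"',
  "   }",
  "}",
  "SEND COB {",
  "   PROCEDURE { /* #1 */",
  '      PROCEDURE_LINE="1"',
  '      DATE="19990104"',
  "   }",
  "   PRIME_EOB=SEND EOB {",
  "      PROCEDURE { /* #1 */",
  '         ELIGIBLE="002750"',
  "      }",
  "   }",
  "}",
].join("\n");

const procedures = extractProcedures(sample);
console.log(JSON.stringify(procedures, null, 2));
```

Because the scan only records PROCEDURE blocks at depth 1, the nested PRIME_EOB procedures are ignored; whether that is the desired treatment of "cob" is for the asker to decide.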

Will CLAIM, EOB and COB always be in the same order? If so, you can split the text before applying the regex you already have:

const procRegex = /(^\s+PROCEDURE\s\{)([\S\s]*?)(?:})/gm;

let claimData = data.split("EOB")[0];
let claimProcedures = claimData.match(procRegex);

let eobData = data.split("COB")[0].split("EOB")[1];
let eobProcedures = eobData.match(procRegex);

let cobData = data.split("COB")[1];
let cobProcedures = cobData.match(procRegex);

// If you want to leave out the PRIME_EOB, you can split COB again
cobData = cobData.split("EOB")[0];
cobProcedures = cobData.match(procRegex);

console.log(claimProcedures);

Output:

[
  '       PROCEDURE { /* #1  */\n' +
    '          PROCEDURE_LINE="1"\n' +
    '          PROCEDURE_CODE="01201"\n' +
    '        \n' +
    '       }',
  '       PROCEDURE { /* #2  */\n' +
    '          PROCEDURE_LINE="2"\n' +
    '          PROCEDURE_CODE="02102"\n' +
    '         \n' +
    '       }',
  '       PROCEDURE { /* #3  */\n' +
    '          PROCEDURE_LINE="3"\n' +
    '          PROCEDURE_CODE="21222"\n' +
    '       \n' +
    '       }'
]
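The matches above are still strings, whereas the question asks for objects. A minimal sketch of one way to convert them, assuming every property has the form KEY="value"; toProcedureObject is a hypothetical helper, not part of the answer's code.

```javascript
// Convert one matched 'PROCEDURE { ... }' string into a plain object.
function toProcedureObject(procedureText) {
  const pairs = [...procedureText.matchAll(/(\w+)="([^"]*)"/g)]
    .map(([, key, value]) => [key, value]);
  return Object.fromEntries(pairs);
}

// One entry in the shape printed above.
const claimProcedures = [
  '       PROCEDURE { /* #1  */\n' +
    '          PROCEDURE_LINE="1"\n' +
    '          PROCEDURE_CODE="01201"\n' +
    '       }',
];

const claim = claimProcedures.map(toProcedureObject);
console.log(claim);
```

The same mapping could be applied to eobProcedures and cobProcedures to build the full object.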

As an alternative, your data is not terribly far from valid JSON, so you could work with that. The code below translates the data into JSON, then parses it into a JavaScript object that you can use however you want.

/* data cannot have Javascript comments in it for this to work, or you need
   another regex to remove them */

data = data.replace(/=/g, ":") // replace = with :
  .replace(/\s?{/g, ": {") // replace { with : {
  .replace(/SEND/g, "") // remove "SEND"
  .replace(/\"\s*$(?!\s*\})/gm, "\",") // add commas after object properties
  .replace(/}(?=\s*\w)/g, "},") // add commas after objects
  .replace(/(?<!\}),\s*PROCEDURE: /g, ",\nPROCEDURES: [") // start procedures list
  .replace(/(PROCEDURE:[\S\s]*?\})\s*(?!,\s*PROCEDURE)/g, "$1]\n") // end list
  .replace(/PROCEDURE: /g, "") // remove "PROCEDURE"
  .replace("PRIME_EOB: EOB:", "PRIME_EOB:") // replace double key with single key. Is this the behavior you want?
  .replace(/(\S*):/g, "\"$1\":") // put quotes around object key names

let dataObj = JSON.parse("{" + data + "}");

console.log(dataObj.CLAIM.PROCEDURES);

Output:

[ { PROCEDURE_LINE: '1', PROCEDURE_CODE: '01201' },
  { PROCEDURE_LINE: '2', PROCEDURE_CODE: '02102' },
  { PROCEDURE_LINE: '3', PROCEDURE_CODE: '21222' } ]

What you are trying to do is write a parser for the syntax used in your text file. That syntax looks much like JSON. I would recommend modifying it with regexps to get valid JSON and then parsing the result with the JavaScript JSON parser. The parser is able to handle recursion. At the end you will have a JavaScript object from which you can remove, or to which you can add, whatever you need. In addition, the hierarchy of the source will be preserved.

This code does the job for the provided example:

let data = `    SEND CLAIM {
// your text file contents
}`;

// handle PRIME_EOB=SEND EOB {
var regex = /(\w+)=\w+.*{/gm;
var replace = data.replace(regex, "$1 {");

// append double quotes in lines like PROCEDURE_LINE="1"
var regex = /(\w+)=/g;
var replace = replace.replace(regex, "\"$1\": ");

// append double quotes in lines like PROCEDURE {
var regex = /(\w+.*)\s{/g;
var replace = replace.replace(regex, "\"$1\": {");

// remove comments: /* */
var regex = /\/\**.*\*\//g;
var replace = replace.replace(regex, "");

// append commas to lines i.e. "PROCEDURE_LINE": "2"
var regex = /(\".*\":\s*\".*\")/gm;
var replace = replace.replace(regex, "$1,");

// append commas to '}'
var regex = /^.*}.*$/gm;
var replace = replace.replace(regex, "},");

// remove trailing commas
var regex = /\,(?!\s*?[\{\[\"\'\w])/g;
var replace = replace.replace(regex, "");

// surround with {}
replace = "{" + replace + "}";

console.log(replace);
var obj = JSON.parse(replace);
console.log(obj);

The JSON looks like this snippet:

{    "SEND CLAIM": {
       "PREFIX": "9403        ",
       "PROCEDURE": { 
          "PROCEDURE_LINE": "1",
          "PROCEDURE_CODE": "01201"
        
},
       "PROCEDURE": { 
          "PROCEDURE_LINE": "2",
          "PROCEDURE_CODE": "02102"

The final object can then be inspected in the debugger.

It is not completely clear to me what your final array or object should look like, but from here I expect it will take only a little effort to produce what you want.
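One caveat worth checking before building on this: JSON.parse keeps only the last value when a key repeats within an object, so repeated "PROCEDURE" keys collapse to a single entry. A minimal sketch of one possible workaround is to number the keys before parsing and gather them into an array afterwards. The shortened JSON sample, the "PROCEDURE#n" naming and the proceduresOf helper are all assumptions made for illustration.

```javascript
// JSON text with repeated keys, shortened (an assumed sample).
const json = `{
  "SEND CLAIM": {
    "PREFIX": "9403 ",
    "PROCEDURE": { "PROCEDURE_LINE": "1", "PROCEDURE_CODE": "01201" },
    "PROCEDURE": { "PROCEDURE_LINE": "2", "PROCEDURE_CODE": "02102" }
  }
}`;

// Parsed directly, only the last PROCEDURE survives.
const lossy = JSON.parse(json);
console.log(lossy["SEND CLAIM"].PROCEDURE.PROCEDURE_LINE);

// Give every "PROCEDURE" key a unique suffix before parsing.
// The pattern requires the closing quote, so "PROCEDURE_LINE" is untouched.
let counter = 0;
const numbered = json.replace(/"PROCEDURE":/g, () => `"PROCEDURE#${counter++}":`);
const parsed = JSON.parse(numbered);

// Collect the numbered entries of one section back into an array.
function proceduresOf(section) {
  return Object.entries(section)
    .filter(([key]) => key.startsWith("PROCEDURE#"))
    .map(([, value]) => value);
}

const result = { claim: proceduresOf(parsed["SEND CLAIM"]) };
console.log(result.claim);
```

The same helper could be applied to the "SEND EOB" and "SEND COB" sections to fill in the eob and cob arrays.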
