
Split a line via regex in JavaScript?

I have text with this structure:

1.6.1 Members................................................................ 12
1.6.2 Accessibility.......................................................... 13
1.6.3 Type parameters........................................................ 13
1.6.4 The T generic type aka <T>............................................. 13

I need to create JS objects:

{ 
  num:"1.6.1",
  txt:"Members"
},
{ 
  num:"1.6.2",
  txt:"Accessibility"
} ...

That's not a problem.

The problem is that I want to extract the values with a regex split that uses a positive lookahead:

Split at the first position where the next character is a letter.


What I have tried:

'1.6.1 Members........... 12'.split(/\s(?=(?:[\w\. ])+$)/i)

This works fine:

["1.6.1", "Members...........", "12"] // I don't care about the 12.

But if I have two or more words:

'1.6.3 Type parameters................ 13'.split(/\s(?=(?:[\w\. ])+$)/i)

The result is :

["1.6.3", "Type", "parameters................", "13"] //again I don't care about 13.

Of course I can join them, but I want the words to stay together.

Question:

How can I enhance my regex so that it does NOT split the words?

Desired result :

["1.6.3", "Type parameters"]

or

["1.6.3", "Type parameters........"] // I will remove extras later

or

["1.6.3", "Type parameters........13"] // I will remove extras later

NB:

I know I can split on " " or use some other, simpler solution, but I'm looking (purely for the sake of knowledge) for an enhancement to my own solution, which uses a split on a positive lookahead.

NB 2:

The text can also contain capital letters in the middle.

You can use this regex:

/^(\d+(?:\.\d+)*) (\w+(?: \w+)*)/gm

And get your desired matches using matched group #1 and matched group #2.

Online Regex Demo
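As a rough sketch of how groups #1 and #2 could be turned into the objects the question asks for (the helper name parseToc and the shortened sample text are assumptions for illustration, not part of the answer):

```javascript
// Shortened sample input in the same shape as the question's text.
const toc = [
  '1.6.1 Members................ 12',
  '1.6.2 Accessibility.......... 13',
  '1.6.3 Type parameters........ 13',
].join('\n');

// The pattern from the answer: group 1 is the section number,
// group 2 is one or more words separated by single spaces.
const tocPattern = /^(\d+(?:\.\d+)*) (\w+(?: \w+)*)/gm;

// parseToc is a hypothetical helper name used only in this sketch.
function parseToc(text) {
  const entries = [];
  // matchAll works on a copy of the regex, so lastIndex is not shared.
  for (const m of text.matchAll(tocPattern)) {
    entries.push({ num: m[1], txt: m[2] });
  }
  return entries;
}

console.log(parseToc(toc));
```

Note that \w does not match characters such as <, so with this pattern a title like "The T generic type aka <T>" would be cut off after "aka".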

Update: For String#split you can use this regex:

/ +(?=[A-Z\d])/g

Regex Demo
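A minimal sketch of that split in use, assuming the title contains no capital letter or digit after a space (the variable names are illustrative only):

```javascript
const line = '1.6.3 Type parameters................ 13';

// Split on runs of spaces that are followed by a capital letter or a digit.
// The lookahead does not consume the following character, so it stays in the piece.
const parts = line.split(/ +(?=[A-Z\d])/);
console.log(parts);

// Build the object from the first two pieces and strip the dot leader.
const entry = { num: parts[0], txt: parts[1].replace(/\.+$/, '') };
console.log(entry);
```

The trailing page number ends up in parts[2] and can simply be ignored.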

Update 2: Since chapter names may also contain capital letters, the following more complex regex is needed:

var re = /(\D +(?=[a-z]))| +(?=[a-z\d])/gmi; 
var str = '1.6.3 Type Foo Bar........................................................ 13';
var m = str.split( re );
console.log(m[0], ',', m.slice(1, -1).join(''), ',', m.pop() );

//=> 1.6.3 , Type Foo Bar........................................................ , 13

EDIT: Since you added 1.6.1 The .net 4.5 framework.... to the requirements, we can tweak the answer to this:

^([\d.]+) ((?:[^.]|\.(?!\.))+)

And if you want to allow sequences of up to three dots in the title, as in 1.6.1 She said... Boo!..........., it's an easy tweak from there (a {3} quantifier inside the lookahead):

^([\d.]+) ((?:[^.]|\.(?!\.{3}))+)
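To compare how the three variants treat dots inside a title, here is a small sketch. The two sample lines (including their page numbers) and the helper name titleOf are made up for illustration:

```javascript
const patterns = {
  original: /^([\d.]+) ([^.]+)/,                       // title stops at the first dot
  singleDots: /^([\d.]+) ((?:[^.]|\.(?!\.))+)/,          // allows a dot not followed by a dot
  upToThreeDots: /^([\d.]+) ((?:[^.]|\.(?!\.{3}))+)/,    // allows a dot not followed by three dots
};

// titleOf is a hypothetical helper: returns group 2, or null when nothing matches.
function titleOf(pattern, line) {
  const m = pattern.exec(line);
  return m ? m[2] : null;
}

const dotNet = '1.6.1 The .net 4.5 framework.... 14';
const ellipsis = '1.6.1 She said... Boo!........... 15';

console.log(titleOf(patterns.original, dotNet));        // "The " (cut at the first dot)
console.log(titleOf(patterns.singleDots, dotNet));      // "The .net 4.5 framework"
console.log(titleOf(patterns.singleDots, ellipsis));    // "She said"
console.log(titleOf(patterns.upToThreeDots, ellipsis)); // "She said... Boo!"
```

The last variant relies on the dot leader being at least four dots long; a shorter leader would be absorbed into the title.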

Original:

^([\d.]+) ([^.]+)

In the regex demo, see the Groups in the right pane.

To retrieve Groups 1 and 2, something like:

var myregex = /^([\d.]+) ((?:[^.]|\.(?!\.))+)/mg;
var theMatchObject = myregex.exec(yourString);
while (theMatchObject != null) {
    // the numbers: theMatchObject[1]
    // the title: theMatchObject[2]
    theMatchObject = myregex.exec(yourString);
}

OUTPUT

Group 1     Group 2
1.6.1       Members
1.6.2       Accessibility
1.6.3       Type parameters
1.6.4       The T generic type aka <T>
1.6.1       The .net 4.5 framework

Explanation

  • ^ asserts that we are at the beginning of the line
  • The parentheses in ([\d.]+) capture digits and dots to Group 1
  • The parentheses in ((?:[^.]|\.(?!\.))+) capture to Group 2...
  • [^.] one char that is not a dot, | OR...
  • \.(?!\.) a dot that is not followed by a dot...
  • + one or more times
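Putting the pieces explained above together, one possible way to collect both groups into objects is sketched below. The helper name toEntries is an assumption, and the sample text is a shortened copy of the question's input:

```javascript
const text = [
  '1.6.1 Members................ 12',
  '1.6.3 Type parameters........ 13',
  '1.6.4 The T generic type aka <T>...... 13',
].join('\n');

// Group 1: digits and dots. Group 2: any run of non-dot characters
// or single dots that are not followed by another dot.
const linePattern = /^([\d.]+) ((?:[^.]|\.(?!\.))+)/mg;

// toEntries is a hypothetical helper name used only in this sketch.
function toEntries(input) {
  const result = [];
  for (const m of input.matchAll(linePattern)) {
    result.push({ num: m[1], txt: m[2] });
  }
  return result;
}

console.log(toEntries(text));
```

Because Group 2 is defined by what it excludes rather than by \w, titles containing characters such as < and > are kept whole.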

You can use this pattern too:

var myStr = "1.6.1 Members................................................................ 12\n1.6.2 Accessibility.......................................................... 13\n1.6.3 Type parameters........................................................ 13\n1.6.4 The T generic type aka <T>............................................. 13";

console.log(myStr.split(/ (.+?)\.{2,} ?\d+$\n?/m));
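Because the pattern contains a capture group, split interleaves the captured titles with the numbers and leaves an empty string at the end. A sketch of pairing them up into objects, using a shortened sample and illustrative variable names:

```javascript
const sampleText = [
  '1.6.1 Members................ 12',
  '1.6.3 Type parameters........ 13',
  '1.6.4 The T generic type aka <T>...... 13',
].join('\n');

// Same pattern as above: the separator is " title.... page\n",
// and the captured title is inserted into the result array.
const pieces = sampleText.split(/ (.+?)\.{2,} ?\d+$\n?/m);

// pieces alternates number, title, number, title, ..., then a trailing "".
const objects = [];
for (let i = 0; i + 1 < pieces.length; i += 2) {
  objects.push({ num: pieces[i], txt: pieces[i + 1] });
}

console.log(objects);
```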

About a lookahead-only approach:

I don't think it is possible. The only way to skip a character (here, the space between two words) is to consume it as part of the match that starts at an earlier space (the one between the number and the first word). In other words, you rely on the fact that a character cannot be matched more than once.

But if everything in the pattern except the space where you want to split is enclosed in a lookahead, the substring checked by the lookahead is not part of the match result (it is only a check, and the corresponding characters are not consumed by the regex engine). So you cannot skip the following spaces, and the regex engine will carry on to the next space character.
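The difference can be seen in a small sketch. The first two patterns are illustrative variations on the ones in this thread. The third uses a lookbehind rather than a lookahead; it is not part of the original answers and assumes a runtime with ES2018 lookbehind support:

```javascript
const sample = '1.6.3 Type parameters........ 13';

// Everything after the space sits inside a lookahead, so nothing is consumed
// and the engine matches again at the next space inside the title.
const lookaheadOnly = sample.split(/ (?=.*\.{2,} ?\d+$)/);
console.log(lookaheadOnly);

// Consuming the title as part of the match means its inner space
// cannot start a second match.
const consuming = sample.split(/ (.+?)\.{2,} ?\d+$/);
console.log(consuming);

// Assumption: lookbehind is available (ES2018). Only the space directly
// preceded by the leading section number qualifies as a split point.
const lookbehind = sample.split(/(?<=^[\d.]+) /);
console.log(lookbehind);
```

The lookbehind version keeps the split zero-context on the right, which is close in spirit to what the question asks for, but it depends on engine support and is offered only as an assumption-laden alternative.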
