
Cleaning a String with JavaScript

I have a small Node script that scrapes a web page. From that page I extract an array of strings.

I am trying to clean up those strings (currently with regex and string.replace).

One example String looks like this:

2 Glücklich sind die,*die seine Erinnerungen* beachten,+die mit ganzem Herzen nach ihm suchen.+\n

My cleaning code looks like this.

string.replace(/\+/g, '').replace(/\*/g, '').replace('\n', '').replace(/(^\d+)/g, '').trim()

The first call removes all "+", the second removes all "*", the third removes the newline, and the last removes the leading number.

Most things work fine, but I have some edge cases. This is my result:

2 Glücklich sind die,die seine Erinnerungen beachten,die mit ganzem Herzen nach ihm suchen.

Problems:

  1. The leading number was not removed. (When the number has two or more digits it is always removed; I have no idea why a single digit stays.)
  2. The first "*" was removed, but because no whitespace followed it, the space between the words is gone. The second "*" was followed by whitespace, so there is no problem there.
  3. The same issue occurs with "+": no whitespace follows it, so the words stick together.

My goal is to parse every string correctly. I have thousands of strings with different combinations, but the only special characters are "+", "*", "\n" and the leading number.

The String should look like this:

Glücklich sind die, die seine Erinnerungen beachten, die mit ganzem Herzen nach ihm suchen.

Hopefully someone has an idea how to accomplish that.

You could use an alternation | with a character class [+*\n] to match any one of those characters, or ^\d+ to match one or more digits at the start of the string.

[+*\n]|^\d+


Use a space as the replacement. Afterwards, collapse every run of two or more spaces into a single space and trim the result.

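    // Match "+", "*", a newline, or the digits at the start of the string.
    let pattern = /[+*\n]|^\d+/g;
    let string = "2 Glücklich sind die,*die seine Erinnerungen* beachten,+die mit ganzem Herzen nach ihm suchen.+\n";

    string = string
      .replace(pattern, " ")    // every match becomes a space
      .replace(/[ ]{2,}/g, " ") // collapse repeated spaces
      .trim();

    console.log(string);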


If the digits at the start of the string can be preceded by optional whitespace characters, you could match those as well by matching zero or more whitespace characters other than a newline: ^[^\S\r\n]*\d+

    // [^\S\r\n] matches whitespace other than CR and LF.
    let pattern = /[+*\n]|^[^\S\r\n]*\d+/g;
    let string = " 2 Glücklich sind die,*die seine Erinnerungen* beachten,+die mit ganzem Herzen nach ihm suchen.+\n";

    string = string
      .replace(pattern, " ")    // every match becomes a space
      .replace(/[ ]{2,}/g, " ") // collapse repeated spaces
      .trim();

    console.log(string);
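Since the question mentions thousands of strings, the pattern above can be wrapped in a function and applied to the whole array with Array.prototype.map. This is only a minimal sketch: the names cleanLine and cleanAll are illustrative, and the second sample string is made up to show a number preceded by a tab.

```javascript
// Hypothetical helpers; cleanLine and cleanAll are illustrative names.
const PATTERN = /[+*\n]|^[^\S\r\n]*\d+/g;

function cleanLine(line) {
  return line
    .replace(PATTERN, " ")  // markers and the leading number become a space
    .replace(/ {2,}/g, " ") // collapse the runs of spaces this creates
    .trim();
}

function cleanAll(lines) {
  return lines.map(cleanLine);
}

const samples = [
  "2 Glücklich sind die,*die seine Erinnerungen* beachten,+die mit ganzem Herzen nach ihm suchen.+\n",
  "\t12 Zweite Zeile,+ohne Leerzeichen.*\n", // made-up sample, not from the question
];

console.log(cleanAll(samples));
```

String.prototype.replace starts a global regex from index 0 on every call, so sharing one PATTERN constant across all strings is safe here.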

Perhaps this?

    let str = `2 Glücklich sind die,*die seine Erinnerungen* beachten,+die mit ganzem Herzen nach ihm suchen.+\n`
    str = str.replace(/[*+]/g, " ") // "*" and "+" become spaces
      .replace(/^\d+\s*/, "")       // drop the leading number
      .replace(/\n/g, "")           // drop newlines
      .replace(/\s{2,}/g, " ")      // collapse repeated whitespace
      .trim()
    console.log(str)

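You can strip the leading number and all the special characters with a fairly short regex and a single call to String.prototype.replace. Because every match is replaced with the empty string, no space is inserted where a "+" or "*" was the only separator between two words: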

    let cleanStr = str => str.replace(/^[0-9\s]*|[+*\r\n]/g, '');

    console.log(cleanStr('2 Glücklich sind die,die seine Erinnerungen beachten,+die mit ganzem Herzen nach ihm suchen.+\n'));

This regex matches either ^[0-9\s]* or [+*\r\n], and every match is replaced with the empty string.

^[0-9\s]* matches any run of consecutive digits or whitespace characters at the beginning of the string.

[+*\r\n] matches any "+", "*", or newline character (including \r, which can matter in Windows environments) anywhere in the string.
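A single replacement string cannot both delete the leading number and put a space where a "+" or "*" was. One possible variant, sketched here and not part of the original answer, passes a replacer callback to replace and then normalizes the spaces. The name cleanWithCallback is hypothetical.

```javascript
// Hypothetical helper: same regex, but a callback chooses the replacement.
function cleanWithCallback(str) {
  return str
    .replace(/^[0-9\s]*|[+*\r\n]/g, (match) =>
      // A lone "+", "*", CR or LF becomes a space; the leading
      // run of digits/whitespace (or an empty match) is dropped.
      /^[+*\r\n]$/.test(match) ? " " : ""
    )
    .replace(/ {2,}/g, " ") // collapse the doubled spaces this creates
    .trim();
}

console.log(
  cleanWithCallback(
    "2 Glücklich sind die,*die seine Erinnerungen* beachten,+die mit ganzem Herzen nach ihm suchen.+\n"
  )
);
// prints "Glücklich sind die, die seine Erinnerungen beachten, die mit ganzem Herzen nach ihm suchen."
```

The trade-off is a second replace call for the space cleanup; the callback only decides, per match, between a space and nothing.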
