Regex.Split() sentence to words while preserving whitespace

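I'm using Regex.Split() to take user input and turn it into a list of individual words, but at the moment it removes any spaces the user added. I would like it to preserve the whitespace.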

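string[] newInput = Regex.Split(updatedLine, @"\s+");

using System;
using System.Text.RegularExpressions;

string text = "This            is some text";
// Split at each position that follows a non-whitespace character and precedes whitespace.
string[] splits = Regex.Split(text, @"(?=(?<=[^\s])\s+)");

foreach (string item in splits)
    Console.WriteLine("[" + item + "]"); // brackets make the preserved leading spaces visible
Console.WriteLine(splits.Length);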

This gives you 4 splits, each with all of its leading spaces preserved.

(?=\s+)

This means: split at any point that has whitespace ahead of it. Used alone, it creates 15 splits on the sample text, because inside a run of repeated spaces every position still has a space ahead of it.

(?=(?<=[^\s])\s+)

This means: split at a point that has a non-whitespace character before it and whitespace ahead of it.
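To make the difference between the two patterns concrete, here is a minimal sketch that counts the pieces each one produces on the sample text. The class and method names (LookaheadSplitDemo, SplitLookaheadOnly, SplitAtRunStart) are made up for illustration; only Regex.Split and the two patterns come from the answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LookaheadSplitDemo
{
    // Hypothetical helper: splits before every whitespace character (lookahead only).
    public static string[] SplitLookaheadOnly(string text) =>
        Regex.Split(text, @"(?=\s+)");

    // Hypothetical helper: splits only where a non-whitespace character
    // comes immediately before the run of whitespace.
    public static string[] SplitAtRunStart(string text) =>
        Regex.Split(text, @"(?=(?<=[^\s])\s+)");

    public static void Main()
    {
        // "This", 12 spaces, then "is some text" (the sample text from the answer).
        string text = "This" + new string(' ', 12) + "is some text";

        Console.WriteLine(SplitLookaheadOnly(text).Length); // 15
        Console.WriteLine(SplitAtRunStart(text).Length);    // 4

        // Concatenating the pieces gives back the input, so no whitespace is lost.
        Console.WriteLine(string.Concat(SplitAtRunStart(text)) == text); // True
    }
}
```

Because both patterns are zero-width, neither consumes any characters; the lookbehind only reduces the number of split points.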

If the text starts with whitespace and you also want a split point at the very start of the string, change the expression to the following:

(?=(?<=^|[^\s])\s+)

This means the run of spaces must be preceded by either a non-whitespace character or the start of the string.
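A short sketch of what the start-of-string alternative changes in practice, assuming an input that begins with two spaces. The class and method names are hypothetical. Note that Regex.Split in .NET puts an empty string at the front of the result when a match occurs at index 0, so the leading spaces stay attached to the first word and an extra empty element appears before it.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LeadingWhitespaceDemo
{
    // Pattern from the answer: split point needs a non-whitespace character before it.
    public static string[] SplitOriginal(string text) =>
        Regex.Split(text, @"(?=(?<=[^\s])\s+)");

    // Modified pattern: the start of the string also counts as a valid split point.
    public static string[] SplitWithStartAnchor(string text) =>
        Regex.Split(text, @"(?=(?<=^|[^\s])\s+)");

    public static void Main()
    {
        string text = "  leading spaces"; // illustrative input with two leading spaces

        // Two pieces: [  leading] and [ spaces]
        foreach (string piece in SplitOriginal(text))
            Console.WriteLine("[" + piece + "]");

        // Three pieces: [] then [  leading] and [ spaces]
        foreach (string piece in SplitWithStartAnchor(text))
            Console.WriteLine("[" + piece + "]");
    }
}
```

If the empty first element is not wanted, the original pattern already keeps the leading whitespace with the first word.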

I'm guessing that some of the "words" you're interested in are actually phrases where spaces are acceptable. You can't easily use the space character as both a phrase delimiter and an allowable character within the phrase itself. Try using a comma for a delimiter instead:

string updatedLine = "user,input,two words,even three words";
string[] newInput = Regex.Split(updatedLine, @",");

This version of the regex allows trailing spaces after the commas:

string updatedLine = "user, input,   two words,    even three words";
string[] newInput = Regex.Split(updatedLine, @",\s+|,");
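For reference, a small runnable sketch of the comma-delimited approach. The class and method names are hypothetical, and the second pattern, `,\s*`, is not from the answer: it is an assumed shorter spelling that should split the same way for this input.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CommaSplitDemo
{
    // Pattern from the answer: a comma followed by whitespace, or a bare comma.
    public static string[] SplitOnCommas(string line) =>
        Regex.Split(line, @",\s+|,");

    // Assumed equivalent: a comma followed by zero or more whitespace characters.
    public static string[] SplitOnCommasCompact(string line) =>
        Regex.Split(line, @",\s*");

    public static void Main()
    {
        string updatedLine = "user, input,   two words,    even three words";

        // Prints [user], [input], [two words], [even three words], one per line.
        foreach (string phrase in SplitOnCommas(updatedLine))
            Console.WriteLine("[" + phrase + "]");

        Console.WriteLine(SplitOnCommasCompact(updatedLine).Length); // 4
    }
}
```

Spaces inside a phrase survive because only the comma and the whitespace directly after it are treated as the delimiter.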
