How can I retain spaces between words that don't contain numbers when splitting a multi-word string?

Is it possible to split a string on spaces but keep names together?

Example:

"1 23565 john smith 01/01/2021 another"

Expected:

string[] {"1", "23565", "john smith", "01/01/2021", "another"}

Names in this case are any words in the string that don't contain numbers. "Name" words are always preceded and followed by "number" words.

You can try regular expressions, e.g.

 using System.Text.RegularExpressions;

 ...

 string source = "1 23565 john smith 01/01/2021 another";

 string[] result = Regex.Split(source, @"(?<=\P{L})\s+|\s+(?=\P{L})");

 // Let's have a look:
 Console.WriteLine(string.Join(", ", result));

Outcome:

 1, 23565, john smith, 01/01/2021, another

Here I've used the pattern (?<=\P{L})\s+|\s+(?=\P{L}):

(?<=\P{L})\s+ - look behind (not a letter) 
                then one or more whitespaces
|             - or
\s+(?=\P{L})  - one or more whitespaces and 
                then (look ahead) not a letter
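
To see how this pattern behaves on other inputs, here is a small sketch that wraps it in a helper (the name SplitOnNonLetterBoundaries is made up for illustration). Since the lookarounds only inspect the single character next to the whitespace, a word such as "1john" that ends in a letter stays attached to a following all-letter word:

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper name; the pattern is the one explained above.
string[] SplitOnNonLetterBoundaries(string source) =>
    Regex.Split(source, @"(?<=\P{L})\s+|\s+(?=\P{L})");

// The input from the question.
Console.WriteLine(string.Join(", ",
    SplitOnNonLetterBoundaries("1 23565 john smith 01/01/2021 another")));
// 1, 23565, john smith, 01/01/2021, another

// Edge case: the whitespace between "1john" and "smith" has a letter
// on both sides, so neither lookaround matches and no split happens.
Console.WriteLine(string.Join(", ",
    SplitOnNonLetterBoundaries("1 23565 1john smith 01/01/2021 another")));
// 1, 23565, 1john smith, 01/01/2021, another
```

Whether the second result is acceptable depends on whether words like "1john" can occur in your data.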

This seems like an XY problem, so I'm going to focus on solving X instead of Y.

No and yes.

There may be a regex that does this for you, but it adds complexity and overhead, and I don't know regex well enough to offer an example.

If you use just a plain string.Split, you can't avoid splitting the name. However, if you know that the string will always be formatted exactly the same way, you can concatenate the two name elements of the split result (indices 2 and 3) back together. This is a manual process, and it will break if the string format ever changes.

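using System.Collections.Generic;
using System.Linq;

string value = "1 23565 john smith 01/01/2021 another";
// Split returns string[], so copy it into a List to allow removal.
List<string> values = value.Split(' ').ToList();
values[2] += " " + values[3];   // "john" + " " + "smith"
values.RemoveAt(3);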
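
If the position of the name is not fixed, the same merge can be done by inspecting the tokens instead of hard-coding indices. The following is only a sketch under the question's definition of a name (a word with no digits); the helper name SplitKeepingNames is invented for this example:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: split on spaces, then glue together
// neighbouring tokens that contain no digits.
List<string> SplitKeepingNames(string value)
{
    var result = new List<string>();
    bool previousWasName = false;

    foreach (string token in value.Split(' ', StringSplitOptions.RemoveEmptyEntries))
    {
        bool isName = !token.Any(char.IsDigit);

        if (isName && previousWasName)
            result[result.Count - 1] += " " + token;  // extend the current name
        else
            result.Add(token);                        // start a new entry

        previousWasName = isName;
    }

    return result;
}

Console.WriteLine(string.Join(", ",
    SplitKeepingNames("1 23565 john smith 01/01/2021 another")));
// 1, 23565, john smith, 01/01/2021, another
```

Note that this version collapses runs of spaces inside a name into a single space, which may or may not be what you want.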

A better option might be to avoid this altogether by using JSON or XML.

Both are data-interchange formats that let you keep individual pieces of data separate while still transferring them together.

Examples:

// JSON (property names must be quoted)
{
  "id": 1,
  "randomNumber": 23565,
  "name": "john smith",
  "hireDate": "01/01/2021",
  "description": "another"
}

// XML (a document needs a single root element; "record" is a placeholder name)
<?xml version="1.0" encoding="UTF-8"?>
<record>
  <id>1</id>
  <randomNumber>23565</randomNumber>
  <name>john smith</name>
  <hireDate>01/01/2021</hireDate>
  <description>another</description>
</record>

As you might notice, XML is a bit more verbose. It's still a very popular format for data transport, but JSON's smaller size and easier-to-read format are leading people to move more projects over to JSON.

There are plenty of existing libraries to convert data into and out of both of these formats, so you don't have to write that conversion yourself. .NET itself has some built-in support for these formats, but it has limits and caveats that tend to make people use external libraries (commonly found on NuGet).
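
As a rough illustration of the built-in route, the sketch below reads two fields with System.Text.Json (part of .NET Core 3.0 and later). It assumes the payload is strict JSON with quoted property names; the variable names are only for illustration:

```csharp
using System;
using System.Text.Json;

// Sample payload, assumed to be strict JSON (quoted property names).
string json = @"{
  ""id"": 1,
  ""randomNumber"": 23565,
  ""name"": ""john smith"",
  ""hireDate"": ""01/01/2021"",
  ""description"": ""another""
}";

// Parse once, then read individual properties by name.
using JsonDocument document = JsonDocument.Parse(json);
JsonElement root = document.RootElement;

int id = root.GetProperty("id").GetInt32();
string name = root.GetProperty("name").GetString();

Console.WriteLine($"{id}: {name}");
// 1: john smith
```

The name arrives as one value, spaces included, so no splitting logic is needed at all.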

And with the generally fast internet speeds people have even at home and on mobile, the small amount of extra overhead these formats add is usually outweighed by how easy they are to work with.

We are asked to split the string on spaces other than those between names, where a name is any word in the string that does not contain a number.

One can split on matches of the following regular expression:

(?<=(?:^| )\S*\d\S*) | (?=\S*\d\S*(?: |\z))


The matches are marked below by the party hats (^).

1 23565 john smith 01/01/2021 another
 ^     ^          ^          ^

1 23565 john smith1 01/01/2021 another
 ^     ^    ^      ^          ^

1 23565 1john smith 01/01/2021 another
 ^     ^     ^     ^          ^

1 23565 jo1hn sm1th 01/01/2021 another
 ^     ^     ^     ^          ^

The regular expression has the following elements:

(?<=        # begin positive lookbehind
  (?:^| )   # match beginning of string or space in a non-capture group
  \S*\d\S*  # match 0+ chars other than whitespace followed by
            # a digit followed by 0+ chars other than whitespace
)           # end positive lookbehind
[ ]         # match a space
|           # or
[ ]         # match a space
(?=         # begin positive lookahead
  \S*\d\S*  # match 0+ chars other than whitespace followed by
            # a digit followed by 0+ chars other than whitespace
  (?: |\z)  # match a space or end of string in a non-capture group
)           # end positive lookahead

I've put each of the spaces in a character class ([ ]) to make them visible.

This regular expression relies on the fact that C#'s regex engine supports variable-length lookbehinds, a feature most regex engines lack.

Depending on requirements, one may want to replace \S*\d\S* with [a-z\d]*\d[a-z\d]*, possibly with the case-insensitive flag (i) set.
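
For completeness, here is one way the expression might be used from C#. This is a sketch only; the helper name SplitOnNumberWords is made up, and the pattern is passed without RegexOptions.IgnorePatternWhitespace because the literal spaces in it are significant:

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: split on spaces that touch a "number" word,
// i.e. a run of non-whitespace characters containing at least one digit.
string[] SplitOnNumberWords(string source) =>
    Regex.Split(source, @"(?<=(?:^| )\S*\d\S*) | (?=\S*\d\S*(?: |\z))");

foreach (string input in new[]
{
    "1 23565 john smith 01/01/2021 another",
    "1 23565 jo1hn sm1th 01/01/2021 another",
})
{
    Console.WriteLine(string.Join(" | ", SplitOnNumberWords(input)));
}
// 1 | 23565 | john smith | 01/01/2021 | another
// 1 | 23565 | jo1hn | sm1th | 01/01/2021 | another
```

The pattern contains only non-capturing groups, so Regex.Split returns just the pieces and no captured separators.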
