
C# How to extract words from a string and put them into class members

I have a problem with C# string manipulation and I'd appreciate your help. I have a file that contains many lines. It looks like this:

firstWord   number(secondWord)    thirdWord(Phrase)  Date1  Date2
firstWord number(secondWord)         thirdWord(Phrase)   Date1     Time1
...

I need to separate these words and put them into class properties. As you can see, the problem is that the spacing between words is not consistent: sometimes there is one space, sometimes eight. The second problem is that the third position holds a phrase of 2 to 5 words (again separated by spaces, or sometimes connected with _ or -), and it needs to be treated as one string, because it has to be a single class member. The class should look like this:

class A
{
    string a;    // firstWord
    int b;       // number
    string c;    // phrase
    DateTime d;  // Date1
    DateTime e;  // Date2 or Time1
}

I'd appreciate any ideas on how to solve this. Thank you.

Use the following steps:

  1. Use File.ReadAllLines() to get a string[], where each element represents one line of the file.
  2. For each line, use string.Split() to chop the line into individual words. Use the space and both parentheses as delimiters, and pass StringSplitOptions.RemoveEmptyEntries so that runs of spaces do not produce empty elements. This gives you an array of words. Call it arr.
  3. Now create an object of your class and assign like this:

     string a = arr[0];
     int b = int.Parse(arr[1]);
     string c = string.Join(" ", arr.Skip(4).Take(arr.Length - 6));
     DateTime d = DateTime.Parse(arr[arr.Length - 2]);
     DateTime e = DateTime.Parse(arr[arr.Length - 1]);

The only tricky part is string c above. The logic is that arr[0] to arr[3] hold firstWord, number, secondWord and thirdWord, and the last two elements hold the dates, so everything from element 4 up to the third-to-last element forms the phrase. We use LINQ to extract those elements and join them back together into the phrase. This obviously requires that the phrase itself doesn't contain any parentheses, but I assume that won't normally happen.
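The three steps above can be put together roughly as follows. This is only a sketch under stated assumptions, not the answerer's exact code: the `Record` class, its property names and `LineParser.Parse` are hypothetical, every line is assumed to have the `firstWord number(secondWord) thirdWord(phrase) date date` shape, and the last two fields are assumed to be values that `DateTime.Parse` accepts under the invariant culture.

```csharp
using System;
using System.Globalization;
using System.Linq;

// Hypothetical container for one parsed line; the names are illustrative only.
public class Record
{
    public string FirstWord { get; set; } = "";
    public int Number { get; set; }
    public string Phrase { get; set; } = "";
    public DateTime First { get; set; }
    public DateTime Second { get; set; }
}

public static class LineParser
{
    // Space and both parentheses, as suggested in step 2.
    private static readonly char[] Delimiters = { ' ', '(', ')' };

    public static Record Parse(string line)
    {
        // RemoveEmptyEntries collapses runs of spaces, so the indices below stay stable.
        string[] arr = line.Split(Delimiters, StringSplitOptions.RemoveEmptyEntries);

        // Minimum: firstWord, number, secondWord, thirdWord, one phrase word, two dates.
        if (arr.Length < 7)
            throw new FormatException("Unexpected line layout: " + line);

        return new Record
        {
            FirstWord = arr[0],
            Number = int.Parse(arr[1], CultureInfo.InvariantCulture),
            // Elements 4 through Length - 3 are the words of the phrase.
            Phrase = string.Join(" ", arr.Skip(4).Take(arr.Length - 6)),
            First = DateTime.Parse(arr[arr.Length - 2], CultureInfo.InvariantCulture),
            Second = DateTime.Parse(arr[arr.Length - 1], CultureInfo.InvariantCulture)
        };
    }
}
```

For example, `LineParser.Parse("alpha   42(beta)    gamma(one two_three four)  2017-01-01   2019-01-03")` should give FirstWord "alpha", Number 42 and Phrase "one two_three four". Lines read with File.ReadAllLines() can be passed to it one at a time.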

You need a loop plus the string methods and the TryParse methods:

var list = new List<ClassName>();
foreach (string line in File.ReadLines(path).Where(l => !string.IsNullOrEmpty(l)))
{
    string[] fields = line.Trim().Split(new char[] { }, StringSplitOptions.RemoveEmptyEntries);
    if (fields.Length < 5) continue;

    var obj = new ClassName();
    list.Add(obj);

    obj.FirstWord = fields[0];

    int number;
    int index = fields[1].IndexOf('(');
    if (index > 0 && int.TryParse(fields[1].Remove(index), out number))
        obj.Number = number;

    int phraseStartIndex = fields[2].IndexOf('(');
    int phraseEndIndex = fields[2].LastIndexOf(')');
    if (phraseStartIndex >= 0 && phraseEndIndex > phraseStartIndex)
    {
        obj.Phrase = fields[2].Substring(++phraseStartIndex, phraseEndIndex - phraseStartIndex);
    }

    DateTime dt1;
    if(DateTime.TryParse(fields[3], out dt1))
        obj.Date1 = dt1;

    DateTime dt2;
    if (DateTime.TryParse(fields[4], out dt2))
        obj.Date2 = dt2;
}
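One thing to watch with the loop above: because the line is split on whitespace first, a phrase that contains spaces is spread over several fields, and fields[2] only holds the text up to the phrase's first space. A possible workaround is to cut the phrase out of the whole line before splitting. The sketch below assumes that the phrase is the last parenthesized group on the line and contains no parentheses itself; `PhraseExtractor` and `TryExtractPhrase` are hypothetical names.

```csharp
using System;

public static class PhraseExtractor
{
    // Tries to return the text between the last '(' and the last ')' of the line.
    // On failure, phrase is set to an empty string and the method returns false.
    public static bool TryExtractPhrase(string line, out string phrase)
    {
        phrase = "";
        int open = line.LastIndexOf('(');
        int close = line.LastIndexOf(')');

        // Both delimiters must be present, in the right order.
        if (open < 0 || close <= open)
            return false;

        phrase = line.Substring(open + 1, close - open - 1).Trim();
        return true;
    }
}
```

With the phrase extracted this way, the rest of the line (everything before the last '(' and after the last ')') can still be split on whitespace as in the loop.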

The following regular expression seems to cover what I imagine you would need, or is at least a good start.

^(?<firstWord>[\w\s]*)\s+(?<secondWord>\d+)\s+(?<thirdWord>[\w\s_-]+)\s+(?<date>\d{4}-\d{2}-\d{2})\s+(?<time>\d{2}:\d{2}:\d{2})$

This captures 5 named groups:

  • firstWord is any run of alphanumeric or whitespace characters
  • secondWord is any numeric entry
  • thirdWord is any run of alphanumeric, whitespace, underscore or hyphen characters
  • date is any ISO-formatted date (the date is not validated)
  • time is any time (the time is not validated)

Any amount of whitespace is used as the delimiter, but you will have to Trim() the group captures. It makes a hell of a lot of assumptions about your format (dates are ISO formatted, times are hh:mm:ss).

You could use it like this:

Regex regex = new Regex( @"(?<firstWord>[\w\s]*)\s+(?<secondWord>\d+)\s+(?<thirdWord>[\w\s_-]+)\s+(?<date>\d{4}-\d{2}-\d{2})\s+(?<time>\d{2}:\d{2}:\d{2})$", RegexOptions.IgnoreCase );
var match = regex.Match("this is the first word        123     hello_world    2017-01-01 10:00:00");
if(match.Success){
    Console.WriteLine("{0}\r\n{1}\r\n{2}\r\n{3}\r\n{4}",match.Groups["firstWord"].Value.Trim(),match.Groups["secondWord"].Value,match.Groups["thirdWord"].Value,match.Groups["date"].Value,match.Groups["time"].Value);
}

http://rextester.com/LGM52187
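To get from the captured groups to typed class members, the group values still have to be trimmed and converted. A minimal sketch, assuming the same pattern and the same format assumptions (ISO date, hh:mm:ss time); the `Entry` class, its single `Timestamp` property that combines date and time, and `RegexLineParser.TryParse` are hypothetical choices made for this illustration.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

// Hypothetical target class; date and time are combined into one DateTime here.
public class Entry
{
    public string FirstWord { get; set; } = "";
    public int Number { get; set; }
    public string Phrase { get; set; } = "";
    public DateTime Timestamp { get; set; }
}

public static class RegexLineParser
{
    private static readonly Regex Pattern = new Regex(
        @"^(?<firstWord>[\w\s]*)\s+(?<secondWord>\d+)\s+(?<thirdWord>[\w\s_-]+)\s+(?<date>\d{4}-\d{2}-\d{2})\s+(?<time>\d{2}:\d{2}:\d{2})$");

    public static bool TryParse(string line, out Entry entry)
    {
        entry = new Entry(); // placeholder value returned when parsing fails

        Match m = Pattern.Match(line);
        if (!m.Success)
            return false;

        // The pattern only guarantees digits, so overflow is still possible.
        if (!int.TryParse(m.Groups["secondWord"].Value, NumberStyles.None,
                          CultureInfo.InvariantCulture, out int number))
            return false;

        // The regex checks the shape only; TryParseExact rejects impossible dates.
        string stamp = m.Groups["date"].Value + " " + m.Groups["time"].Value;
        if (!DateTime.TryParseExact(stamp, "yyyy-MM-dd HH:mm:ss",
                                    CultureInfo.InvariantCulture,
                                    DateTimeStyles.None, out DateTime timestamp))
            return false;

        entry.FirstWord = m.Groups["firstWord"].Value.Trim();
        entry.Number = number;
        entry.Phrase = m.Groups["thirdWord"].Value.Trim();
        entry.Timestamp = timestamp;
        return true;
    }
}
```

Parsing the sample line used above should yield FirstWord "this is the first word", Number 123, Phrase "hello_world" and a Timestamp of 2017-01-01 10:00:00.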

You have to use Regex; you may have a look here as a starting point. For example, to get the first word you could use this:

string data = "Example 2323 Second     This is a Phrase  2017-01-01 2019-01-03";
string firstword = new Regex(@"\b[A-Za-z]+\b").Matches(data)[0].Value;
