
String parsing techniques

I am trying to find a good way to parse a message string into an object. The string is of fixed length and described below.

A snippet of the string specification:

  • protocol = int(2)
  • message type = string(1)
  • measurement = string(4)
  • etc

Simple String.Substring calls will work, but I think they may get a bit cumbersome as you approach the end of the string, e.g.:

var field1 = s.Substring(0, 2);
var field2 = s.Substring(2, 1);
....
var field99 = s.Substring(88, 4); // difficult magic numbers

I considered using a Regex, but thought that might be even more confusing.

I was trying to think of an elegant solution, where I could create a Parser which was passed a 'config' that would detail how to parse the string.

Something like...

 MyConfig config = new MyConfig();
 config.Add("Protocol",    length: 2, type: typeof(int));
 config.Add("MessageType", length: 1, type: typeof(char));


 Parser p = new Parser(config);
 var parserResult = p.Parse(message);

...but I'm going around in circles at the minute and not getting anywhere. Any pointers would be a great help.

So a simple message structure:

class Message
{
    public DateTime DateTime { get; set; }
    public int Protocol { get; set; }
    public string Measurement { get; set; }
    public string Type { get; set; }
    //....
}

Combined with a class that knows how to deserialize it:

class MessageSerializer
{
    public Message Deserialize(string str)
    {
        Message message = new Message();
        int index = 0;
        message.Protocol = DeserializeProperty(str, ref index, 2, Convert.ToInt32);
        message.Type = DeserializeProperty(str, ref index, 1, Convert.ToString);
        message.Measurement = DeserializeProperty(str, ref index, 4, Convert.ToString);
        message.DateTime = DeserializeProperty<DateTime>(str, ref index, 16, (s) =>
        {
            // Parse date time from 2013120310:28:55 format.
            // HH is the 24-hour clock; hh would reject hours after 12.
            return DateTime.ParseExact(s, "yyyyMMddHH:mm:ss", CultureInfo.InvariantCulture);
        });
        //...
        return message;
    }

    static T DeserializeProperty<T>(string str, ref int index, int count, 
        Func<string, T> converter)
    {
        T property = converter(str.Substring(index, count));
        index += count;
        return property;
    }
}

I don't think a regex is confusing if done the right way. You can use named capturing groups and you can define it quite neatly (example is for the first three fields, which you can extend as much as you want):

const string GRP_PROTOCOL = "protocol";
const string GRP_MESSAGE_TYPE = "msgtype";
const string GRP_MEASUREMENT = "measurement";

Regex parseRegex = new Regex(
    $"(?<{GRP_PROTOCOL}>.{{2}})" +
    $"(?<{GRP_MESSAGE_TYPE}>.{{1}})" +
    $"(?<{GRP_MEASUREMENT}>.{{4}})");

You can also define your groups and their lengths in an array:

const string GRP_PROTOCOL = "protocol";
const string GRP_MESSAGE_TYPE = "msgtype";
const string GRP_MEASUREMENT = "measurement";

Tuple<string, int>[] groups = {
    Tuple.Create( GRP_PROTOCOL, 2 ),
    Tuple.Create( GRP_MESSAGE_TYPE, 1 ),
    Tuple.Create( GRP_MEASUREMENT, 4 )
};

Regex parseRegex =
    new Regex(String.Join("", groups.Select(grp => $"(?<{grp.Item1}>.{{{grp.Item2}}})").ToArray()));

You can then access the groups by name whenever you need them:

Match match = parseRegex.Match(message);
string protocol = match.Groups[GRP_PROTOCOL].Value;
string msgType = match.Groups[GRP_MESSAGE_TYPE].Value;
string measurement = match.Groups[GRP_MEASUREMENT].Value;
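One thing the snippets above leave out is a check that the match actually succeeded; on a failed match, Groups[...].Value is just an empty string. Here is a minimal sketch of how that could look. The RegexMessageParser class and its TryParse method are names I made up for illustration, and restricting the protocol field to [0-9] is my own assumption based on it being declared as an int.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

// Hypothetical wrapper around the named-group regex idea.
public static class RegexMessageParser
{
    // ^ anchors the match to the start of the message. [0-9]{2} is an
    // assumption: it limits the protocol field to ASCII digits so that
    // int.Parse below always receives something it can convert.
    private static readonly Regex ParseRegex = new Regex(
        "^(?<protocol>[0-9]{2})" +
        "(?<msgtype>.{1})" +
        "(?<measurement>.{4})");

    public static bool TryParse(string message, out int protocol,
        out string msgType, out string measurement)
    {
        protocol = 0;
        msgType = null;
        measurement = null;

        Match match = ParseRegex.Match(message);
        if (!match.Success)
        {
            return false; // too short, or the protocol field is not numeric
        }

        protocol = int.Parse(match.Groups["protocol"].Value, CultureInfo.InvariantCulture);
        msgType = match.Groups["msgtype"].Value;
        measurement = match.Groups["measurement"].Value;
        return true;
    }
}
```

Calling RegexMessageParser.TryParse(message, out protocol, out msgType, out measurement) then tells you up front whether the message had the expected shape, instead of handing back empty fields.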

If the properties inside the input string are fixed-width, then Regex is overhead in both implementation and performance terms. The idea of creating a generic parser is good, but it only makes sense if you have multiple parsers to implement. There is no reason to have an abstraction if there is only one particular implementation.

I would go with just a StringReader:

using (var reader = new StringReader(input)) {
}

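...and then create a few helper extension methods like these:

// Just sample code, to give the idea.
// Extension methods have to live in a static class.
public static class TextReaderExtensions
{
    public static string ReadString(this TextReader reader, int count)
    {
        var buffer = new char[count];
        // ReadBlock keeps reading until count characters arrive or the input ends.
        int read = reader.ReadBlock(buffer, 0, count);
        return new string(buffer, 0, read);
    }

    public static int ReadNumeric(this TextReader reader, int count)
    {
        var str = reader.ReadString(count);
        int result;
        if (int.TryParse(str, out result))
        {
            return result;
        }
        // Handle the error; throwing is one option.
        throw new FormatException("Expected a numeric field but found '" + str + "'.");
    }
}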

// ...

The final usage would then look like this:

using (var reader = new StringReader(input)) {
    var protocol = reader.ReadNumeric(2);
    var messageType = reader.ReadString(1);
    var measurement = reader.ReadString(4);
    // ...
}

One idea would be a method such as GetNextCharacters(int position, int length, out int newPosition), which returns the next length characters (the string you wanted) and also gives you the new position to use for the next call.

That way, the only thing you change from call to call is the length.
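A minimal sketch of that helper might look like the following. The FixedWidth class name and the extra source parameter are my own additions for illustration; the answer only gives the rough signature.

```csharp
using System;

public static class FixedWidth
{
    // Hypothetical helper: returns `length` characters starting at `position`
    // and reports where the next field begins through `newPosition`.
    public static string GetNextCharacters(string source, int position, int length,
        out int newPosition)
    {
        if (source == null)
        {
            throw new ArgumentNullException(nameof(source));
        }
        if (position < 0 || length < 0 || position + length > source.Length)
        {
            throw new ArgumentOutOfRangeException(nameof(length),
                "Field runs past the end of the message.");
        }

        newPosition = position + length;
        return source.Substring(position, length);
    }
}
```

Starting with int pos = 0, each call can feed the position back into itself, e.g. FixedWidth.GetNextCharacters(message, pos, 2, out pos) followed by FixedWidth.GetNextCharacters(message, pos, 1, out pos), so no offsets ever need to be written out by hand.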

You can define a class with a property for each section of the string, plus a custom attribute (e.g. FieldItem) that specifies each field's start position and length. Pass the whole string to the constructor, then use reflection to read each property's attribute and load the property from the string (via a ReadString method, or whatever), using Substring(start, length) with the indexes taken from the attribute. I think this is cleaner than defining a special regex, and you can change the field definitions just by editing the attribute properties.
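As a rough sketch of the attribute idea, something like this could work. FieldItemAttribute and AnnotatedMessage are names I invented for illustration, and the use of Convert.ChangeType assumes that every mapped property type is IConvertible (int, string and so on).

```csharp
using System;
using System.Reflection;

// Hypothetical attribute: records where a property's text sits in the message.
[AttributeUsage(AttributeTargets.Property)]
public sealed class FieldItemAttribute : Attribute
{
    public int Start { get; }
    public int Length { get; }

    public FieldItemAttribute(int start, int length)
    {
        Start = start;
        Length = length;
    }
}

public class AnnotatedMessage
{
    [FieldItem(0, 2)] public int Protocol { get; set; }
    [FieldItem(2, 1)] public string Type { get; set; }
    [FieldItem(3, 4)] public string Measurement { get; set; }

    public AnnotatedMessage(string raw)
    {
        foreach (PropertyInfo property in GetType().GetProperties())
        {
            var field = property.GetCustomAttribute<FieldItemAttribute>();
            if (field == null)
            {
                continue; // this property is not mapped to the string
            }

            string text = raw.Substring(field.Start, field.Length);
            // Assumption: the property type can be produced by Convert.ChangeType.
            property.SetValue(this, Convert.ChangeType(text, property.PropertyType));
        }
    }
}
```

With this in place, new AnnotatedMessage("03EMSTR") fills all three properties, and moving a field only means editing the numbers in its attribute. Reflection has a cost, so if you parse a lot of messages it would probably be worth caching the property/attribute pairs once rather than looking them up per message.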

You might be able to leverage the TextFieldParser class (in the Microsoft.VisualBasic.FileIO namespace). It can accept a list of field widths to use when parsing.

using (var parser = new TextFieldParser(new StringReader(s))){
     parser.TextFieldType = FieldType.FixedWidth;
     parser.SetFieldWidths(2,1,4 /*etc*/);
     while (!parser.EndOfData)
     {
         var data = parser.ReadFields(); //string[]
     }
}

However, this only splits your data into an array of strings. If all your types were IConvertible, you could possibly do something like...

var types = new[] {typeof(int), typeof(string), typeof(string), typeof(DateTime), /*etc..*/ };
var data = parser.ReadFields();
var firstVal = Convert.ChangeType(data[0], types[0]); 
var secondVal = Convert.ChangeType(data[1], types[1]); 
// etc..
// or in a loop: 
for (var i = 0; i<data.Length;++i){
  var valAsString = data[i];
  var thisType = types[i];
  var value = Convert.ChangeType(valAsString , thisType);
  // do something with value
}

Note, though, that Convert.ChangeType returns object, so your variables would be typed as object as well unless you cast them:

var firstVal = (int)Convert.ChangeType(data[0], types[0]);
// because unfortunately this is not valid:
var firstVal = (types[0])Convert.ChangeType(data[0], types[0]);

You might be able to leverage the dynamic keyword in this case, though I have very little experience with it and I'm not sure it makes a difference:

dynamic firstVal = Convert.ChangeType(data[0], types[0]);

Note that there are performance penalties involved with both the dynamic keyword and the TextFieldParser class, which is reported not to be the most performant (see other SO posts on this matter), at least with larger strings or files. Of course, TextFieldParser may also be overkill for your case if all you are ever doing is parsing a single string.

If you have a DTO/POCO class that represents this data, you could always pass the string array returned by ReadFields() into a constructor on your DTO that populates the data for you, i.e.:

class Message {
    public DateTime DateTime { get; set; }
    public int Protocol { get; set; }
    public string Type { get; set; }
    public string Measurement {get;set;}
    public Message(string[] data) {
       Protocol = int.Parse(data[0]);
       Type = data[1];
       Measurement = data[2];
       DateTime = DateTime.Parse(data[3]);
    }
}

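As you said, if the layout of your string is fixed, you can use the Marshal class, like this:

[StructLayout(LayoutKind.Sequential, CharSet = CharSet.Unicode, Pack = 1)]
public struct TData
{
    // ByValTStr reserves one character of SizeConst for a null terminator,
    // so a 2-character field would come back with only 1 character.
    // ByValArray keeps every character.
    [MarshalAs(UnmanagedType.ByValArray, SizeConst = 2)]
    public char[] protocol;
    [MarshalAs(UnmanagedType.ByValArray, SizeConst = 1)]
    public char[] messageType;
    [MarshalAs(UnmanagedType.ByValArray, SizeConst = 4)]
    public char[] measurement;
...
    public int getProtocol() { return Convert.ToInt32(new string(protocol)); }
...
}

public TData get()
{
    var strSource = "03EMSTR...";
    // Copy the string into unmanaged memory, then overlay the struct on it.
    IntPtr pbuf = Marshal.StringToBSTR(strSource);
    try
    {
        return (TData)Marshal.PtrToStructure(pbuf, typeof(TData));
    }
    finally
    {
        Marshal.FreeBSTR(pbuf); // release the unmanaged copy
    }
}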

I think this method can make your code very clean and maintainable.

