
Conventional method for socket based application protocol parsing

What is the conventional way to parse an application protocol?

Given a stream from a socket carrying an already designed protocol (SMTP, for example), what is the usual way to process the protocol? Is it a yacc-based parser, a regex-based approach, or something else?

There are many application-layer protocols, but I think the main difference lies in whether the protocol is binary or text based. Both kinds are used extensively.

For a text-based protocol it's quite usual to tokenize the input and then parse it with something like yacc. Some text-based protocols are even easier to parse than that, so you might just split up the input and check whether it makes sense. Character encoding (e.g. UTF-8) should be taken into account, but your language most likely already provides routines for it, either built in or via a library. HTTP, for instance, is a text protocol and is quite easy to parse (example from here):

Request:

GET /path/file.html HTTP/1.0
From: someuser@jmarshall.com
User-Agent: HTTPTool/1.0
[blank line here]

Response:

HTTP/1.0 200 OK
Date: Fri, 31 Dec 1999 23:59:59 GMT
Content-Type: text/html
Content-Length: 1354

<html>
<body>
<h1>Happy New Millennium!</h1>
(more file contents)
  .
  .
  .
</body>
</html>

Most programmers can write a parser for that, although you are better off relying on a well-tested and complete library implementation.
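To illustrate the "split up the input and check whether it makes sense" approach, here is a minimal Python sketch that parses the request shown above. The function name parse_http_request is made up for this example, and the code assumes the whole request head has already been read from the socket and uses CRLF line endings. It ignores many details of real HTTP (header folding, repeated headers, size limits), so treat it as an illustration rather than a replacement for a library.

```python
def parse_http_request(raw: bytes):
    """Split a raw HTTP request into its parts (hypothetical helper, not a full parser)."""
    # The head ends at the first blank line; anything after it is the body.
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")

    # Request line: METHOD SP TARGET SP VERSION
    parts = lines[0].split(" ")
    if len(parts) != 3 or not parts[2].startswith("HTTP/"):
        raise ValueError(f"malformed request line: {lines[0]!r}")
    method, target, version = parts

    # Header lines: "Name: value". Names are case-insensitive, so lowercase them.
    headers = {}
    for line in lines[1:]:
        name, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"malformed header line: {line!r}")
        headers[name.strip().lower()] = value.strip()

    return method, target, version, headers, body


request = (
    b"GET /path/file.html HTTP/1.0\r\n"
    b"From: someuser@jmarshall.com\r\n"
    b"User-Agent: HTTPTool/1.0\r\n"
    b"\r\n"
)
method, target, version, headers, body = parse_http_request(request)
print(method, target, version)
print(headers)
```

Even this small sketch has to decide what to do with malformed input, which is a good part of why a tested library is usually the better choice.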

Binary protocols are somewhat different, though. The first step is to encode/decode the message using e.g. ASN.1 (used very often in telecom), protocol buffers or something similar. If possible, don't invent your own binary format; rely on tried and tested libraries. This is hard to get right, so it is no wonder that most ASN.1 tools, for example, are expensive.

Take ASN.1 UPER (unaligned Packed Encoding Rules): you define a simple value, e.g. (example from here):

myQuestion FooQuestion ::= {
    trackingNumber     5,
    question           "Anybody there?"
}

and it gets encoded like this:

01 05 0e 83 bb ce 2d f9 3c a0 e9 a3 2f 2c af c0

With all the bit-shifting and masking involved, it's not very easy to implement, which is why open source ASN.1 libraries with PER support are so rare.
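To give a feel for that bit-shifting, here is a hand-written Python sketch that decodes just the 16 bytes above. It is not a general PER decoder. It assumes that FooQuestion is a SEQUENCE of an unconstrained INTEGER followed by an unconstrained IA5String; the type definition is not shown here, so that schema is an assumption, and the function name decode_foo_question is made up for the example. Under that assumption the integer is sent as a length octet plus content octets, and the string as a length octet plus 7 bits per character with no padding between characters.

```python
def decode_foo_question(data: bytes):
    """Decode one FooQuestion value (assumed schema: INTEGER, then IA5String)."""
    # Unconstrained INTEGER: length octet, then two's-complement content octets.
    int_len = data[0]
    tracking_number = int.from_bytes(data[1:1 + int_len], "big", signed=True)
    pos = 1 + int_len

    # IA5String: length octet holding the character count (assumed < 128).
    char_count = data[pos]
    pos += 1

    # The characters are packed 7 bits each, most significant bit first.
    # Treat the remaining bytes as one big integer and mask out each group.
    packed = int.from_bytes(data[pos:], "big")
    total_bits = 8 * (len(data) - pos)
    if 7 * char_count > total_bits:
        raise ValueError("not enough data for the declared string length")

    chars = []
    for i in range(char_count):
        shift = total_bits - 7 * (i + 1)
        chars.append(chr((packed >> shift) & 0x7F))

    return tracking_number, "".join(chars)


encoded = bytes.fromhex("01 05 0e 83 bb ce 2d f9 3c a0 e9 a3 2f 2c af c0")
print(decode_foo_question(encoded))  # prints (5, 'Anybody there?')
```

The 14 characters take 98 bits, so they fit in 13 bytes instead of 14, and the final 6 bits of the last byte are padding. The fields here happen to start on byte boundaries; in general a PER field can start at any bit offset, which is where most of the implementation difficulty comes from.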

Both approaches have their advantages and disadvantages. Text-based protocols are somewhat easier to get right, debug and understand. They are usually quite chatty, though, and under certain circumstances (limited bandwidth, for example) this matters a lot. That is when one chooses e.g. ASN.1 PER, which is very difficult to implement or debug, but very compact.
