How to match a string of a certain length with a regex

Question

For a project of mine, I'm trying to implement a small part of the BitTorrent protocol, which can be found here . Specifically, I want to use the "Bencoding" part of it, which is a way to safely encode data for transfer over a socket. The format is as follows:

8:a string => "a string"
i1234e => 1234
l1:a1:be => ['a', 'b']
d1:a1:b3:one3:twoe => {'a':'b', 'one':two}

The encoding part was easy enough, but decoding is become quite a hassle. For example, if I have a list of strings, I have no way to separate them into individual strings. I've tried several different solutions, including PyParsing and a custom token parser. I'm currently attempting to use regexes, and it seems to be going fairly well, but I'm still hung up on the string problem. My current regex is:

(?P<length>\d+):(?P<contents>.{\1})

However, i can't seem to use the first group as the length of the second group. Is there any good way to do this? Or am I approaching this all wrong, and the answer is sitting right in front of me?

Answer 1

Any parser you use for this is going to need to be stateful (ie remember stuff), and regexes are, by and large, not stateful. They're the wrong tool for this job.

If those are the only data types you have to worry about, I think I'd just write custom parsers for each data type, passing control to the appropriate parser after reading the first character.

~~I'd actually implement one now, but it's late.~~

Alright I decided to write an implementation:

from StringIO import StringIO
import string

inputs = ["10:a stringly",
         "i1234e" ,
         "l1:a1:be",
         "d1:a1:b3:one3:twoe"]

# Constants
DICT_TYPE = 'd'
LIST_TYPE = 'l'
INT_TYPE  = 'i'
TOKEN_EOF = ''
TOKEN_END = 'e'
COLON     = ':'


class BadTypeIndicatorException(Exception):pass


def read_int(stream):

   s = ""

   while True:
      ch = stream.read(1)
      if ch not in [TOKEN_EOF, TOKEN_END, COLON]:
         s += ch
      else:
         break

   return s


def tokenize(stream):

   s = ""

   while True:

      ch = stream.read(1)

      if ch == TOKEN_END or ch == TOKEN_EOF:
         return 

      if ch == COLON:
         length = int(s)
         yield stream.read(length)
         s = ""

      else:
         s += ch


def parse(stream):

   TYPE = stream.read(1)

   if TYPE in string.digits:
      length = int( TYPE + read_int(stream) )
      return stream.read(length)

   elif TYPE is INT_TYPE: 
      return int( read_int(stream) )

   elif TYPE is LIST_TYPE: 
      return list(tokenize(stream))

   elif TYPE is DICT_TYPE:
      tokens = list(tokenize(stream))
      return dict(zip(tokens[0::2], tokens[1::2]))

   else: 
      raise BadTypeIndicatorException



for input in inputs:
   stream = StringIO(input)
   print parse(stream)

Answer 2

You can do it if you parse the string twice. Apply the first regex to get the length. Concatenate the length in your second regex to form a valid expression.

Not sure how that can be done in python, but a sample in C# would be:

string regex = "^[A-Za-z0-9_]{1," + length + "}$"

To match 1 to length no of chars which can be alpanumeric or _ where length is determined from a previous regex that retrieves only the length.

Hope this helps :)

Answer 3

You'll want to do this in two steps. Regular expressions are actually a little overkill for such simple parsing problems as this. Here's how I'd do it:

def read_string(stream):
    pos = stream.index(':')
    length = int(stream[0:pos])
    string = stream[pos+1:pos+1+length]
    return string, stream[pos+1+length:]

It's a functional-style way of parsing, it returns the value parsed and the rest of the stream.

For lists, maybe:

def read_list(stream):
    stream = stream[1:]
    result = []
    while stream[0] != 'e':
        obj, stream = read_object(stream)
        result.append(obj)
    stream = stream[1:]
    return result

And then you'd define a read_object that checks the first character of the stream and dispatches appropriately.

Answer 4

You are using the wrong tool for the job... This requires some sort of state keeping, and generally speaking, regular expressions are stateless.

An example implementation of bdecoding (and bencoding) in PERL that I did can be found here .

An explanation of how that function works (since I never did get to comment it [oops]):

Basically what you need to do is setup a recursive function. This function takes a string reference (so it can be modified) and returns "something" (the nature of this means it could be an array, a hashtable, an int, or a string).

The function itself just checks the first character in the string and decides what to do based of that:

If it is an i , then parse out all the text between the i and the first e , and try to parse it as an int according to the rules of what is allowed.
If it is a digit, then read all the digits up to : , then read that many characters off the string.

Lists and dictionaries are where things start to get interesting... if there is an l or d as the first character, then you need to strip off the l / d , then pass the current string back into the function, so that it can start parsing elements in the list or dictionary. Then just store the returned values in the appropriate places in an appropriate structure till you hit an e , and return the structure you're left with.

Remember, the function as I implemented it was DESTRUCTIVE. The string passed in is empty when the function returns due to it being passed by reference, or more accurately, it will be devoid of anything it parsed and returned (which is why it can be used recursively: anything it doesn't process is left untouched). In most cases of the initial call though, this should process everything unless you've been doing something odd, so the above holds.

Answer 5

Pseudo-code, without syntax checks:

define read-integer (stream):
    let number 0, sign 1:
        if string-equal ('-', (c <- read-char (stream))):
            sign <- -1
          else:
            number <- parse-integer (c)
        while number? (c <- read-char (stream)):
            number <- (number * 10) + parse-integer (c)
        return sign * number

define bdecode-string (stream):
    let count read-integer (stream):
        return read-n-chars (stream, count)

define bdecode-integer (stream):
    ignore read-char (stream)
    return read-integer (stream)

define bdecode-list (stream):
    ignore read-char (stream)
    let list []:
        while not string-equal ('e', peek-char (stream)):
            append (list, bdecode (stream))
        return list

define bdecode-dictionary (stream):
    let list bdecode-list stream:
        return dictionarify (list)

define bdecode (stream):
    case peek-char (stream):
        number? => bdecode-string (stream)
        'i' => bdecode-integer (stream)
        'l' => bdecode-list (stream)
        'd' => bdecode-dictionary (stream)

How to match a string of a certain length with a regex

Question

5 answers

solution1
8 ACCPTED 2009-06-02 06:27:52

solution2
2 2009-06-02 06:27:42

solution3
2 2009-06-02 06:52:50

solution4
1 2009-06-02 06:51:32

solution5
1 2009-06-02 12:32:02

How to match a string of a certain length with a regex

Question

5 answers

solution1 8 ACCPTED 2009-06-02 06:27:52

solution2 2 2009-06-02 06:27:42

solution3 2 2009-06-02 06:52:50

solution4 1 2009-06-02 06:51:32

solution5 1 2009-06-02 12:32:02

solution1
8 ACCPTED 2009-06-02 06:27:52

solution2
2 2009-06-02 06:27:42

solution3
2 2009-06-02 06:52:50

solution4
1 2009-06-02 06:51:32

solution5
1 2009-06-02 12:32:02