How do I get the body of an HTTP response from a string containing the entire response, in Python?

Question

I got the entire HTTP response as a string but I want to extract just the body.

I would prefer not to use an external library or reimplement the header parsing.

Content-Type: text/xml
Content-Length: 129

<?xml version='1.0'?>
<methodResponse>
<params>
<param>
<value><boolean>0</boolean></value>
</param>
</params>
</methodResponse>
</code>

Update: If it wasn't obvious, I get the data from a source other than a URL, so any attempt to use something that requires a URL is useless.

Still, I read the data from a stream-like object (data = stream.read()), so a solution that works on a stream is also acceptable.

2nd update: yes, this is an XMLRPC response, but I'm getting it over a different transport, so I cannot use httplib to parse it, mainly because httplib is broken and does not accept strings or streams for parsing.

3rd update: the double newline can be \r\n\r\n or \n\n, depending on the server.

So to make it clear: the input is an HTTP response that is supposed to contain an XMLRPC response, and the output has to be that XMLRPC response. It doesn't have to parse the XML, but it has to be able to properly extract the XML from the response.

Answer 1

Based on Michal's solution, but this one includes an essential fix: the slice starts at p+4, so the four separator characters are not returned as part of the body.

def strip_http_headers(http_reply):
    p = http_reply.find('\r\n\r\n')
    if p >= 0:
        return http_reply[p+4:]
    return http_reply
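
The question's third update says the blank line may be '\r\n\r\n' or '\n\n'. Below is a minimal sketch (written for Python 3) of the same find() approach that accepts either form. The function name strip_http_headers_any is hypothetical, and the sketch assumes those two separators are the only ones that occur.

```python
def strip_http_headers_any(http_reply):
    # Assumption: the headers end at the first blank line, written
    # either as CRLF CRLF or as LF LF.
    candidates = []
    for separator in ('\r\n\r\n', '\n\n'):
        position = http_reply.find(separator)
        if position >= 0:
            # Record where the separator starts and where the body starts.
            candidates.append((position, position + len(separator)))
    if not candidates:
        # No blank line found: return the input unchanged, as above.
        return http_reply
    # The earliest separator marks the end of the headers.
    _, body_start = min(candidates)
    return http_reply[body_start:]
```

Picking the earliest match matters because the body itself may contain a blank line of the other kind.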

Answer 2

In an HTTP response, the headers are separated from the body by two consecutive CRLF sequences. So you can use the string find() method like this:

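def strip_http_headers(http_reply):
    p = http_reply.find('\r\n\r\n')
    if p >= 0:
        # Note: slicing at p keeps the '\r\n\r\n' separator at the start
        # of the result; slice at p + 4 to drop it.
        return http_reply[p:]
    return http_reply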

Answer 3

Short and sweet:

body = response.split('\r\n\r\n', 1)[-1]

(It uses the two-argument version of split(), so the string is split at most once, and [-1] takes the last element of the resulting list.)

Answer 4

resp = ('Content-Type: text/xml\r\n'
        'Content-Length: 129\r\n'
        "<?xml version='1.0'?>\r\n"
        '\r\n'
        '<methodResponse>\r\n'
        '<params>\r\n'
        '<param>\r\n'
        '<value><boolean>0</boolean></value>\r\n'
        '</param>\r\n'
        '</params>\r\n'
        '</methodResponse>\r\n'
        '</code>')

# Parenthesized so it runs under both Python 2 and Python 3
print(resp.partition('\r\n\r\n')[2])

result

<methodResponse>
<params>
<param>
<value><boolean>0</boolean></value>
</param>
</params>
</methodResponse>
</code>

On my display, the '\r' characters appear as squares at the end of each line.

The advantage of partition() is that it ALWAYS returns a tuple of 3 elements.
So if the sequence '\r\n\r\n' is not in the text,
resp.partition('\r\n\r\n')[2] is "",
whereas split('\r\n\r\n')[1] raises an IndexError and split('\r\n\r\n')[-1] is the entire text.
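
To make that difference concrete, here is a small sketch (Python 3) comparing the two calls on a response with and without the separator. The sample strings and helper names are made up for illustration.

```python
SEPARATOR = '\r\n\r\n'

# Hypothetical inputs: one with a body, one that has no blank line at all.
with_body = 'Content-Type: text/xml\r\n\r\n<methodResponse/>'
headers_only = 'Content-Type: text/xml\r\n'


def body_by_partition(response):
    # partition() always returns (head, separator, tail);
    # tail is '' when the separator is absent.
    return response.partition(SEPARATOR)[2]


def body_by_split(response):
    # split(sep, 1)[-1] returns the whole input when the separator is absent.
    return response.split(SEPARATOR, 1)[-1]
```

For headers_only, body_by_partition returns an empty string while body_by_split returns the input unchanged, so partition() makes "no body found" easier to detect.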

EDIT

If the double newline is variable, a regex is a compact way to handle the variability.
To craft the regex pattern, you need to know which variants can occur.

Supposing that only "\n\n", "\r\n\n", "\n\r\n" and "\r\n\r\n" are possible, a solution is to capture the body with a regex based on the following pattern:

import re

# Raw string so that \Z reaches the regex engine without an invalid-escape warning
regx = re.compile(r'(?:(?:\r?\n){2}|\Z)(.+)?', re.DOTALL)

for ss in (('Content-Type: text/xml\r\n'
            'Content-Length: 129\r\n'
            "<?xml version='1.0'?>\n"
            '\n'
            'body1\r\n'
            '<params>\r\n'
            '<param>\r\n'
            '</code>') ,
           ('Content-Type: text/xml\r\n'
            'Content-Length: 129\r\n'
            "<?xml version='1.0'?>\r\n"
            '\n'
            'body2\r\n'
            '<params>\r\n'
            '<param>\r\n'
            '</code>') ,
           ('Content-Type: text/xml\r\n'
            'Content-Length: 129\r\n'
            "<?xml version='1.0'?>\n"
            '\r\n'
            'body3\r\n'
            '<params>\r\n'
            '<param>\r\n'
            '</code>') ,
           ('Content-Type: text/xml\r\n'
            'Content-Length: 129\r\n'
            "<?xml version='1.0'?>\r\n"
            '\r\n'
            'body4\r\n'
            '<params>\r\n'
            '<param>\r\n'
            '</code>') ,
           ('Content-Type: text/xml\r\n'
            'Content-Length: 129\r\r'
            "<?xml version='1.0'?>\r\r"
            '\r\n'
            'body4\r\n'
            '<params>\r\n'
            '<param>\r\n'
            '</code>') ,):
    # Format first, then print, so this runs under both Python 2 and Python 3
    print('splitting on sequence  :  %r\n%r\n'
          % (re.search('(?:\r?\n)+(?=body)', ss).group(),
             regx.search(ss).group(1)))

result

splitting on sequence  :  '\n\n'
'body1\r\n<params>\r\n<param>\r\n</code>'

splitting on sequence  :  '\r\n\n'
'body2\r\n<params>\r\n<param>\r\n</code>'

splitting on sequence  :  '\n\r\n'
'body3\r\n<params>\r\n<param>\r\n</code>'

splitting on sequence  :  '\r\n\r\n'
'body4\r\n<params>\r\n<param>\r\n</code>'

splitting on sequence  :  '\r\n'
None
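
As a variation on the pattern above, here is a minimal sketch (Python 3) that uses re.split() with maxsplit=1 instead of a capturing group. The name extract_body is hypothetical, and it assumes the same four separator variants. Unlike the pattern above, it returns an empty string rather than None when the blank line is present but the body is empty.

```python
import re

# Assumption: the blank line is two line breaks, each either '\n' or '\r\n'.
HEADER_END = re.compile(r'(?:\r?\n){2}')


def extract_body(response):
    # Split once, at the first blank line only.
    parts = HEADER_END.split(response, maxsplit=1)
    # Without a blank line there is a single part, so report no body.
    return parts[1] if len(parts) == 2 else None
```

Splitting only once keeps any blank lines inside the body intact.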

Answer 5

You can use the standard urllib2 module:

from urllib2 import urlopen
data = urlopen('http://url.here/').read()

And if you want to parse the XML:

from urllib2 import urlopen
from xml.dom.minidom import parse

xml = parse(urlopen('http://url.here'))

Answer 6

Besides what Tito said, there's also the requests package:

>>> import requests
>>> r = requests.get("http://yoururl")
>>> r
<Response [200]>
>>> r.content
...

And then parse it with minidom or whatever tool you choose for that.

6 answers

solution1
5 ACCEPTED 2011-12-12 15:50:27

solution2
2 2011-12-12 13:22:46

solution3
2 2011-12-12 19:56:59

solution4
2 2011-12-13 00:01:39

solution5
1 2011-12-12 13:14:15

solution6
1 2011-12-12 13:24:03
