
How do I get the body of an HTTP response from a string containing the entire response, in Python?

I got the entire HTTP response as a string but I want to extract just the body.

I would prefer not to use an external library or reimplement the header parsing.

Content-Type: text/xml
Content-Length: 129

<?xml version='1.0'?>
<methodResponse>
<params>
<param>
<value><boolean>0</boolean></value>
</param>
</params>
</methodResponse>
</code>

Update: In case it wasn't obvious, I get the data from a source other than a URL, so anything that requires a URL is useless here.

I still read the data from a stream-like object (data = stream.read()), so a solution that works on a stream is also acceptable.

2nd update: yes, this is an XMLRPC response, but I'm getting it over a different transport, so I cannot use httplib to parse it, mainly because httplib is broken and does not accept strings or streams for parsing.

3rd update: the double newline can be \r\n\r\n or \n\n, depending on the server.

So, to make it clear: the input is an HTTP response that is supposed to contain an XMLRPC response, and the output has to be that XMLRPC response. It doesn't have to parse the XML, but it has to properly extract the XML from the HTTP response.

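Based on Michal's solution, wrapped in a function. The essential point is that the slice starts at p+4, so the four-character separator itself is not returned as part of the body: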

def strip_http_headers(http_reply):
    p = http_reply.find('\r\n\r\n')
    if p >= 0:
        return http_reply[p+4:]
    return http_reply
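
The question's third update says the separator may be either \r\n\r\n or \n\n. A minimal sketch of how the function above could cover both without a regex follows; the name strip_http_headers_any is made up for this illustration, and it assumes that whichever separator occurs first in the text is the real end of the headers.

```python
def strip_http_headers_any(http_reply):
    """Return the body, accepting CRLF CRLF or LF LF as the separator."""
    candidates = []
    for sep in ('\r\n\r\n', '\n\n'):
        p = http_reply.find(sep)
        if p >= 0:
            # remember where the separator starts and how long it is
            candidates.append((p, len(sep)))
    if not candidates:
        # no blank line found: return the input unchanged, like the original
        return http_reply
    # the earliest separator ends the headers; later ones belong to the body
    p, sep_len = min(candidates)
    return http_reply[p + sep_len:]
```

Picking the earliest match matters because the body itself may contain blank lines.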

In an HTTP response, the headers are separated from the body by two consecutive CRLF sequences. So you can use the str.find() method like this:

p = http_reply.find('\r\n\r\n')
# p is the index where the separator starts; skip its 4 characters
body = http_reply[p + 4:] if p >= 0 else http_reply

Short and sweet:

body = response.split('\r\n\r\n', 1)[-1]

(it uses the two-argument version of split(), and [-1] selects the last element of the resulting list)
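
To see why the second argument matters, here is a small sketch with a made-up response whose body itself contains a blank line. The sample text is an assumption chosen for the illustration.

```python
response = ('Content-Type: text/plain\r\n'
            '\r\n'
            'first paragraph\r\n'
            '\r\n'
            'second paragraph')

# With maxsplit=1 only the first separator is used, so the body stays whole.
body = response.split('\r\n\r\n', 1)[-1]

# Without maxsplit, [-1] returns only the text after the LAST separator.
truncated = response.split('\r\n\r\n')[-1]

print(repr(body))
print(repr(truncated))
```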

resp = ('Content-Type: text/xml\r\n'
        'Content-Length: 129\r\n'
        '\r\n'
        "<?xml version='1.0'?>\r\n"
        '<methodResponse>\r\n'
        '<params>\r\n'
        '<param>\r\n'
        '<value><boolean>0</boolean></value>\r\n'
        '</param>\r\n'
        '</params>\r\n'
        '</methodResponse>\r\n'
        '</code>')

print(resp.partition('\r\n\r\n')[2])

result

<?xml version='1.0'?>
<methodResponse>
<params>
<param>
<value><boolean>0</boolean></value>
</param>
</params>
</methodResponse>
</code>

On my display, the '\r' characters appear as squares at the end of each line.

The advantage of partition() is that it ALWAYS returns a tuple of 3 elements:
so if the sequence '\r\n\r\n' is not in the text,
resp.partition('\r\n\r\n')[2] will be ""
while split('\r\n\r\n')[1] raises an IndexError and split('\r\n\r\n')[-1] is the entire text.
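
The three behaviours described above can be checked with a short sketch. The variable names and the sample text are assumptions made for this illustration.

```python
SEP = '\r\n\r\n'
text = 'no blank line in here'

# partition() always returns (head, separator, tail); when the separator
# is missing, head is the whole text and the other two are empty strings.
head, sep, tail = text.partition(SEP)
separator_found = bool(sep)

# split()[-1] silently returns the whole text in that case...
last_piece = text.split(SEP, 1)[-1]

# ...while split()[1] raises IndexError, because the list has one element.
try:
    text.split(SEP, 1)[1]
    raised_index_error = False
except IndexError:
    raised_index_error = True

print(separator_found, repr(tail), repr(last_piece), raised_index_error)
```

Checking the middle element of the tuple is a simple way to tell "empty body" apart from "no separator at all".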

EDIT

If the double newline is variable, a regex is a convenient way to absorb the variability.
To craft the pattern, you need to know which variants can occur.

Supposing that only "\n\n", "\r\n\n", "\n\r\n" and "\r\n\r\n" are possible, one solution is to capture the body with a regex based on the following pattern:

import re

# raw string, so that \r, \n and \Z reach the regex engine unchanged
regx = re.compile(r'(?:(?:\r?\n){2}|\Z)(.+)?', re.DOTALL)

for ss in (('Content-Type: text/xml\r\n'
            'Content-Length: 129\r\n'
            "<?xml version='1.0'?>\n"
            '\n'
            'body1\r\n'
            '<params>\r\n'
            '<param>\r\n'
            '</code>') ,
           ('Content-Type: text/xml\r\n'
            'Content-Length: 129\r\n'
            "<?xml version='1.0'?>\r\n"
            '\n'
            'body2\r\n'
            '<params>\r\n'
            '<param>\r\n'
            '</code>') ,
           ('Content-Type: text/xml\r\n'
            'Content-Length: 129\r\n'
            "<?xml version='1.0'?>\n"
            '\r\n'
            'body3\r\n'
            '<params>\r\n'
            '<param>\r\n'
            '</code>') ,
           ('Content-Type: text/xml\r\n'
            'Content-Length: 129\r\n'
            "<?xml version='1.0'?>\r\n"
            '\r\n'
            'body4\r\n'
            '<params>\r\n'
            '<param>\r\n'
            '</code>') ,
           ('Content-Type: text/xml\r\n'
            'Content-Length: 129\r\r'
            "<?xml version='1.0'?>\r\r"
            '\r\n'
            'body4\r\n'
            '<params>\r\n'
            '<param>\r\n'
            '</code>') ,):
    print('splitting on sequence  :  %r\n%r\n'
          % (re.search(r'(?:\r?\n)+(?=body)', ss).group(),
             regx.search(ss).group(1)))

result

splitting on sequence  :  '\n\n'
'body1\r\n<params>\r\n<param>\r\n</code>'

splitting on sequence  :  '\r\n\n'
'body2\r\n<params>\r\n<param>\r\n</code>'

splitting on sequence  :  '\n\r\n'
'body3\r\n<params>\r\n<param>\r\n</code>'

splitting on sequence  :  '\r\n\r\n'
'body4\r\n<params>\r\n<param>\r\n</code>'

splitting on sequence  :  '\r\n'
None
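
A related sketch, using the same assumption about which separators can occur: re.split() with maxsplit=1 returns the headers and the body in one call. The names HEADER_END and split_headers_body are hypothetical, and returning None for a missing separator mirrors the last result above.

```python
import re

# two line breaks in a row, each being either LF or CRLF
HEADER_END = re.compile(r'\r?\n\r?\n')

def split_headers_body(response):
    """Return (headers, body); body is None when no separator is found."""
    parts = HEADER_END.split(response, maxsplit=1)
    if len(parts) == 2:
        return parts[0], parts[1]
    return response, None

print(split_headers_body('Content-Type: text/xml\r\n\nbody\r\n'))
```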

You can use the standard urllib2 module:

from urllib2 import urlopen
data = urlopen('http://url.here/').read()

And if you want to parse the XML:

from urllib2 import urlopen
from xml.dom.minidom import parse

xml = parse(urlopen('http://url.here'))
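
The calls above need a URL, which the question's update rules out. For a response that is already held in a string, one standard-library possibility is the email package, since HTTP header lines use the same "Name: value" syntax as mail headers. This is only a sketch: body_via_email_parser is a made-up name, and it assumes the text starts directly with header lines, as in the question's sample.

```python
from email.parser import Parser

def body_via_email_parser(raw_response):
    # Assumption: raw_response starts directly with header lines. A leading
    # status line such as "HTTP/1.1 200 OK" is not a header and would have
    # to be removed before parsing.
    message = Parser().parsestr(raw_response)
    return message.get_payload()

sample = ('Content-Type: text/xml\r\n'
          'Content-Length: 129\r\n'
          '\r\n'
          "<?xml version='1.0'?>\r\n"
          '<methodResponse></methodResponse>\r\n')

body = body_via_email_parser(sample)
print(body)
```

The parsed message object also gives access to the headers, for example message['Content-Type'].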

Besides what Tito said, there's also the requests package:

>>> import requests
>>> r = requests.get("http://yoururl")
>>> r
<Response [200]>
>>> r.content
...

And then parse it with minidom or whatever tool you choose for that.
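
As a sketch of that last step, xml.dom.minidom.parseString() accepts the body as a string. The body below is the XML from the question, typed in by hand for this illustration and without the trailing </code> line, which is not valid XML at that position.

```python
from xml.dom.minidom import parseString

# Assumption: the headers have already been stripped from the response.
body = ("<?xml version='1.0'?>\n"
        "<methodResponse>\n"
        "<params>\n"
        "<param>\n"
        "<value><boolean>0</boolean></value>\n"
        "</param>\n"
        "</params>\n"
        "</methodResponse>\n")

doc = parseString(body)
# text content of the first <boolean> element
flag = doc.getElementsByTagName('boolean')[0].firstChild.data
print(flag)
```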

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 