Parsing a request's response for key/value pairs

Question

I'm storing the response from a POST request to Instagram's API in a text file. The response is HTML, which includes an access token I'd like to dig out. It's HTML because this POST response is really meant to be handled by the end user, who clicks a button and is then given the access code. However, I need to do this on the backend, hence the need to deal with the HTML response.

In any event, here's my code so far (the real client ID is obscured for this post, obviously):

import requests

OAuthURL = "https://api.instagram.com/oauth/authorize/?client_id=cb0096f08a3848e65f&redirect_uri=https://www.smashboarddashboard.com/whathappened&response_type=code"
OAuth_AccessRequest = requests.post(OAuthURL).text
#print OAuth_AccessRequest

# Save the HTML response to disk (Python 2: encode the unicode text to bytes).
with open('response.txt', 'w') as OAuthResponse:
    OAuthResponse.write(OAuth_AccessRequest.encode("UTF-8"))

# Read the saved HTML back in and display it.
with open('response.txt', 'r') as OAuthReady:
    OAuthView = OAuthReady.read()
print OAuthView

What I'm left with after this is HTML stored in a text file. Among the HTML, however, are dictionary-like objects whose key, value pairs I need to access. Some of it, for example, looks like this:

</div> <!-- .root -->

    <script src=//instagramstatic-a.akamaihd.net/bluebar/422f3d9/scripts/polyfills/es5-shim.min.js></script>
<script src=//instagramstatic-a.akamaihd.net/bluebar/422f3d9/scripts/polyfills/es5-sham.min.js></script>
<script type="text/javascript">window._sharedData = {"static_root":"\/\/instagramstatic-a.akamaihd.net\/bluebar\/422f3d9","entry_data":{},"hostname":"instagram.com","platform":{"is_touch":false,"app_platform":"web"},"qe":{"su":false},"display_properties_server_guess":{"viewport_width":360,"pixel_ratio":1.5},"country_code":"US","language_code":"en","gatekeepers":{"tr":false},"config":{"dismiss_app_install_banner_until":null,"viewer":null,"csrf_token":"2aedabf96ad1fe86fab0"},"environment_switcher_visible_server_guess":true};</script>

    </body>
</html>

It's the hexadecimal string stored under the key "csrf_token" that I need to grab. What's the best approach for digging this out of the HTML that's stored in the text file?

Answer 1

If the csrf_token string is the only such string in the whole page, it is straightforward to extract it with a regular expression:

import re

token_pattern = re.compile(r'"csrf_token":\s*"([^"]+)"')

token = token_pattern.search(requests.post(OAuthURL).content).group(1)

Note that I used the content attribute of the response; there is no point in decoding the whole response to Unicode when all you need is a few ASCII characters.
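
The one-liner above relies on Python 2, where the content attribute is a plain str. On Python 3 it is bytes, so the pattern has to be a bytes pattern as well. A minimal sketch, assuming Python 3; `extract_csrf_token` and `sample_html` are hypothetical names, and the sample body is trimmed from the question's HTML rather than fetched live:

```python
import re

# Bytes pattern, so it can be matched against response.content without decoding.
TOKEN_PATTERN = re.compile(rb'"csrf_token":\s*"([^"]+)"')


def extract_csrf_token(body):
    """Return the csrf_token value found in a bytes body, or None if absent."""
    match = TOKEN_PATTERN.search(body)
    if match is None:
        return None
    # The token itself is plain ASCII, so decoding just this slice is cheap.
    return match.group(1).decode("ascii")


# Stand-in for requests.post(OAuthURL).content (trimmed from the question's HTML).
sample_html = (
    b'<script type="text/javascript">window._sharedData = '
    b'{"config":{"viewer":null,"csrf_token":"2aedabf96ad1fe86fab0"}};</script>'
)

print(extract_csrf_token(sample_html))
```

Returning None instead of calling .group(1) directly avoids an AttributeError when the page layout changes and the pattern no longer matches.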

Demo:

>>> import requests, re
>>> token_pattern = re.compile(r'"csrf_token":\s*"([^"]+)"')
>>> OAuthURL = "https://api.instagram.com/oauth/authorize/?client_id=cb0096f08a3848e65f&redirect_uri=https://www.smashboarddashboard.com/whathappened&response_type=code"
>>> token_pattern.search(requests.post(OAuthURL).content).group(1)
'3fd6022ac344c3eaea46e87e258ef9c6'
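
If the page might contain more than one such string, or other keys are needed too, another option is to parse the whole window._sharedData object as JSON. This is an alternative to the regular expression, not part of the demo above. It is a sketch assuming Python 3, and assuming the script block always has the form `window._sharedData = {...};` seen in the question's HTML. `load_shared_data` is a hypothetical helper name, and the sample string is trimmed from the question:

```python
import json
import re

# Locates the start of the JSON object assigned to window._sharedData.
SHARED_DATA_START = re.compile(r'window\._sharedData\s*=\s*')


def load_shared_data(html):
    """Return the window._sharedData object as a dict, or None if not found."""
    match = SHARED_DATA_START.search(html)
    if match is None:
        return None
    # raw_decode parses one JSON value and ignores what follows (";</script>...").
    data, _end = json.JSONDecoder().raw_decode(html, match.end())
    return data


# Stand-in for the decoded response text (trimmed from the question's HTML).
sample_html = (
    '<script type="text/javascript">window._sharedData = '
    '{"hostname":"instagram.com","country_code":"US",'
    '"config":{"viewer":null,"csrf_token":"2aedabf96ad1fe86fab0"}};</script>'
)

shared = load_shared_data(sample_html)
print(shared["config"]["csrf_token"])
print(shared["country_code"])
```

With a real response this would take the decoded text attribute rather than content, since the example searches a str.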

You may want to look at the headers and cookies of the response as well; a CSRF token is usually also set as a cookie (or at the very least as a value in the session).

For this specific request, for example, the token is also stored as a cookie, matching the value in the JavaScript block:

>>> r = requests.post(OAuthURL)
>>> r.cookies
<RequestsCookieJar[Cookie(version=0, name='csrftoken', value='b2b621c198642e26a19fc9bf1b38d246', port=None, port_specified=False, domain='instagram.com', domain_specified=False, domain_initial_dot=False, path='/', path_specified=True, secure=False, expires=1467828030, discard=False, comment=None, comment_url=None, rest={}, rfc2109=False)]>
>>> r.cookies['csrftoken']
'b2b621c198642e26a19fc9bf1b38d246'
>>> 'b2b621c198642e26a19fc9bf1b38d246' in r.content
True
>>> token_pattern.search(r.content).group(1)
'b2b621c198642e26a19fc9bf1b38d246'
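
Since the cookie and the embedded value match, the two lookups can be combined: read the cookie first and fall back to the page body only if the cookie is missing. A sketch assuming Python 3; `find_csrf_token` is a hypothetical name, and the plain dict below stands in for r.cookies, which offers the same .get() lookup:

```python
import re

# Bytes pattern for searching the undecoded response body.
TOKEN_PATTERN = re.compile(rb'"csrf_token":\s*"([^"]+)"')


def find_csrf_token(cookies, body):
    """Prefer the 'csrftoken' cookie; otherwise search the bytes body."""
    token = cookies.get("csrftoken")
    if token:
        return token
    match = TOKEN_PATTERN.search(body)
    return match.group(1).decode("ascii") if match else None


# Stand-ins for r.cookies and r.content from the session above.
cookies = {"csrftoken": "b2b621c198642e26a19fc9bf1b38d246"}
body = b'{"config":{"csrf_token":"b2b621c198642e26a19fc9bf1b38d246"}}'

print(find_csrf_token(cookies, body))  # taken from the cookie
print(find_csrf_token({}, body))       # falls back to the body
```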

Answer 1
2, accepted, 2015-07-08 17:36:39
