Parse Google search response JSON with Python requests: regex multiline

Question

I want to transform the original response into valid JSON. I am able to do it, but only in a somewhat sloppy way.

This is the original response:

// API callback
google.search.Search.apiary2387({
 "cursor": {
  "currentPageIndex": 0,
  "estimatedResultCount": "4490",
  "moreResultsUrl": "http://www.google.com/cse?oe=utf8&ie=utf8&source=uds&q=ssh&start=0&sort=&cx=013305635491195529773:0ufpuq-fpt0",
  "resultCount": "4,490",
  "searchResultTime": "0.22",
  "pages": [
   {
    "label": 1,
    "start": "0"
   },
   {
    "label": 2,
    "start": "1"
   },
   {
    "label": 3,
    "start": "2"
   },
   {
    "label": 4,
    "start": "3"
   },
   {
    "label": 5,
    "start": "4"
   },
   {
    "label": 6,
    "start": "5"
   },
   {
    "label": 7,
    "start": "6"
   },
   {
    "label": 8,
    "start": "7"
   },
   {
    "label": 9,
    "start": "8"
   },
   {
    "label": 10,
    "start": "9"
   }
  ]
 },
 "context": {
  "title": "Pastebin Active",
  "total_results": "0",
  "facets": []
 },
 "results": [
  {
   "GsearchResultClass": "GwebSearch",
   "cacheUrl": "http://www.google.com/search?q=cache:PBL2A25kpZoJ:pastebin.com",
   "clicktrackUrl": "https://www.google.com/url?q=http://pastebin.com/u/ssh&sa=U&ved=0ahUKEwiO4fjNpovMAhWBPxoKHYJXAS4QFggEMAA&client=internal-uds-cse&usg=AFQjCNHczEhDXdcUnRZhpArEeSiHfjwMJA",
   "content": "BitBucket - Backup your code in the cloud! Host unlimited private projects, for free\n. SIGN UP takes 10 seconds, and it&#39;s free! Guest&nbsp;...",
   "contentNoFormatting": "BitBucket - Backup your code in the cloud! Host unlimited private projects, for free\n. SIGN UP takes 10 seconds, and it's free! Guest ...",
   "formattedUrl": "pastebin.com/u/\u003cb\u003essh\u003c/b\u003e",
   "title": "\u003cb\u003eSsh&#39;s\u003c/b\u003e Pastebin - Pastebin.com",
   "titleNoFormatting": "Ssh's Pastebin - Pastebin.com",
   "unescapedUrl": "http://pastebin.com/u/ssh",
   "url": "http://pastebin.com/u/ssh",
   "visibleUrl": "pastebin.com",
   "richSnippet": {
    "cseImage": {
     "src": "http://pastebin.com/i/facebook.png"
    },
    "metatags": {
     "fbAppId": "231493360234820",
     "ogTitle": "Ssh's Pastebin - Pastebin.com",
     "ogType": "article",
     "ogUrl": "http://pastebin.com/u/ssh",
     "ogImage": "http://pastebin.com/i/facebook.png",
     "ogSiteName": "Pastebin",
     "viewport": "width=device-width, maximum-scale=1.0, user-scalable=no"
    }
   }
  }
 ]
}
);

To extract the valid JSON I have to remove the JavaScript call, so I remove everything up to the first ( and, at the end, the closing ) .

This is how I thought it would work:

import requests
import re
import json

url = 'https://www.googleapis.com/customsearch/v1element?key=AIzaSyCVAXiUzRYsML1Pv6RwSG1gunmMikTzQqY&rsz=filtered_cse&num=1&hl=en&prettyPrint=true&source=gcsc&gss=.com&sig=432dd570d1a386253361f581254f9ca1&start=0&cx=013305635491195529773:0ufpuq-fpt0&q=ssh&sort=&googlehost=www.google.com&callback=google.search.Search.apiary2387'

resp = requests.get(url)

content = resp.content

formatted = re.sub(r'(.*\(|\);$)','', content , re.I|re.M|re.DOTALL)

formatted_json = json.loads(formatted)
for i, result in enumerate(formatted_json['results']):
    print formatted_json['results'][i]['url']

And this is what I had to add to make it work:

formatted = re.sub(r'// API callback', '', content)

I don't know why this is needed. Since I am removing everything up to the ( , why doesn't the pattern apply to all lines when I use the flag re.M ?

You can see that (r'(.*\(|\);$)','', content , re.DOTALL) should work:

https://regex101.com/r/uN2wV4/3

(the option /s means . also matches \n , like DOTALL )

Answer 1

I created the following regex:

(\{(?:.|\n)*\})

Instead of replacing, it captures the contents between the opening and closing braces.

So you can use this with re.search to get what you need:

formatted = re.search(r'(\{(?:.|\n)*\})', content).group()
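As a small self-contained check of this pattern, the sketch below runs it on a hypothetical, abbreviated payload shaped like the response in the question and hands the match to json.loads. The sample string and the variable names are assumptions for illustration, not the real response, and the code is written for Python 3.

```python
import json
import re

# Hypothetical, abbreviated JSONP payload in the same shape as the response above.
content = """// API callback
google.search.Search.apiary2387({
 "results": [
  {
   "url": "http://pastebin.com/u/ssh"
  }
 ]
}
);"""

# Greedy match from the first "{" to the last "}" in the text.
match = re.search(r'(\{(?:.|\n)*\})', content)
data = json.loads(match.group())

# Collect the URL of every result.
urls = [result["url"] for result in data["results"]]
print(urls)
```

With requests on Python 3, resp.text (a str) would be the value to search, since resp.content is bytes and a str pattern cannot be applied to bytes.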

UPDATE: to use re.DOTALL

re.DOTALL is equivalent to the /s modifier (updated regex):

formatted = re.search(r'(\{.*\})', content, re.DOTALL).group()
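A likely reason the flags in the question's re.sub call had no effect: re.search takes flags as its third positional argument, but the signature of re.sub is re.sub(pattern, repl, string, count=0, flags=0), so a fourth positional argument is read as count. The sketch below illustrates the difference on a hypothetical, shortened payload (the sample string and variable names are assumptions). The first call spells out count= explicitly to show what the positional form amounts to.

```python
import json
import re

# Hypothetical, shortened payload with the same wrapper as the real response.
content = '// API callback\ngoogle.search.Search.apiary2387({"results": []}\n);'
pattern = r'(.*\(|\);$)'

# What the positional call amounts to: the flag value is used as `count`,
# so no flag is active. "." stops at newlines and the comment line survives.
misread = re.sub(pattern, '', content, count=int(re.M | re.DOTALL))

# Passing the flags by keyword lets "." span newlines, so ".*\(" also
# consumes the "// API callback" line.
with_flags = re.sub(pattern, '', content, flags=re.M | re.DOTALL)

print(repr(misread))
print(json.loads(with_flags))
```

One caveat with this approach: under DOTALL the greedy .*\( runs to the last ( in the whole text, so a ( inside any JSON string value would cut into the JSON itself.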

Answer 2

Easiest way: just remove this parameter from the request:

callback=google.search.Search.apiary2387

The response is then valid JSON.
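One way to drop the parameter without editing the URL by hand is sketched below with the standard library. The helper name drop_callback is hypothetical, and the URL is an abbreviated form of the one in the question. Whether the endpoint still returns plain JSON this way is taken from this answer, not verified here.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def drop_callback(url):
    """Return url without its `callback` query parameter (hypothetical helper)."""
    parts = urlsplit(url)
    # keep_blank_values=True preserves empty parameters such as "sort=".
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "callback"
    ]
    return urlunsplit(parts._replace(query=urlencode(query)))


# Abbreviated form of the URL from the question.
url = ("https://www.googleapis.com/customsearch/v1element"
       "?num=1&q=ssh&sort=&callback=google.search.Search.apiary2387")
plain_url = drop_callback(url)
print(plain_url)

# With requests, the body could then be parsed directly, for example:
#   data = requests.get(plain_url).json()
```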

2 answers

Answer 1: score 1, posted 2016-04-13 10:06:57

Answer 2: score 0, accepted, posted 2016-04-13 14:28:03
