Scrape links from many Google searches in Python
I want to scrape the first link that appears in a Google search, for 23,000 searches, and append them to the dataframe I am using. This is the error I get:
Traceback (most recent call last):
File "file.py", line 26, in <module>
website = showsome(company)
File "file.py", line 18, in showsome
hits = data['results']
TypeError: 'NoneType' object has no attribute '__getitem__'
This is my code so far:
import json
import urllib
import pandas as pd

def showsome(searchfor):
    query = urllib.urlencode({'q': searchfor})
    url = 'http://ajax.googleapis.com/ajax/services/search/web?v=1.0&%s' % query
    search_response = urllib.urlopen(url)
    search_results = search_response.read()
    results = json.loads(search_results)
    data = results['responseData']
    hits = data['results']
    d = hits[0]['visibleUrl']
    return d

company_names = pd.read_csv("my_file.csv")
websites = []
for company in company_names["Company"]:
    website = showsome(company)
    websites.append(website)

websites = pd.DataFrame(websites, columns=["Website"])
result = pd.concat([company_names, websites], axis=1, join='inner')
result.to_csv("export_file.csv", index=False, encoding="utf-8")
(I changed the names of the input and output files for privacy reasons.)
Thanks!
I will try to answer why this exception is raised -
I can see that Google detected you and returned a well-formed error response instead of search results, i.e.
{u'responseData': None, u'responseDetails': u'Suspected Terms of Service Abuse. Please see http://code.google.com/apis/errors', u'responseStatus': 403}
which is then assigned to results by the expression below.
results = json.loads(search_results)
So data = results['responseData'] is equal to None, and when you run hits = data['results'], the subscript data['results'] raises the error, because data is None and NoneType does not support item access ('__getitem__') -
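(An illustration only, not part of the original answer: the crash can be guarded against by checking responseData before subscripting it. A minimal sketch, faking the two response bodies with json.dumps, since hitting the real API is exactly what triggers the 403:)

```python
import json

def first_visible_url(search_results):
    """Return the first hit's visibleUrl, or None when Google refused the request."""
    results = json.loads(search_results)
    data = results.get('responseData')           # None in the 403 body above
    if data is None or not data.get('results'):  # blocked, or zero hits
        return None
    return data['results'][0]['visibleUrl']

# Faked response bodies (the blocked one mirrors the 403 response above)
blocked = json.dumps({"responseData": None, "responseStatus": 403})
ok = json.dumps({"responseData": {"results": [{"visibleUrl": "www.aam.com"}]}})

print(first_visible_url(blocked))  # None instead of a TypeError
print(first_visible_url(ok))       # www.aam.com
```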
I tried to use the random module (just a naive attempt) to simulate a real user by waiting between requests - but BTW, I strongly discourage doing this without Google's permission. I used time.sleep(random.choice((1,3,3,2,4,1,0))) as follows.
import json, random, time
import urllib
import pandas as pd

def showsome(searchfor):
    query = urllib.urlencode({'q': searchfor})
    url = 'http://ajax.googleapis.com/ajax/services/search/web?v=1.0&%s' % query
    search_response = urllib.urlopen(url)
    search_results = search_response.read()
    results = json.loads(search_results)
    data = results['responseData']
    hits = data['results']
    d = hits[0]['visibleUrl']
    return d

company_names = pd.read_csv("my_file.csv")
websites = []
for company in company_names["Company"]:
    website = showsome(company)
    websites.append(website)
    time.sleep(random.choice((1,3,3,2,4,1,0)))
    print website

websites = pd.DataFrame(websites, columns=["Website"])
result = pd.concat([company_names, websites], axis=1, join='inner')
result.to_csv("export_file.csv", index=False, encoding="utf-8")
The csv it generated contains -
Company,Website
American Axle,www.aam.com
American Broadcasting Company,en.wikipedia.org
American Eagle Outfitters,ae.com
American Electric Power,www.aep.com
American Express,www.americanexpress.com
American Family Insurance,www.amfam.com
American Financial Group,www.afginc.com
American Greetings,www.americangreetings.com
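A side note on the final pandas step, sketched here with two of the companies from the output above: pd.concat(..., axis=1, join='inner') pastes the two frames together by row index, so each Website lines up with its Company only because both frames were built in the same order.

```python
import pandas as pd

company_names = pd.DataFrame({"Company": ["American Axle", "American Express"]})
websites = pd.DataFrame(["www.aam.com", "www.americanexpress.com"],
                        columns=["Website"])

# axis=1 concatenation aligns rows by index (0 and 1 in both frames here),
# so each company ends up next to the website scraped for it
result = pd.concat([company_names, websites], axis=1, join='inner')
print(result)
```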