Wrong number of results trying to scrape google in a date range
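I have been trying to scrape the number of Google results within a specific date range. I do this by inserting the dates into the Google search query. However, the code I wrote returns a result count that includes pages outside the date range. My code is below: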
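import requests
from googlesearch import search

query = 'Kevin Spacey prima:14-01-2020 dopo:14-01-2020'
urls = []
for url in search(
        query,
        tld='it',
        lang='it',
        num=20,
        start=0,
        stop=None,
        pause=2.0
):
    try:
        # Fetch each result; a URL is kept whenever a response arrives,
        # even if the response carries an HTTP error status.
        r = requests.get(url, timeout=None)
        r.headers
        r.status_code
        urls.append(url)
    except requests.RequestException:
        # Skip URLs that cannot be reached at all.
        pass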
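A Google search gives me 13 results; my code gives 39. The problem is that "my" results do not match Google's. I think the problem is in the query, specifically in the date range, but I am not sure how to fix it. There may also be another mistake I have not spotted yet. I hope you can tell me what I am doing wrong.
Thank you for your time and help.
Please see Google's results here, and the output of my code below.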
https://tv.zam.it/programmi_in_tv_stasera.php
https://www.paramountnetwork.it/video/v5ln5t/film-paramount-network-gli-highlights-per-la-settimana-del-2-marzo-2020
https://www.davidemaggio.it/archives/181396/programmi-tv-di-stasera-martedi-14-gennaio-2020-su-rai2-il-film-amore-cucina-e-curry-al-posto-de-il-molo-rosso-spostato-in-seconda-serata
https://www.davidemaggio.it/archives/181401/ascolti-tv-lunedi-13-gennaio-2020
https://www.mymovies.it/film/2016/elvisnixon/pubblico/?id=778281
https://www.ilfoglio.it/siteMapVideo.jsp
http://www.starpolitics.it/author/redazione/page/2/
http://www.zorrolaleggenda.rai.it/dl/RaiTV/programmi/media/ContentItem-4acbbd88-0529-4ca5-a390-96cb38dd2317.html
https://www.lagazzettadellospettacolo.it/cinema/26473-nicholas-hoult-giurati-giffoni-film-festival-2016/
https://www.viaggiareleggeri.com/cerca/x/i
https://www.lagazzettadellospettacolo.it/musica/30431-peter-cincotti-live-italia/
https://www.viaggiareleggeri.com/cerca/x/-?ref=28250
https://www.audible.it/pd/Harry-Potter-e-il-Prigioniero-di-Azkaban-Harry-Potter-3-Audiolibri/B077HVX4WM
https://www.hfw.com/Briefings
http://www.inmediarex.it/cinema-tv/cinema-tv-recensioni/american-gods-la-serie-niente-di-cosi-divino/
http://america24.com/sitemapArticles.xml
https://www.weenjoy.net/sitemap/
https://ierioggidomaniblog.com/2017/06/02/e-arrivata-la-promo-shock-universal-su-amazon-tante-offerte-fino-al-2-luglio/
https://ierioggidomaniblog.com/2018/01/13/universal-pictures-baby-driver-barry-seal-linganno-e-madre/
https://www.glartent.com/IT/Rome/112229858801846/giovani-artisti-associati-srl
https://tubestar.it/breakingitaly
https://www.freeforumzone.com/d/1543749/Oggi-ho-visto-in-TV/discussione.aspx/18
https://mjj.freeforumzone.com/discussione.aspx?idd=662389
https://www.diariodelweb.it/tuttosu/tag/?q=4750
https://civiltascomparse.wordpress.com/category/p-greco/?ak_action=reject_mobile
https://www.ubook.com/audiobook/348309/copy-persuasivo-di-andrea-lisi
https://ipersphera.org/category/attrice/
https://www.luogocomune.net/28-opinione/4827-svezia-laboratorio-per-il-nwo
https://www.globalnpo.org/IT/Salerno/1382814642039640/La-Bottega-Di-Will
https://www.qoop.it/osvaldo-raschi-pugile?page=1
https://www.qoop.it/pugile-al-cogan?filter=lastyear
http://www.caminantes.it/page-16/index.php?categories=giornalisti
https://www.altadefinizione01.tel/10495-terminator-destino-oscuro-stream-ita.html
https://www.emailers.it/codice-sconto-del-50-cibdol-10-promozione-limitata/
https://aimatrabolmeicher.com/2014/03/03/oscar-2014-and-the-winner-is/
https://aimatrabolmeicher.com/goodbye/page/2365/
http://scandalissimi.it/home-archive.php
https://picnano.com/tags/prossimieventi
https://vilook.com/video/9E0I69VkXFc/il-lento-declino-dellitalia-qual-%C3%A8-il-vero-problema-breakingitaly-news
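Total number of websites: 39 (including HTTP errors)
Update:
After running a custom date range search on Google, the URL of the results page contains the following fields, which I need to reproduce in my code: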
www.google.co.uk ; I would prefer to look at www.google.it
q=Kevin+spacey
lr=lang_it
cr=countryIT
hl=it
tbs=lr:lang_1it,ctr:countryIT,cdr:1,cd_min:1/14/2020,cd_max:1/14/2020
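The fields above can be assembled into a search URL with the standard library. This is only a minimal sketch: the www.google.it host is the one the asker says they would prefer, the /search path is an assumption, and the variable names are illustrative. It builds the URL without sending any request.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Fields copied from the list above; the host and "/search" path are assumptions.
params = {
    "q": "Kevin spacey",
    "lr": "lang_it",
    "cr": "countryIT",
    "hl": "it",
    "tbs": "lr:lang_1it,ctr:countryIT,cdr:1,cd_min:1/14/2020,cd_max:1/14/2020",
}
search_url = "https://www.google.it/search?" + urlencode(params)

# Round-trip check: decoding the query string gives back the same fields.
decoded = {k: v[0] for k, v in parse_qs(urlsplit(search_url).query).items()}
print(search_url)
```

urlencode percent-encodes the colons, commas and slashes inside tbs, so the printed URL looks different from the browser's address bar while carrying the same values.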
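The query that returns 13 results uses the tbs parameter to specify the date restriction, not the inline operators prima:14-01-2020 dopo:14-01-2020. googlesearch supports tbs, and it even has a helper function, get_tbs, to which you can pass the from and to dates as datetime.date objects. You also have to set country to countryIT in the query.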
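To make the date restriction concrete, here is a minimal sketch of how a custom date range tbs value can be put together. build_tbs is a hypothetical helper written for illustration and is not the library's get_tbs; the cdr/cd_min/cd_max layout and the month/day/year order are taken from the tbs field in the question's URL.

```python
import datetime


def build_tbs(from_date, to_date):
    """Hypothetical helper: format a custom date range as a tbs value."""
    def fmt(d):
        # Month/day/year without zero padding, as in the question's URL.
        return "%d/%d/%d" % (d.month, d.day, d.year)
    return "cdr:1,cd_min:%s,cd_max:%s" % (fmt(from_date), fmt(to_date))


day = datetime.date(2020, 1, 14)
print(build_tbs(day, day))  # cdr:1,cd_min:1/14/2020,cd_max:1/14/2020
```

Passing the same date for both ends limits the results to a single day, which is what the question is after.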
The entire working script:
from googlesearch import search, get_tbs
import datetime

# query='Kevin Spacey prima:14-01-2020 dopo:14-01-2020'
query = 'Kevin Spacey'
urls = []
index = 0
for url in search(
        query,
        tld='it',
        lang='it',
        country='countryIT',
        num=20,
        start=0,
        stop=None,
        pause=2.0,
        tbs=get_tbs(
            datetime.date(2020, 1, 14),
            datetime.date(2020, 1, 14))
):
    urls.append(url)
    print("%d: %s" % (index, url))
    index += 1
print("\nTotal results found: %d\n" % (len(urls)))
This outputs:
0: https://www.cinematown.it/2020-01-oscar-2020-previsioni-scommesse/
1: https://www.cinematown.it/2020-01-notte-sul-pianeta-terra-trailer/
2: https://blog.italiansubs.net/critics-choice-awards-2020-i-vincitori/
3: https://www.amazon.it/Patrick-DVD/dp/B07J33SHLC
4: http://www.viraland.it/2020/01/14/cinema-e-gioco-i-migliori-film-ispirati-al-gaming/
5: https://www.altadefinizione01.tel/catalog/t/
6: https://www.altadefinizione01.tel/10495-terminator-destino-oscuro-stream-ita.html
7: https://www.sentieridelcinema.it/oscar-2020-tutte-le-nomination/
8: https://www.dailymood.it/2020/01/14/nomination-oscar-2020-comanda-joker-tarantino-e-scorsese-lo-tallonano/
9: https://www.cineblog.it/post/932961/bloodshot-nuovo-trailer-vin-diesel-film
10: https://www.cineblog.it/post/932933/black-widow-film-nuovo-trailer
11: https://www.davidemaggio.it/archives/181403/la-guerra-non-e-finita
12: https://www.davidemaggio.it/archives/181385/festival-di-sanremo-2020-donne-chi-sono
13: https://www.rossinavi.it/column/money/2408/
Total results found: 14