How do I add multithreading to this?
I don't know much about web scraping. I wrote this code, but it runs really slowly. It is used to get the search results for a Google search query. I'd like to add multithreading, but I don't really know how. Can somebody tell me how to multithread it? Also, which function am I supposed to multithread?
import urllib
import requests
from bs4 import BeautifulSoup
from multiprocessing import Pool
# desktop user-agent
def get_listing(url):
    headers = {
        'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36'}
    html = None
    links = None
    r = requests.get(url, headers=headers, timeout=10)
    if r.status_code == 200:
        html = r.text
        soup = BeautifulSoup(html, 'lxml')
        listing_section = soup.select('#offers_table table > tbody > tr > td > h3 > a')
        links = [link['href'].strip() for link in listing_section]
    return links
def scrapeLinks(query_string):
    USER_AGENT = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:65.0) Gecko/20100101 Firefox/65.0"
    # Replace spaces with '+' for use in the search URL
    query = query_string.replace(' ', '+')
    URL = f"https://google.com/search?q={query}"
    headers = {"user-agent": USER_AGENT}
    resp = requests.get(URL, headers=headers)
    # Start with an empty list so the function always returns a list,
    # even when the request does not succeed
    results = []
    if resp.status_code == 200:
        soup = BeautifulSoup(resp.content, "html.parser")
        for g in soup.find_all('div', class_='r'):
            anchors = g.find_all('a')
            if anchors:
                link = anchors[0]['href']
                title = g.find('h3').text
                item = {
                    "title": title,
                    "link": link
                }
                results.append(item)
    return results
def getFirst5Results(query_string):
    # Use a name that does not shadow the built-in "list"
    results = scrapeLinks(query_string)
    # Slicing returns up to five links and does not raise IndexError on short lists
    return [item["link"] for item in results[:5]]
A few things about multithreading
Assuming you are using the scrapeLinks function for scraping, here's some code:
import threading
t1 = threading.Thread(target=scrapeLinks, args=(query_string,))
t1.start()
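To wait for the thread to finish, use:
t1.join()
Note that join() only blocks until the thread is done; it does not return the value returned by scrapeLinks. To get the results back, have the target function store them somewhere shared (for example, append them to a list or put them on a queue.Queue), or use concurrent.futures.ThreadPoolExecutor, whose Future objects expose the return value through result().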
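Since the slow part here is waiting on HTTP responses, one way to run several queries at once and still get their return values is a thread pool. The following is only a minimal sketch, not the answerer's code: `run_concurrently` and `fake_scrape` are hypothetical names introduced for illustration, and `fake_scrape` stands in for `scrapeLinks` so the example runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_concurrently(worker, queries, max_workers=5):
    """Call worker(query) for every query using a pool of threads.

    Returns a dict mapping each query to its result, or to the
    exception instance if that particular call raised.
    (Hypothetical helper, not part of the original code.)
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # submit() schedules the call and immediately returns a Future.
        future_to_query = {pool.submit(worker, q): q for q in queries}
        # as_completed() yields each Future as soon as it finishes.
        for future in as_completed(future_to_query):
            query = future_to_query[future]
            try:
                # result() returns whatever worker(query) returned.
                results[query] = future.result()
            except Exception as exc:
                # Record the failure so one bad query does not abort the rest.
                results[query] = exc
    return results


def fake_scrape(query):
    """Stand-in for scrapeLinks (assumption) so no network is needed."""
    return [{"title": f"Result for {query}",
             "link": f"https://example.com/{query}"}]


if __name__ == "__main__":
    found = run_concurrently(fake_scrape, ["python", "threads", "requests"])
    for query, items in found.items():
        print(query, items)
```

With the real code, something like `run_concurrently(scrapeLinks, list_of_queries)` would be the likely call, so the function being run in threads is the one that performs the request. It is probably wise to keep `max_workers` small, since many simultaneous requests to a single site may be throttled or blocked.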