
How do I add multithreading to this?

I don't have much web-scraping experience. I wrote this code, but it runs really slowly; it is used to get the search results from a Google Chrome query. I want to try to add multithreading, but I don't really know how. Can somebody tell me how to multithread this? Also, which function am I supposed to multithread?

import urllib
import requests
from bs4 import BeautifulSoup
from multiprocessing import Pool
# desktop user-agent

def get_listing(url):
    headers = {
        'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36'}
    html = None
    links = None

    r = requests.get(url, headers=headers, timeout=10)

    if r.status_code == 200:
        html = r.text
        soup = BeautifulSoup(html, 'lxml')
        listing_section = soup.select('#offers_table table > tbody > tr > td > h3 > a')
        links = [link['href'].strip() for link in listing_section]
    return links

def scrapeLinks(query_string):
    USER_AGENT = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:65.0) Gecko/20100101 Firefox/65.0"

    query = query_string
    query = query.replace(' ', '+')
    URL = f"https://google.com/search?q={query}"

    headers = {"user-agent": USER_AGENT}
    resp = requests.get(URL, headers=headers)

    if resp.status_code == 200:
        soup = BeautifulSoup(resp.content, "html.parser")
        results = []
        for g in soup.find_all('div', class_='r'):
            anchors = g.find_all('a')
            if anchors:
                link = anchors[0]['href']
                title = g.find('h3').text
                item = {
                    "title": title,
                    "link": link
                }
                results.append(item)
        return results

def getFirst5Results(query_string):
    results = scrapeLinks(query_string)  # avoid shadowing the built-in name `list`
    return [item["link"] for item in results[:5]]

A few things about multithreading:

  • You can use it for code that requires network calls, for instance invoking an API.
  • It is useful when the code runs for a long time and you want the work to happen in the background.
  • In the case you've described, web scraping is a long-running task: it involves a network call to Google and parsing of the results after they come back. Assuming you're using the scrapeLinks function for scraping, here's some code:

import threading

t1 = threading.Thread(target=scrapeLinks, args=(query_string,))
t1.start()

To wait for the thread to finish, use t1.join(). Note that join() only blocks until the thread completes; it does not return the function's result, so to collect return values you need a shared data structure or a thread pool.
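Since a bare Thread cannot hand back scrapeLinks's return value, one common approach is the standard library's concurrent.futures.ThreadPoolExecutor, which runs several calls concurrently and collects their results. A minimal sketch follows; the fetch function here is a hypothetical stand-in for scrapeLinks (so the example runs without a network), and the query list is made up for illustration:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch(query):
    # Stand-in for scrapeLinks(query): a real version would make the
    # network request and parse the page. Here we just echo the query.
    return f"results for {query}"

queries = ["python threading", "beautifulsoup tutorial", "requests timeout"]

# Run up to 3 fetches concurrently; executor.map preserves input order,
# so results[i] corresponds to queries[i].
with ThreadPoolExecutor(max_workers=3) as executor:
    results = list(executor.map(fetch, queries))
```

Swapping fetch for your real scrapeLinks would let all the search queries run in overlapping network calls, which is where the speedup comes from, since the program spends most of its time waiting on I/O rather than computing.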
