
Python API requests to keyword research tool

I am currently trying to build a script that does keyword research at scale (for SEO) using the Grepwords API.

I have come across a few issues along the way (I'm also still very new to Python, so any help is really appreciated :-)).

I would like the following script to be able to handle an input file containing ~600k keywords and return the search volume for each of them.

One of the first issues I had was a TimeoutError, which I tried to solve with time.sleep; however, now I am getting HTTPError: HTTP Error 502: Bad Gateway.

How would I also update the code so that it creates a new file after every 10,000th row? Thanks!

import csv
import urllib.request
import urllib.parse
import json
import time


# Construct the URL template.
root_url = 'http://api.grepwords.com/lookup?apikey='
api_key = 'xxx'
country = 'united_kingdom'
url_template = root_url + api_key + '&loc=' + country + '&q='

# Read from the source file.
keywords = []
with open("example.csv", 'r') as input_file:
    fileReader = csv.reader(input_file, delimiter=',')
    for row in fileReader:
        keywords.append(row)

# Get and format the output.
output = [['Country: ' + country], [], ['Keyword', 'ams']]
for keyword in keywords:
    # Construct the final URL.
    parsed_keyword = urllib.parse.quote(keyword[0])
    url = url_template + parsed_keyword

    # Query the API.
    all_keyword_data = json.loads(urllib.request.urlopen(url).read())
    time.sleep(2)

    try:
        ams = all_keyword_data["keywords"][keyword[0]]['ams']
    except TypeError:
        ams = "NA"

    # Prepare the output data.
    keyword_data = [keyword[0], ams]
    output.append(keyword_data)

# Write to the output file.
with open("examplefile2.csv", 'w', newline='') as output_file:
    fileWriter = csv.writer(output_file, delimiter=',')
    for row in output:
        fileWriter.writerow(row)

You may want to contact the Grepwords admin team to ask why their service is giving you 502 HTTP errors; that status indicates something is broken on their side.

That said, sending 600k separate requests is certainly excessive and likely to lead to issues. Only the top-tier Enterprise pricing plan lets you query that many keywords in a month, so you may be running into a rate limit after exceeding your plan quota.

Next, if you do have such a generous plan, the API allows multiple keywords to be sent per query. This can be used to cut back greatly on the number of API calls you have to make, but not on the number of keywords you query (so it won't let you query more keywords than your paid plan allows). The number you can send per request is limited by the URL size the server accepts; generally speaking, you can count on 2,000 characters as a reasonable limit.
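To make this concrete, here is a rough sketch of the batching idea (the complete implementation follows in the answer code below); the sample keywords and the per-request count are illustrative assumptions, not API guarantees:

batch = ['red shoes', 'blue shoes', 'green shoes']
query = '|'.join(batch)  # sent as a single q= value: 'red shoes|blue shoes|green shoes'
# With a ~2,000-character URL budget and keywords averaging ~20 characters
# (plus one '|' separator each), roughly 90 keywords fit per request, so
# ~600k single-keyword calls shrink to roughly 6,500-7,000 batched calls.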

You also want to avoid reading all of your keywords into memory at once; just loop over the file directly and read keywords as needed (up to what fits in a single API query). Your input file doesn't appear to be in CSV format, since every line is just a single keyword, so there is no need to use the csv module in that case.

Next, you want to use a better HTTP library, one that can reuse open connections. This reduces the strain both on your network and on the API service. I recommend the requests library, which offers the most user-friendly API for this. Use a single session object to ensure connections are reused where possible.
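A session can also be configured to retry transient server errors (such as the 502 you are seeing) automatically. This is not part of the code below, but a minimal sketch using requests' built-in urllib3 retry support would look like this:

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
# Retry up to 5 times on transient 5xx responses, with exponential backoff
# between attempts, instead of failing outright on the first 502.
retries = Retry(total=5, backoff_factor=1, status_forcelist=[502, 503, 504])
session.mount('http://', HTTPAdapter(max_retries=retries))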

Writing out to a separate file for every 10k results is trivial enough; just keep a counter. I've created a new class below that encapsulates that behaviour: an output file is closed when the maximum row count is reached, and a new one is opened automatically as more rows are written. The files produced use a sequential numbering scheme: examplefile-1.csv, examplefile-2.csv, etc.

I've also encapsulated the keyword grouping and the API calls into separate functions, to aid readability:

import csv
import requests

url = 'http://api.grepwords.com/lookup'
api_key = 'xxx'
keywords_filename = "example.csv"
output_filename = 'examplefile-{}.csv'  # a template, will get an order number

base_query = {
    'apikey': api_key,
    'loc': 'united_kingdom',
}

# determine a maximum length for the keywords query
base_length = len(requests.Request('GET', url, params=base_query).prepare().url) + 3  # url + the &q= characters
max_length = 2000 - base_length  # max length of the keywords plus | symbols


class RotatingCSVOutput:
    """CSV writer object that rotates files after every max_rows rows have been written

    A new file is only opened if there are rows to write

    """
    def __init__(self, base_filename, *args, max_rows=10000, **kwargs):
        self._base = base_filename
        self._max = max_rows
        self._open_file = None
        self._writer = None
        self._rows_written = 0
        self._file_counter = 1
        self._args, self._kwargs = args, kwargs

    def _open(self):
        self._open_file = open(self._base.format(self._file_counter), 'w', newline='')
        self._writer = csv.writer(self._open_file, *self._args, **self._kwargs)
        self._file_counter += 1

    def _close(self):
        if self._open_file is not None:
            self._open_file.close()
            self._open_file = None
            self._writer = None

    def __enter__(self):
        return self

    def __exit__(self, *args):
        self._close()

    def writerow(self, row):
        if self._open_file is None:
            self._open()
        self._writer.writerow(row)
        self._rows_written += 1
        if self._rows_written % self._max == 0:
            # close after max row count is reached; a new file is opened on demand.
            self._close()

    def writerows(self, rows):
        for row in rows:
            self.writerow(row)


def max_keywords(keywords, max_chars):
    """Produce groups of keywords from the input sequence fitting within max_chars characters

    The length of each keyword is padded with 1 character to allow for the separator

    """
    group = []
    length = 0

    for keyword in keywords:
        if length + len(keyword) > max_chars:
            # our group is filled up, yield it and set up a new group
            yield group

            group, length = [], 0

        group.append(keyword)
        length += len(keyword) + 1  # count the | separator too

    if group:
        # last of the keywords at the end of the file
        yield group


def fetch_impressions(session, keywords):
    """Fetch the impressions value for each of the keywords from the API"""
    parameters = {'q': '|'.join(keywords), **base_query}
    response = session.get(url, params=parameters)
    response.raise_for_status()  # abort if there is an error code
    data = response.json().get('keywords', {})
    for keyword in keywords:
        yield keyword, data.get(keyword, {}).get('ams', 'NA')


with open(keywords_filename, 'r') as input_file,\
        requests.Session() as session,\
        RotatingCSVOutput(output_filename) as output:

    stripped_lines = (line.strip() for line in input_file)
    for keywords in max_keywords(stripped_lines, max_length):
        result_rows = fetch_impressions(session, keywords)
        output.writerows(result_rows)
