
Python Web Scraping - How to scrape this type of site?

Okay, so I need to scrape the following webpage: https://www.programmableweb.com/category/all/apis?deadpool=1

It's a list of APIs. There are approx 22,000 APIs to scrape.


I need to:

1) Get the URL of each API in the table (pages 1-889), and also scrape the following info:

  • API name
  • Description
  • Category
  • Submitted

2) I then need to scrape a bunch of information from each URL.

3) Export the data to a CSV


The thing is, I'm a bit lost on how to think about this project. From what I can see, there are no AJAX calls being made to populate the table, which means I'm going to have to parse the HTML directly (right?)
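One quick way to confirm that: fetch the page with plain requests and check whether the table rows are already present in the raw HTML. requests never executes JavaScript, so anything it sees is server-rendered. A minimal sketch, borrowing the table selector used in the answers below:

import requests
from bs4 import BeautifulSoup

# Fetch the raw HTML without executing any JavaScript.
url = 'https://www.programmableweb.com/category/all/apis?deadpool=1'
res = requests.get(url)
soup = BeautifulSoup(res.text, 'html.parser')

# If rows show up here, the table is server-rendered and plain
# requests + BS4 is enough; an empty result would point to AJAX.
rows = soup.select('table.views-table tbody tr')
print(f'Found {len(rows)} rows in the raw HTML')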


In my head, the logic would be something like this (sketched in code right after the list):

  1. Use the requests & BS4 libraries to scrape the table

  2. Then, somehow grab the HREF from every row

  3. Access that HREF, scrape the data, move onto the next one

  4. Rinse and repeat for all table rows.
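In skeleton form, that plan maps onto code roughly like this. This is a sketch only; the answers below flesh out the details, and the zero-based page parameter is an assumption taken from their code:

import requests
from bs4 import BeautifulSoup

BaseUrl = 'https://www.programmableweb.com/category/all/apis?deadpool=1&page={}'

for page in range(0, 889):                               # one request per listing page
    soup = BeautifulSoup(requests.get(BaseUrl.format(page)).text, 'html.parser')
    for row in soup.select('table.views-table tbody tr'):    # step 1: the table rows
        link = row.find('a')                             # step 2: the HREF in the row
        if link is None:
            continue
        detail_url = 'https://www.programmableweb.com' + link['href']
        detail = BeautifulSoup(requests.get(detail_url).text, 'html.parser')
        # step 3: pull whatever fields you need out of `detail` here
        # step 4: the loops rinse and repeat for every row on every page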


Am I on the right track? Is this possible with requests & BS4?

Here are some screenshots of what I've been trying to explain.

Thank you SO much for any help. This is hurting my head haha

You should read more about scraping if you are going to pursue it.

from bs4 import BeautifulSoup
import csv
import os
import requests
from urllib import parse


def SaveAsCsv(list_of_rows):
    """Append one row to data.csv."""
    try:
        with open('data.csv', mode='a', newline='', encoding='utf-8') as outfile:
            csv.writer(outfile).writerow(list_of_rows)
    except PermissionError:
        print("Please make sure data.csv is closed\n")


# Write the header row only if data.csv does not exist yet, so re-runs
# append to the existing file instead of duplicating the header.
if os.path.isfile('data.csv') and os.access('data.csv', os.R_OK):
    print("File data.csv already exists\n")
else:
    SaveAsCsv(['api_name', 'api_link', 'api_desc', 'api_cat'])

# The site's pager is zero-based: ?page=0 is the first of the 889 pages.
BaseUrl = 'https://www.programmableweb.com/category/all/apis?deadpool=1&page={}'
for i in range(0, 889):
    print('## Getting Page {} out of 889'.format(i + 1))
    url = BaseUrl.format(i)
    res = requests.get(url)
    soup = BeautifulSoup(res.text, 'html.parser')
    table_rows = soup.select('div.view-content > table[class="views-table cols-4 table"] > tbody tr')
    for row in table_rows:
        tds = row.select('td')
        api_name = tds[0].text.strip()
        # hrefs in the table are relative; resolve them against the page URL
        api_link = parse.urljoin(url, tds[0].find('a').get('href'))
        api_desc = tds[1].text.strip()
        api_cat = tds[2].text.strip() if len(tds) >= 3 else ''
        SaveAsCsv([api_name, api_link, api_desc, api_cat])
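This covers step 1 of the question. A minimal sketch of step 2, reading the collected links back out of data.csv and visiting each one. The 'div.field span' selector is borrowed from the second answer below, so treat it as an assumption about the detail-page markup:

import csv
import requests
from bs4 import BeautifulSoup

with open('data.csv', newline='', encoding='utf-8') as infile:
    for row in csv.DictReader(infile):
        res = requests.get(row['api_link'])
        soup = BeautifulSoup(res.text, 'html.parser')
        # Grab the spec fields listed on the API's detail page.
        details = [span.text for span in soup.select('div.field span')]
        print(row['api_name'], details)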

Here we go, using requests, BeautifulSoup and pandas:

import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://www.programmableweb.com/category/all/apis?deadpool=1&page='

num = int(input('How Many Pages to Parse?> '))
print('please wait....')

# Collect each column of the table into its own list; the class strings
# below match the site's table cells for each column.
name = []
desc = []
cat = []
sub = []
for i in range(0, num):
    r = requests.get(f"{url}{i}")
    soup = BeautifulSoup(r.text, 'html.parser')
    for item1 in soup.findAll('td', attrs={'class': 'views-field views-field-title col-md-3'}):
        name.append(item1.text)
    for item2 in soup.findAll('td', attrs={'class': 'views-field views-field-search-api-excerpt views-field-field-api-description hidden-xs visible-md visible-sm col-md-8'}):
        desc.append(item2.text)
    for item3 in soup.findAll('td', attrs={'class': 'views-field views-field-field-article-primary-category'}):
        cat.append(item3.text)
    for item4 in soup.findAll('td', attrs={'class': 'views-field views-field-created'}):
        sub.append(item4.text)

# Stitch the four parallel lists back together into rows.
result = list(zip(name, desc, cat, sub))

df = pd.DataFrame(
    result, columns=['API Name', 'Description', 'Category', 'Submitted'])
df.to_csv('output.csv')

print('Task Completed, Result saved to output.csv file.')
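One caveat with this approach: zip truncates to the shortest of the four lists, so if any page is missing a cell for one column, the remaining rows silently shift out of alignment. Collecting whole table rows at once (as the first answer does) avoids that failure mode.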


Output sample: [screenshot]

Now for the href parsing:

import requests
from bs4 import BeautifulSoup
import pandas as pd

# Note: deadpool=1 here so the detail pages line up with the listing
# scraped into output.csv above (the original had deadpool=0).
url = 'https://www.programmableweb.com/category/all/apis?deadpool=1&page='

num = int(input('How Many Pages to Parse?> '))
print('please wait....')

# First pass: collect the absolute URL of every API detail page.
links = []
for i in range(0, num):
    r = requests.get(f"{url}{i}")
    soup = BeautifulSoup(r.text, 'html.parser')
    for link in soup.findAll('td', attrs={'class': 'views-field views-field-title col-md-3'}):
        for href in link.findAll('a'):
            result = 'https://www.programmableweb.com' + href.get('href')
            links.append(result)

# Second pass: visit each detail page and grab the text of every
# 'div.field span' element (the spec fields listed on the page).
spans = []
for link in links:
    r = requests.get(link)
    soup = BeautifulSoup(r.text, 'html.parser')
    span = [span.text for span in soup.select('div.field span')]
    spans.append(span)

df = pd.DataFrame(spans)
df.to_csv('data.csv')
print('Task Completed, Result saved to data.csv file.')
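Note that the detail pages don't all expose the same number of span fields; pandas pads the shorter rows with NaN and numbers the columns 0, 1, 2, ..., so expect some manual column renaming in data.csv.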


Sample view: [screenshot]

In case you want those two csv files merged together, here's the code:

import pandas as pd

# Both files were saved with pandas' default integer index, so merge()
# finds that shared unnamed column and joins the rows on it.
a = pd.read_csv("output.csv")
b = pd.read_csv("data.csv")
merged = a.merge(b)
merged.to_csv("final.csv", index=False)
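That works because both files carry the same default integer index column. An equivalent, more explicit sketch that joins on that index directly:

import pandas as pd

# Read the saved integer index back as the real index, then join on it.
a = pd.read_csv("output.csv", index_col=0)
b = pd.read_csv("data.csv", index_col=0)
merged = a.join(b)
merged.to_csv("final.csv", index=False)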

