
Web scraping multiple pages in python and writing it into a csv file

I am new to web scraping and I am trying to scrape all the video links from each page of this specific site and write them into a csv file. For starters I am trying to scrape the URLs from this site:

https://search.bilibili.com/all?keyword=%E3%82%A2%E3%83%8B%E3%82%B2%E3%83%A9%EF%BC%81%E3%83%87%E3%82%A3%E3%83%89%E3%82%A5%E3%83%BC%E3%83%BC%E3%83%B3

and then go through all 19 pages. The problem I'm encountering is that the same 20 video links are being written 19 times (because I'm trying to go through all 19 pages), instead of getting (around) 19 distinct sets of URLs.

import requests 
from bs4 import BeautifulSoup
from csv import writer 

def make_soup(url): 
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    return soup

def scrape_url():
    for video in soup.find_all('a', class_='img-anchor'):
        link = video['href'].replace('//','')
        csv_writer.writerow([link])

with open("videoLinks.csv", 'w') as csv_file:
        csv_writer = writer(csv_file)
        header = ['URLS']
        csv_writer.writerow(header)

        url = 'https://search.bilibili.com/all?keyword=%E3%82%A2%E3%83%8B%E3%82%B2%E3%83%A9%EF%BC%81%E3%83%87%E3%82%A3%E3%83%89%E3%82%A5%E3%83%BC%E3%83%BC%E3%83%B3'
        soup = make_soup(url)

        lastButton = soup.find_all(class_='page-item last')
        lastPage = lastButton[0].text
        lastPage = int(lastPage)
        #print(lastPage)

        page = 1
        pageExtension = ''

        scrape_url()

        while page < lastPage:
            page = page + 1
            if page == 1:
                pageExtension = ''
            else:
                pageExtension = '&page='+str(page)
            #print(url+pageExtension)
            fullUrl = url+pageExtension
            make_soup(fullUrl)
            scrape_url()

Any help is much appreciated. I decided to code it this specific way so that I can better generalize it across the BiliBili site.

A screenshot is linked below showing how the first link repeats a total of 19 times:

(screenshot of the csv file)

Try

soup = make_soup(fullUrl)

on the second-to-last line.
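In context, the end of the while loop from the question would then read (the same code, with only that one assignment changed):

            fullUrl = url+pageExtension
            soup = make_soup(fullUrl)
            scrape_url()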

In the second-to-last line, you are not assigning the return value of make_soup. Your scrape_url function uses a variable called soup, but that variable only gets assigned once, so every call keeps re-scraping the first page.

If you change that line to soup = make_soup(fullUrl), it should work.
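For reference, a minimal corrected sketch of the whole script might look like the following. It keeps the selectors, URL, and file handling from the question; passing soup and csv_writer into scrape_url is an optional tidy-up rather than part of the original code, and is only one reasonable way to structure it:

import requests
from bs4 import BeautifulSoup
from csv import writer

def make_soup(url):
    response = requests.get(url)
    return BeautifulSoup(response.text, 'html.parser')

def scrape_url(soup, csv_writer):
    # write every video link found on the current page
    for video in soup.find_all('a', class_='img-anchor'):
        link = video['href'].replace('//', '')
        csv_writer.writerow([link])

with open("videoLinks.csv", 'w') as csv_file:
    csv_writer = writer(csv_file)
    csv_writer.writerow(['URLS'])

    url = 'https://search.bilibili.com/all?keyword=%E3%82%A2%E3%83%8B%E3%82%B2%E3%83%A9%EF%BC%81%E3%83%87%E3%82%A3%E3%83%89%E3%82%A5%E3%83%BC%E3%83%BC%E3%83%B3'
    soup = make_soup(url)

    # read the page count from the "last page" button
    lastPage = int(soup.find_all(class_='page-item last')[0].text)

    scrape_url(soup, csv_writer)  # page 1

    for page in range(2, lastPage + 1):
        fullUrl = url + '&page=' + str(page)
        soup = make_soup(fullUrl)  # re-assign soup for every page
        scrape_url(soup, csv_writer)

The key change in both versions is that soup is re-assigned from make_soup(fullUrl) on every iteration, so scrape_url operates on the newly fetched page instead of on page 1 each time.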
