Web scraping multiple pages in python and writing it into a csv file
I am new to web scraping and I am trying to scrape all the video links from each page of this specific site and write them into a csv file. For starters I am trying to scrape the URLs from this site:

https://search.bilibili.com/all?keyword=%E3%82%A2%E3%83%8B%E3%82%B2%E3%83%A9%EF%BC%81%E3%83%87%E3%82%A3%E3%83%89%E3%82%A5%E3%83%BC%E3%83%BC%E3%83%B3

and go through all 19 pages. The problem I'm encountering is that the same 20 video links are written 19 times (once for each of the 19 pages), instead of (around) 19 distinct sets of URLs.
import requests
from bs4 import BeautifulSoup
from csv import writer

def make_soup(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    return soup

def scrape_url():
    for video in soup.find_all('a', class_='img-anchor'):
        link = video['href'].replace('//','')
        csv_writer.writerow([link])

with open("videoLinks.csv", 'w') as csv_file:
    csv_writer = writer(csv_file)

    header = ['URLS']
    csv_writer.writerow(header)

    url = 'https://search.bilibili.com/all?keyword=%E3%82%A2%E3%83%8B%E3%82%B2%E3%83%A9%EF%BC%81%E3%83%87%E3%82%A3%E3%83%89%E3%82%A5%E3%83%BC%E3%83%BC%E3%83%B3'
    soup = make_soup(url)

    lastButton = soup.find_all(class_='page-item last')
    lastPage = lastButton[0].text
    lastPage = int(lastPage)
    #print(lastPage)

    page = 1
    pageExtension = ''
    scrape_url()
    while page < lastPage:
        page = page + 1
        if page == 1:
            pageExtension = ''
        else:
            pageExtension = '&page='+str(page)
        #print(url+pageExtension)
        fullUrl = url+pageExtension
        make_soup(fullUrl)
        scrape_url()
Any help is much appreciated. I decided to code it this specific way so that I can better generalize it across the BiliBili site.

A screenshot is linked below showing how the first link repeats a total of 19 times:
Try

soup = make_soup(fullUrl)

in the second-to-last line.
In the second-to-last line, you are not assigning the return value of make_soup. In your scrape_url function, you are using a variable called soup, but that only gets assigned once, so every call re-scrapes the first page. If you change that line to soup = make_soup(fullUrl), it should work.
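Putting the fix together, here is one possible restructuring of the original script: the parsed soup is passed into the scraper explicitly instead of being read from a global, so each page's links are written exactly once. The helper names scrape_links and scrape_all are mine, not from the original code, and this assumes the same page markup (a.img-anchor links) as the question.

```python
import requests
from bs4 import BeautifulSoup
from csv import writer

def make_soup(url):
    # Fetch the page and parse it into a BeautifulSoup object.
    response = requests.get(url)
    return BeautifulSoup(response.text, 'html.parser')

def scrape_links(soup):
    # Collect every video link on the page that was passed in,
    # stripping the leading protocol-relative '//'.
    return [a['href'].replace('//', '')
            for a in soup.find_all('a', class_='img-anchor')]

def scrape_all(base_url, last_page, csv_path='videoLinks.csv'):
    # Walk pages 1..last_page, re-assigning soup for every page,
    # and write each page's links to the csv.
    with open(csv_path, 'w', newline='') as csv_file:
        csv_writer = writer(csv_file)
        csv_writer.writerow(['URLS'])
        for page in range(1, last_page + 1):
            full_url = base_url if page == 1 else base_url + '&page=' + str(page)
            soup = make_soup(full_url)   # the missing assignment from the question
            for link in scrape_links(soup):
                csv_writer.writerow([link])
```

Because scrape_links takes the soup as a parameter, it can no longer silently reuse a stale page, which was the root cause of the 19 identical sets of links.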