Dynamic web scraping in Python using BeautifulSoup and Pandas
I created a web scraper in Python that pulls data from a single web page. However, I'm having trouble writing a loop that iterates until all records have been scraped, while being careful not to duplicate any records.
It is clear that the only changing piece of the URL is the "start=" portion.
What is the easiest way to add a dynamic loop without overcomplicating things?
URL page 1:
https://www.winebusiness.com/classifieds/grapesbulkwine/?sort_type=1&sort_order=desc&start=1#anchor1
URL page 2:
https://www.winebusiness.com/classifieds/grapesbulkwine/?sort_type=1&sort_order=desc&start=51#anchor1
Final page URL:
https://www.winebusiness.com/classifieds/grapesbulkwine/?sort_type=1&sort_order=desc&start=751#anchor1
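Since only the `start` query parameter changes between pages, the URLs can be generated programmatically rather than written out by hand. A minimal sketch, assuming the 50-record step visible in the pagination above:

```python
# Build the paginated URLs; only the "start" query parameter changes.
BASE = ("https://www.winebusiness.com/classifieds/grapesbulkwine/"
        "?sort_type=1&sort_order=desc&start={}#anchor1")

# Pages begin at record 1 and advance 50 records at a time.
urls = [BASE.format(start) for start in range(1, 752, 50)]

print(urls[0])   # first page (start=1)
print(urls[-1])  # final page (start=751)
```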
#Imports
import pandas as pd
import requests
import numpy as np
from bs4 import BeautifulSoup
from datetime import date
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)
#Set URL
URL = "https://www.winebusiness.com/classifieds/grapesbulkwine/?sort_type=1&sort_order=desc&start=1#anchor1"
res = requests.get(URL)
soup = BeautifulSoup(res.content,'lxml')
#Define specific table
table = soup.find("table", attrs={"class": "table wb-cl-table"})
df = pd.read_html(str(table))[0]
#Add Listing_ID's (the href of each row's link, with a fallback if absent)
tbody = table.find("tbody")
df['Listing_ID'] = [tag.get('href', 'no link') for tag in tbody.find_all('a')]
df
I wrote the code on the assumption that each page holds a fixed 50 rows. `for i in range(1, 752, 50):` loops from record 1 to record 751 in 50-row increments.
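As a quick sanity check, that `range` call produces exactly one `start` value per page:

```python
# One start value per 50-row page, from record 1 through record 751.
starts = list(range(1, 752, 50))

print(starts[:3])   # first three start values
print(starts[-1])   # start value of the final page
print(len(starts))  # number of pages
```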
#Imports
import pandas as pd
import requests
import numpy as np
from bs4 import BeautifulSoup
from datetime import date
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)
# Create an empty DataFrame to collect all pages
all_data = pd.DataFrame(index=[], columns=['Varietal','Type','Appellation','Qty','Price','Date','Listing_ID'])
# The last record is number 782; start values 1, 51, ..., 751 cover it,
# since the page starting at 751 already includes records up to 782
last = 782
lst = list(range(1, last + 1, 50))
for i in lst:
    # Set URL for the current page
    URL = "https://www.winebusiness.com/classifieds/grapesbulkwine/?sort_type=1&sort_order=desc&start={}#anchor1".format(i)
    print(URL)
    res = requests.get(URL)
    soup = BeautifulSoup(res.content, 'lxml')
    # Define specific table
    table = soup.find("table", attrs={"class": "table wb-cl-table"})
    df = pd.read_html(str(table))[0]
    # Add Listing_ID's (the href of each row's link, with a fallback if absent)
    tbody = table.find("tbody")
    df['Listing_ID'] = [tag.get('href', 'no link') for tag in tbody.find_all('a')]
    all_data = pd.concat([all_data, df], ignore_index=True)
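If overlapping pages ever return the same listing twice, the combined frame can be deduplicated on the `Listing_ID` column afterwards. A sketch with a hypothetical two-column frame (the column names follow the DataFrame above; the data here is made up to illustrate the idea):

```python
import pandas as pd

# Hypothetical frame with one repeated Listing_ID to illustrate the idea.
all_data = pd.DataFrame({
    'Varietal': ['Cabernet', 'Merlot', 'Merlot'],
    'Listing_ID': ['/listing/1', '/listing/2', '/listing/2'],
})

# Keep the first occurrence of each listing and reset the index.
all_data = all_data.drop_duplicates(subset='Listing_ID', ignore_index=True)
print(len(all_data))  # 2
```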