
Dynamic web scraping in Python using BeautifulSoup and Pandas

I created a web scraper that pulls data from a single web page using Python. However, I'm having trouble creating a loop that iterates until all records have been scraped, while being careful not to duplicate records.

It is clear that the only changing piece of the URL is the "start=" portion.

What is the easiest way to go about adding a dynamic loop without overcomplicating things?

URL page 1:

https://www.winebusiness.com/classifieds/grapesbulkwine/?sort_type=1&sort_order=desc&start=1#anchor1

URL page 2:

https://www.winebusiness.com/classifieds/grapesbulkwine/?sort_type=1&sort_order=desc&start=51#anchor1

Final page URL:

https://www.winebusiness.com/classifieds/grapesbulkwine/?sort_type=1&sort_order=desc&start=751#anchor1
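
Since only the start value changes, the page URLs can be generated up front. A minimal sketch, assuming 50 listings per page and a final page starting at 751, as in the URLs above:

# Build one URL per page for start = 1, 51, ..., 751
base = "https://www.winebusiness.com/classifieds/grapesbulkwine/?sort_type=1&sort_order=desc&start={}#anchor1"
urls = [base.format(start) for start in range(1, 752, 50)]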

#Imports
import pandas as pd
import requests
from bs4 import BeautifulSoup
from io import StringIO

pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)

#Set URL
URL = "https://www.winebusiness.com/classifieds/grapesbulkwine/?sort_type=1&sort_order=desc&start=1#anchor1"
res = requests.get(URL)
soup = BeautifulSoup(res.content,'lxml')

#Define specific table and parse it into a DataFrame
table = soup.find("table", attrs={"class": "table wb-cl-table"})
df = pd.read_html(StringIO(str(table)))[0]

#Add Listing_IDs: pull the href from each link in the table body
tbody = table.find("tbody")
df['Listing_ID'] = [tag.get('href', 'no link') for tag in tbody.find_all('a')]
df

I wrote the code on the assumption that each page is fixed at 50 lines. for i in range(1, 752, 50) loops from line 1 to line 751 in 50-line increments.

#Imports
import pandas as pd
import requests
from bs4 import BeautifulSoup
from io import StringIO

pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)

# Create an empty DataFrame to collect every page
all_data = pd.DataFrame(columns=['Varietal','Type','Appellation','Qty','Price','Date','Listing_ID'])

# The last record sits at start=782, so cap any generated start value there
last = 782
lst = [x if x <= last else last for x in range(1, 802, 50)]

for i in lst:
    #Set URL
    URL = "https://www.winebusiness.com/classifieds/grapesbulkwine/?sort_type=1&sort_order=desc&start={}#anchor1".format(i)
    print(URL)
    res = requests.get(URL)
    soup = BeautifulSoup(res.content,'lxml')

    #Define specific table and parse it into a DataFrame
    table = soup.find("table", attrs={"class": "table wb-cl-table"})
    df = pd.read_html(StringIO(str(table)))[0]

    #Add Listing_IDs: pull the href from each link in the table body
    tbody = table.find("tbody")
    df['Listing_ID'] = [tag.get('href', 'no link') for tag in tbody.find_all('a')]
    all_data = pd.concat([all_data, df], ignore_index=True)
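
If the total number of records isn't known in advance, the hard-coded last value can be replaced with a loop that stops as soon as a page comes back without table rows. This is a sketch, not tested against the site, and it assumes the server returns a page with an empty (or missing) table once start passes the last record:

# Fetch pages 50 records at a time until an empty page signals the end
start = 1
pages = []
while True:
    URL = "https://www.winebusiness.com/classifieds/grapesbulkwine/?sort_type=1&sort_order=desc&start={}#anchor1".format(start)
    res = requests.get(URL)
    soup = BeautifulSoup(res.content, 'lxml')
    table = soup.find("table", attrs={"class": "table wb-cl-table"})
    tbody = table.find("tbody") if table else None
    # Stop when the table is gone or its body has no rows (assumed end-of-data signal)
    if tbody is None or not tbody.find_all("tr"):
        break
    df = pd.read_html(StringIO(str(table)))[0]
    df['Listing_ID'] = [tag.get('href', 'no link') for tag in tbody.find_all('a')]
    pages.append(df)
    start += 50

all_data = pd.concat(pages, ignore_index=True)

Either way, any overlap between pages (for example the capped start=782 request, which repeats rows already fetched from the start=751 page) can be removed afterwards by de-duplicating on Listing_ID:

# Keep only the first occurrence of each listing
all_data = all_data.drop_duplicates(subset='Listing_ID', ignore_index=True)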
