
BeautifulSoup - how do I scrape multiple links and then scrape the contents of those links

I'm trying to do a scrape where the landing page has various links (the 5 sub categories at the top): https://mcavoyguns.co.uk/contents/en-uk/d410_New_Browning_over___under_shotguns.html

Within each of these categories is a list of products: https://mcavoyguns.co.uk/contents/en-uk/d411_Browning_B725_Shotguns.html

Each product listed has a link to get further details (a direct link to the product as an individual page): https://mcavoyguns.co.uk/contents/en-uk/p74600_Browning-B725-Sporter-over-_-under.html

The scrape I've put together so far will get as far as creating a list of all the individual page links required. But when I try to loop over each individual product link for data, I can't seem to get BeautifulSoup to map the data from those links. It's as though it stays on the previous page (if you will).
What am I missing to allow for that second "bounce" to the "product_link" address (e.g. https://mcavoyguns.co.uk/contents/en-uk/p74600_Browning-B725-Sporter-over-_-under.html ) and allow me to scrape the data from there? I had thought I might need to add a time.sleep(5) timer to allow everything to load, but I'm still getting nothing.

Code:

from bs4 import BeautifulSoup 
import math 
import requests 
import shutil 
import csv 
import pandas 
import numpy as np 
from pandas import DataFrame 
import re
import os 
import urllib.request as urllib2 
import locale 
import json 
from selenium import webdriver 
import lxml.html 
import time 
from selenium.webdriver.support.ui import Select  
os.environ["PYTHONIOENCODING"] = "utf-8" 


#selenium requests 

browser = webdriver.Chrome(executable_path='C:/Users/admin/chromedriver.exe')
browser.get("https://mcavoyguns.co.uk/contents/en-uk/d410_New_Browning_over___under_shotguns.html") 
time.sleep(2) 

all_Outlinks=[] 
all_links=[]

soup = BeautifulSoup(browser.page_source, features="lxml") 
submenuFind = soup.find("div", "idx2Submenu") 
submenuItems = submenuFind.find_all("li", "GC34 idx2Sub") 

for submenuItem in submenuItems: 
    for link in submenuItem.select('a[href]'): 
        all_Outlinks.append("https://mcavoyguns.co.uk/contents/en-uk/" + link['href']) 
#print(all_Outlinks) 

for a_link in all_Outlinks:
    res = requests.get(a_link) 
    soup = BeautifulSoup(res.text, 'html.parser') 
    pageLinkDivs = soup.find_all("div", "column full")
    for pageLinkDiv in pageLinkDivs:
        for pageLink in pageLinkDiv.select('a[href]'):
            all_links.append("https://mcavoyguns.co.uk/contents/en-uk/" + pageLink['href'])
#print(all_links)
            
for product_link in all_links:
    time.sleep(5)
    resSecond = requests.get(product_link)
    soup = BeautifulSoup(resSecond.text, 'html.parser')
    model = soup.find("div", "GC75 ProductChoiceName")
    print(model)

PS Apologies for the additional imports. They are copied and pasted from a previous script and will be removed once confirmed as not required.

That info is pulled dynamically from a script tag when using the browser. When using requests, it will not be in the location you might be looking. Instead, pull that info from the script tag.

In this case, I pull all the info related to a given model that is within the script and generate a dataframe. I convert the string inside the script tag to a Python object with ast. I add the product URL and product title to the dataframe.
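
A minimal sketch of just that step, using one of the product URLs from the question (this assumes the ProductOptions script tag still contains a nested [[...]] array literal, as it does in the full code below):

import ast
import re

import pandas as pd
import requests
from bs4 import BeautifulSoup

product_link = "https://mcavoyguns.co.uk/contents/en-uk/p74600_Browning-B725-Sporter-over-_-under.html"
soup = BeautifulSoup(requests.get(product_link).text, 'html.parser')

# Grab the [[...]] literal out of the script tag and parse it into Python lists
script_text = soup.select_one('.ProductOptions script').string
rows = ast.literal_eval(re.search(r'(\[\[.*\]\])', script_text).group(1))

df = pd.DataFrame(rows)
df['product_link'] = product_link  # tag each option row with its source page
print(df)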

Each df is added to a list, which is converted to a final dataframe. As I don't know what final header names would be required, I have left some with their default names.

I have added handling for the case(s) where there are no model options listed for the given product.


from bs4 import BeautifulSoup 
import math 
import requests 
import shutil 
import csv 
import pandas as pd
import numpy as np 
import re
import os 
import urllib.request as urllib2 
import locale 
import json 
from selenium import webdriver 
import lxml.html 
import time 
from selenium.webdriver.support.ui import Select  
import ast

os.environ["PYTHONIOENCODING"] = "utf-8" 

#selenium requests 
browser = webdriver.Chrome(executable_path='C:/Users/admin/chromedriver.exe')
browser.get("https://mcavoyguns.co.uk/contents/en-uk/d410_New_Browning_over___under_shotguns.html") 
time.sleep(2) 

all_Outlinks=[] 
all_links=[]

soup = BeautifulSoup(browser.page_source, features="lxml") 
submenuFind = soup.find("div", "idx2Submenu") 
submenuItems = submenuFind.find_all("li", "GC34 idx2Sub") 

for submenuItem in submenuItems: 
    for link in submenuItem.select('a[href]'): 
        all_Outlinks.append("https://mcavoyguns.co.uk/contents/en-uk/" + link['href']) 
#print(all_Outlinks) 

with requests.Session() as s:
    
    for a_link in all_Outlinks:
        res = s.get(a_link)  # reuse the session for the category pages too
        soup = BeautifulSoup(res.text, 'html.parser') 
        pageLinkDivs = soup.find_all("div", "column full")
        for pageLinkDiv in pageLinkDivs:
            for pageLink in pageLinkDiv.select('a[href]'):
                all_links.append("https://mcavoyguns.co.uk/contents/en-uk/" + pageLink['href'])
    
    results = []
    
    for product_link in all_links:
        # print(product_link)
        resSecond = s.get(product_link)
        soup = BeautifulSoup(resSecond.text, 'html.parser')
        title = soup.select_one('.ProductTitle').text
        
        try:
            df = pd.DataFrame(ast.literal_eval(re.search(r'(\[\[.*\]\])', soup.select_one('.ProductOptions script').string).groups(0)[0]))
            df.iloc[:, -1] = product_link
        except:
            placeholder = ['No options listed'] * 8
            placeholder.append(product_link)
            df = pd.DataFrame([placeholder])
        
        df.insert(0, 'title', title)
        
        #print(df) # add headers you care about to df or do that at end on full list
        results.append(df)
final = pd.concat(results) # or add header here
print(final)

You could then look at speeding/tidying things up:

from bs4 import BeautifulSoup 
import requests 
import pandas as pd
import re
import os 
import locale 
from selenium import webdriver 
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait 
from selenium.webdriver.support import expected_conditions as EC 
import ast
from multiprocessing import Pool, cpu_count

def get_models_df(product_link):
    res = requests.get(product_link)
    soup = BeautifulSoup(res.text, 'lxml')
    title = soup.select_one('.ProductTitle').text

    try:
        df = pd.DataFrame(ast.literal_eval(re.search(r'(\[\[.*\]\])', soup.select_one('.ProductOptions script').string).groups(0)[0]))
        df.iloc[:, -1] = product_link
    except:
        placeholder = ['No options listed'] * 8
        placeholder.append(product_link)
        df = pd.DataFrame([placeholder])

    df.insert(0, 'title', title)
    return(df)


def get_all_pages(a_link):
    res = requests.get(a_link) 
    soup = BeautifulSoup(res.text, 'lxml') 
    all_links = ["https://mcavoyguns.co.uk/contents/en-uk/" + i['href'] for i in soup.select('.center-content > a')]   
    return all_links

if __name__ == '__main__':
    os.environ["PYTHONIOENCODING"] = "utf-8" 

    #selenium requests 
    browser = webdriver.Chrome(executable_path='C:/Users/admin/chromedriver.exe')
    browser.get("https://mcavoyguns.co.uk/contents/en-uk/d410_New_Browning_over___under_shotguns.html") 
    all_outlinks = [i.get_attribute('href') for i in WebDriverWait(browser,10).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".idx2Submenu a")))]
    browser.quit()
    
    with Pool(cpu_count()-1) as p:

        nested_links = p.map(get_all_pages , all_outlinks)
        flat_list = [link for links in nested_links for link in links]   
        results = p.map(get_models_df, flat_list)
        final = pd.concat(results)
        #print(final)
        final.to_csv('guninfo.csv', encoding='utf-8-sig', index = False)

So I said I would have a look at the other requested items and they are indeed available just with requests. Some things that needed handling:

  1. Different headers present for different products; some missing headers
  2. Some unicode characters (there are still some encoding things to look at)
  3. Handling cases where the description is missing
  4. Handling the "More" section
  5. Updating certain output values so Excel doesn't convert them to dates (see the short sketch after this list)
  6. Handling of nan headers (also covered in the sketch)
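
A short, self-contained sketch of items 5 and 6; the headers and values here are made up for illustration, and the same logic appears inside get_models_df below:

import numpy as np
import pandas as pd

headers = ['Calibre', np.nan, 'Top rib']  # a NaN header as read from the spec table
values = ['12-76', 'some text', '6']

# Item 6: replace a missing (NaN) header with a usable placeholder
headers = ['N/A' if pd.isnull(h) else h for h in headers]

row = {}
for header, value in zip(headers, values):
    if header.lower() == 'calibre':
        # Item 5: prefix with an apostrophe so Excel keeps 12-76 as text, not a date
        row[header] = "'" + value
    elif header.lower() == 'top rib' and 'mm' not in value:
        row[header] = value + 'mm'
    else:
        row[header] = value

print(row)  # {'Calibre': "'12-76", 'N/A': 'some text', 'Top rib': '6mm'}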

TODO:

  1. One of the functions has now become a rabid monster and needs re-factoring into smaller, friendlier function calls (a possible shape is sketched below).
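
One possible shape for that refactor, as a hedged sketch: the helper names are hypothetical, and the bodies just reuse the core steps of the full function below (options from the script tag, price/weight from the meta tags), leaving out the spec-table and description handling:

import ast
import re

import pandas as pd
import requests
from bs4 import BeautifulSoup


def parse_options(soup, product_link):
    # Model options from the ProductOptions script tag, or a placeholder row
    try:
        raw = re.search(r'(\[\[.*\]\])', soup.select_one('.ProductOptions script').string).group(1)
        df = pd.DataFrame(ast.literal_eval(raw))
    except Exception:
        df = pd.DataFrame([['No options listed'] * 8])
    df['product_link'] = product_link
    return df


def parse_price_and_weight(soup):
    # Price and weight from the product meta tags
    price = ' '.join([soup.select_one("[property='product:price:amount']")['content'],
                      soup.select_one("[property='product:price:currency']")['content']])
    weight = ' '.join([soup.select_one("[property='product:weight:value']")['content'],
                       soup.select_one("[property='product:weight:units']")['content']])
    return price, weight


def get_models_df(product_link):
    soup = BeautifulSoup(requests.get(product_link).text, 'lxml')
    df = parse_options(soup, product_link)
    df.insert(0, 'title', soup.select_one('.ProductTitle').text)
    df['price'], df['weight'] = parse_price_and_weight(soup)
    return df

The full version, with the spec-table and description handling, follows: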

from bs4 import BeautifulSoup 
import requests 
import pandas as pd
import re
import os 
import locale 
from selenium import webdriver 
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait 
from selenium.webdriver.support import expected_conditions as EC 
import ast
from multiprocessing import Pool, cpu_count
import numpy as np
import unicodedata

def get_models_df(product_link):

    resSecond = requests.get(product_link)
    soup = BeautifulSoup(resSecond.text, 'lxml')
    title = soup.select_one('.ProductTitle').text

    try:
        df = pd.DataFrame(ast.literal_eval(re.search(r'(\[\[.*\]\])', soup.select_one('.ProductOptions script').string).groups(0)[0]))
        
    except:
        placeholder = ['No options listed'] * 8
        df = pd.DataFrame([placeholder])
    
    df.insert(0, 'title', title)
    df['price'] = ' '.join([soup.select_one("[property='product:price:amount']")['content'], 
                   soup.select_one("[property='product:price:currency']")['content']])
    df['weight'] = ' '.join([soup.select_one("[property='product:weight:value']")['content'], 
                    soup.select_one("[property='product:weight:units']")['content']])

    output_headers = ['Action frame', 'Barrel','Barrel finish','Barrel length', 
                      'Barrel length (mm-inch)','Buttstock','Calibre','Chokes','Code',
                      'Drop at comb','Drop at heel','Forearm','Length','N/A','Notes',
                      'Options','Packaging','Sights','Stock style','Top rib','Weight','Wood','Wood grade'
                     ]
    
    df = pd.concat([df, pd.DataFrame(columns = output_headers)])
    
    try:
        description_table = pd.read_html(str(soup.select_one('.ProductDetailedDescription table, table')))[0].transpose()
        description_table.dropna(axis=0, how='all',inplace=True)
        headers = list(description_table.iloc[0,:])
        headers[:] = ['N/A' if pd.isnull(np.array([header], dtype=object)) else header for header in headers]
        
        for number, header in enumerate(headers):
            temp = header.lower()
            value = description_table.iloc[1, number]
            if temp == 'calibre':
                df[header] = "'" + value
            elif  temp == 'top rib' and 'mm' not in value:
                df[header] = value + 'mm'
            else:
                df[header] = value
     
    except:
        pass # no table
        
    description = soup.select_one('#ProductDetailsTab [title=More]')
    
    if description is None:
        desc = 'N/A'
    else:
        desc = '. '.join([i.text for i in soup.select('.ProductDescription li, .ProductDescription span') if i.text !=''])
        if desc == '':
            desc = soup.select_one('.ProductIntroduction').get_text()

    df['desc'] = unicodedata.normalize('NFKD', desc)   
    df['product_link'] = product_link
    
    return(df)

def get_all_pages(a_link):
        
    res = requests.get(a_link) 
    soup = BeautifulSoup(res.text, 'lxml') 
    all_links = ["https://mcavoyguns.co.uk/contents/en-uk/" + i['href'] for i in soup.select('.center-content > a')]

    return all_links

if __name__ == '__main__':
    #os.environ["PYTHONIOENCODING"] = "utf-8" 

    #selenium requests 
    browser = webdriver.Chrome()# executable_path='C:/Users/admin/chromedriver.exe')
    browser.get("https://mcavoyguns.co.uk/contents/en-uk/d410_New_Browning_over___under_shotguns.html") 
    all_outlinks = [i.get_attribute('href') for i in WebDriverWait(browser,10).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".idx2Submenu a")))]
    browser.quit()

    with Pool(cpu_count()-1) as p:

        nested_links = p.map(get_all_pages , all_outlinks)
        flat_list = [link for links in nested_links for link in links]
        results = p.map(get_models_df, flat_list)
        final = pd.concat(results)
        #print(final)
        final.to_csv('guninfo.csv', encoding='utf-8-sig', index = False)
        
        

As QHarr pointed out, Selenium was the answer. This gave me the direction to look at it with different eyes and allowed me to find the answer.

I'm posting this as my answer, but crediting @QHarr for the work, based on the work provided previously and the ongoing assistance that helped lead to the solution.

from bs4 import BeautifulSoup
import math
import requests
import shutil
import csv
import pandas
import numpy as np
from pandas import DataFrame
import re
import os
import urllib.request as urllib2
import locale
import json
from selenium import webdriver
import lxml.html
import time
from selenium.webdriver.support.ui import Select 
os.environ["PYTHONIOENCODING"] = "utf-8"

#selenium requests
browser = webdriver.Chrome(executable_path='C:/Users/andrew.glass/chromedriver.exe')
browser.get("https://mcavoyguns.co.uk/contents/en-uk/d410_New_Browning_over___under_shotguns.html") 
time.sleep(2) 

all_Outlinks=[] 
all_links=[]

soup = BeautifulSoup(browser.page_source, features="lxml") 
submenuFind = soup.find("div", "idx2Submenu") 
submenuItems = submenuFind.find_all("li", "GC34 idx2Sub") 

for submenuItem in submenuItems: 
    for link in submenuItem.select('a[href]'): 
        all_Outlinks.append("https://mcavoyguns.co.uk/contents/en-uk/" + link['href']) 
#print(all_Outlinks) 

for a_link in all_Outlinks:
    res = requests.get(a_link) 
    soup = BeautifulSoup(res.text, 'html.parser') 
    pageLinkDivs = soup.find_all("div", "column full")
    for pageLinkDiv in pageLinkDivs:
        for pageLink in pageLinkDiv.select('a[href]'):
            all_links.append("https://mcavoyguns.co.uk/contents/en-uk/" + pageLink['href'])
#print(all_links)
            
for product_link in all_links:
    
    browser.get(product_link)
    time.sleep(5)
    soup = BeautifulSoup(browser.page_source, 'html.parser')
    model = soup.find("div", "GC65 ProductOptions")
    modelFind = soup.find('select', attrs={'name': re.compile('model')})
    modelList = [x['origvalue'][:14] for x in modelFind.find_all('option')[1:]]
    print(modelList)

The model print is still a bit messy, but I can clean it up once all the requirements are gathered.
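
One way that clean-up might look (a sketch built on the loop above, not part of the original answer): collect (product_link, model) pairs as you go and build a DataFrame at the end, skipping pages without a model dropdown.

import pandas as pd

rows = []
for product_link in all_links:
    browser.get(product_link)
    time.sleep(5)
    soup = BeautifulSoup(browser.page_source, 'html.parser')
    modelFind = soup.find('select', attrs={'name': re.compile('model')})
    if modelFind is None:
        continue  # product page without a model dropdown
    for option in modelFind.find_all('option')[1:]:
        rows.append({'product_link': product_link, 'model': option['origvalue'][:14]})

models_df = pd.DataFrame(rows)
print(models_df)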
