
Scraping text from meta on website without surrounding tags

I'm trying to scrape a link from the <meta> tag in the HTML below, without the surrounding tag markup. This needs to be done for every item on the same page.

The html:

<div class="facets-item-cell-list-right">
    <meta itemprop="url" content="https://www.website.com/linktoitem1">
    <div class="title">
        <span itemprop="name"> item1 </span>

My code:

item = []
link = []

for product in soup.select('div.facets-item-cell-list-right'):
    item.append(product.span.get_text(strip=True))
    link.append(product.meta)
   
print("Setup Complete", *link, sep='\n')
print("Setup Complete", *item, sep='\n')

This prints:

   <meta content="https://www.website.com/linktoitem1" itemprop="url"/>
   <meta content="https://www.website.com/linktoitem2" itemprop="url"/>
   etc

How can I make the first print call output only the URLs, like this?

   https://www.website.com/linktoitem1
   https://www.website.com/linktoitem2
   etc

Is this what you want?

from bs4 import BeautifulSoup

sample = """
<meta content="https://www.website.com/linktoitem1" itemprop="url"/>
<meta content="https://www.website.com/linktoitem2" itemprop="url"/>
"""

link = []
for m in BeautifulSoup(sample, "html.parser").find_all("meta"):
    link.append(m["content"])

print("\n".join(link))

Output:

https://www.website.com/linktoitem1
https://www.website.com/linktoitem2
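
The same attribute lookup can be dropped straight into the loop from the question. The following is a minimal sketch, assuming markup shaped like the question's snippet; the second item and the closing tags in the sample string are made up for illustration and are not taken from the real page.

```python
from bs4 import BeautifulSoup

# Sample markup modelled on the question's HTML. The second item and the
# closing tags are assumptions added so the example is self-contained.
html = """
<div class="facets-item-cell-list-right">
    <meta itemprop="url" content="https://www.website.com/linktoitem1">
    <div class="title"><span itemprop="name"> item1 </span></div>
</div>
<div class="facets-item-cell-list-right">
    <meta itemprop="url" content="https://www.website.com/linktoitem2">
    <div class="title"><span itemprop="name"> item2 </span></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

item = []
link = []
for product in soup.select("div.facets-item-cell-list-right"):
    item.append(product.span.get_text(strip=True))
    # Indexing a tag like a dict returns the value of that attribute,
    # so this stores the URL string instead of the whole <meta> tag.
    link.append(product.meta["content"])

print(*link, sep="\n")
print(*item, sep="\n")
```

The only change from the question's loop is `product.meta["content"]` in place of `product.meta`.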

EDIT: based on the link you've shared, try this:

import requests
from bs4 import BeautifulSoup

headers = {
    "user-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.152 Safari/537.36"
}
url = "https://www.nickollsandperks.co.uk/New-and-Special-Offers/New-Whisky?order=relevance:asc"
page = requests.get(url, headers=headers).text

link = []
# Each product cell contains a <meta itemprop="url"> tag; read its "content" attribute.
for m in BeautifulSoup(page, "html.parser").select('div.facets-item-cell-list-right'):
    link.append(m.find("meta")["content"])

print("\n".join(link))

Output:

https://www.nickollsandperks.co.uk/Longmorn-18-Year-Old-2002-Chorlton-Whisky
https://www.nickollsandperks.co.uk/Bruichladdich-15-Year-Old-2005-Chorlton-Whisky
https://www.nickollsandperks.co.uk/Inchfad-13-Year-Old-2007-Chorlton-Whisky
https://www.nickollsandperks.co.uk/Port-Charlotte-12-Year-Old-2007-Valinch-STC01
https://www.nickollsandperks.co.uk/Longrow-Red-10-Year-Old-Malbec-Cask
https://www.nickollsandperks.co.uk/Kilkerran-8-Year-Old-Cask-Strength-56-9
https://www.nickollsandperks.co.uk/Springbank-18-Year-Old
https://www.nickollsandperks.co.uk/Kilkerran-12-Year-Old-46_1
https://www.nickollsandperks.co.uk/Tomatin-Decades-II
https://www.nickollsandperks.co.uk/Ardbeg-Traigh-Bhan-19-Year-Old-Batch-2
https://www.nickollsandperks.co.uk/Ardbeg-Arrrrrrrdbeg-Committee-Release
https://www.nickollsandperks.co.uk/Teaninich-13-Year-Old-2005-cask-487-Single-Cask-Nation
https://www.nickollsandperks.co.uk/Tomatin-12-Year-Old-2006-cask-800230-Single-Cask-Nation
https://www.nickollsandperks.co.uk/Trinidadian-Rum-16-Year-Old-2003-cask-3-Single-Cask-Nation
https://www.nickollsandperks.co.uk/Craigellachie-13-Year-Old-2005-cask-314984-Single-Cask-Nation
https://www.nickollsandperks.co.uk/Blended-Malt-9-Year-Old-2009-cask-417-Single-Cask-Nation
https://www.nickollsandperks.co.uk/Invergordon-45-Year-Old-1974-cask-7844000025-Single-Cask-Nation
https://www.nickollsandperks.co.uk/Aberfeldy-28-Year-Old-1991-cask-7435-Single-Cask-Nation
https://www.nickollsandperks.co.uk/Glen-Elgin-10-Year-Old-2010-cask-801386-Single-Cask-Nation
https://www.nickollsandperks.co.uk/Kentucky-Bourbon-24-Year-Old-1994-Single-Cask-Nation
https://www.nickollsandperks.co.uk/The-Macallan-Edition-No-3
https://www.nickollsandperks.co.uk/Craigellachie-26-Year-Old-1994-Connoisseurs-Choice-Gordon-And-MacPhail
https://www.nickollsandperks.co.uk/Strathisla-33-Year-Old-1987-Connoisseurs-Choice-Gordon-And-MacPhail
https://www.nickollsandperks.co.uk/Glen-Grant-30-Year-Old-1990-Connoisseurs-Choice-Gordon-And-MacPhail
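
If some product cells might lack the meta tag or its `content` attribute, `m.find("meta")["content"]` raises an error. A slightly more defensive variant is sketched below. The helper name `extract_links` and the sample markup are hypothetical, and whether the live page ever has such gaps is an assumption, not something verified here.

```python
from bs4 import BeautifulSoup


def extract_links(html):
    """Return the URLs stored in <meta itemprop="url"> tags inside product cells.

    Hypothetical helper for illustration; it skips cells with no matching
    meta tag and meta tags with no (or an empty) content attribute.
    """
    soup = BeautifulSoup(html, "html.parser")
    links = []
    # The attribute selector keeps only meta tags marked as the item URL.
    for meta in soup.select('div.facets-item-cell-list-right meta[itemprop="url"]'):
        url = meta.get("content")  # .get() returns None instead of raising
        if url:
            links.append(url)
    return links


# Made-up sample: one complete cell, one cell without a meta tag,
# and one cell whose meta tag has no content attribute.
sample = """
<div class="facets-item-cell-list-right">
    <meta itemprop="url" content="https://www.website.com/linktoitem1">
</div>
<div class="facets-item-cell-list-right">
    <div class="title"><span itemprop="name"> item2 </span></div>
</div>
<div class="facets-item-cell-list-right">
    <meta itemprop="url">
</div>
"""

result = extract_links(sample)
print("\n".join(result))
```

To use it on the real page, pass the downloaded text in, for example `extract_links(page)` with `page` as fetched in the answer above.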


 