No output when parsing with lxml XPath in BeautifulSoup

Question

When I try to scrape data from Sephora and Ulta using BeautifulSoup, I can get the HTML content of the page. But when I then try to parse it with lxml using an XPath, I don't get any output. With the same XPath in Selenium, I do get output.

Using BeautifulSoup

for i in range(len(df)):
    response = requests.get(df['product_url'].iloc[i])
    my_url=df['product_url'].iloc[i]
    My_url= ureq(my_url)
    my_html=My_url.read()
    My_url.close()
    soup = BeautifulSoup(my_html, 'html.parser')
    dom = et.HTML(str(soup))
#price
    try:
      price=(dom.xpath('//*[@id="1b7a3ab3-2765-4ee2-8367-c8a0e7230fa4"]/span/text()'))
      df['price'].iloc[i]=price
    except:
      pass

Using Selenium

lst=[]
urls=df['product_url']
for url in urls[:599]:
    time.sleep(1)
    driver.get(url)
    time.sleep(2)
    try:
         prize=driver.find_element('xpath','//*[@id="1b7a3ab3-2765-4ee2-8367-c8a0e7230fa4"]/span').text
    except:
        pass
 
    lst.append([prize])
    pz=None
    dt=None

Does anyone know why I can't get the content with lxml when parsing with the same XPath alongside BeautifulSoup? Thank you very much.

Ulta sample link: [1]: https://www.ulta.com/p/coco-mademoiselle-eau-de-parfum-spray-pimprod2015831

Sephora sample link: [2]: https://www.sephora.com/product/coco-mademoiselle-P12495?skuId=513168&icid2=products

Answer 1

1. About the XPath

 driver.find_element('xpath','//*[@id="1b7a3ab3-2765-4ee2-8367-c8a0e7230fa4"]/span').text

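I'm a bit surprised that the Selenium code works for your Sephora links: the link you provided redirects to a productnotcarried page, but on this link (for example), the XPath has no matches. You can use //p[@data-comp="Price "]//span/b instead.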

Actually, even for Ulta, I prefer //*[@class="ProductHero__content"]//*[@class="ProductPricing"]/span just for human readability, although the same path looks even better written as a CSS selector:

prize=driver.find_element("css selector", '*.ProductHero__content *.ProductPricing>span').text

[Coding for Both Sites - Selenium]

To account for both sites, you could set up something like this reference dictionary:

xRef = {
    'www.ulta.com': '//*[@id="1b7a3ab3-2765-4ee2-8367-c8a0e7230fa4"]/span',
    'www.sephora.com': '//p[@data-comp="Price "]//span/b'
}

# for url in urls[:599]:... ################ REST OF CODE #############

and then use it accordingly:

# from urllib.parse import urlsplit

# lst, urls, xRef = ....


# for url in urls[:599]:
    # sleep...driver.get...sleep...
    try:
         uxrKey = urlsplit(url).netloc
         prize = driver.find_element('xpath', xRef[uxrKey]).text
    except:
        # pass # you'll just be repeating whatever you got in the previous loop for prize
        # [also, if this happens in the first loop, an error will be raised at lst.append([prize])]

        prize = None # 'MISSING' # '' #
################ REST OF CODE #############
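The host-based lookup above can be tried without a browser. The following is a minimal sketch, assuming the same two hosts and selectors as xRef; the names XPATH_BY_HOST and xpath_for are made up for illustration. Using .get means an unknown host returns None instead of raising KeyError, so the caller can decide what to do without a bare except.

```python
from urllib.parse import urlsplit

# Hypothetical per-site selector table, mirroring xRef above.
XPATH_BY_HOST = {
    "www.ulta.com": '//*[@id="1b7a3ab3-2765-4ee2-8367-c8a0e7230fa4"]/span',
    "www.sephora.com": '//p[@data-comp="Price "]//span/b',
}


def xpath_for(url):
    """Return the XPath registered for the URL's host, or None if the host is unknown."""
    return XPATH_BY_HOST.get(urlsplit(url).netloc)


if __name__ == "__main__":
    print(xpath_for("https://www.sephora.com/product/coco-mademoiselle-P12495?skuId=513168"))
    print(xpath_for("https://example.com/some-product"))
```

Inside the Selenium loop, a None result could be used to skip the URL and append None as the price.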

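2. Limitations of Scraping with bs4+requests

I don't know what et and ureq are, but the response from requests.get can be parsed without them; although [afaik] bs4 doesn't have any XPath support, CSS selectors can be used with .select.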

      price = soup.select('.ProductHero__content .ProductPricing>span') # for Ulta

      price = soup.select('p[data-comp~="Price"] span>b') # for Sephora

Although that is enough for Sephora, there's another issue: the prices on the Ulta pages are loaded with js, so the parent of the price span is empty.
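To illustrate why the same selector can succeed in Selenium and return nothing on the fetched HTML, here is a small sketch using only the standard library. Both markup strings and the price value are invented and heavily simplified; they are not the real Ulta page. The first stands in for what a plain HTTP request receives, the second for the DOM after the page's scripts have run.

```python
import xml.etree.ElementTree as ET

# Made-up, simplified markup (an assumption, not the actual page source).
STATIC_HTML = (
    '<div class="ProductHero__content">'
    '<div class="ProductPricing"></div>'
    '</div>'
)
RENDERED_HTML = (
    '<div class="ProductHero__content">'
    '<div class="ProductPricing"><span>$99.00</span></div>'
    '</div>'
)


def price_spans(markup):
    """Return the text of every span directly under a ProductPricing element."""
    root = ET.fromstring(markup)
    return [span.text for span in root.findall(".//*[@class='ProductPricing']/span")]


if __name__ == "__main__":
    print(price_spans(STATIC_HTML))    # the span does not exist yet, so nothing matches
    print(price_spans(RENDERED_HTML))  # after rendering, the span is present
```

The query itself is fine in both cases; the difference is only whether the span exists in the document being searched.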

3. [Suggested Solution] Extract from JSON inside `script` Tags

For both sites, product data can be found inside script tags, so this function can be used to extract the price from either site:

# import json

############ LONGER  VERSION ##########
def getPrice_fromScript(scriptTag):
    try:
        s, sj = scriptTag.get_text(), json.loads(scriptTag.get_text())
        while s:
            sPair = s.split('"@type"', 1)[1].split(':', 1)[1].split(',', 1)
            t, s = sPair[0].strip(), sPair[1]
            try: 
                if t == '"Product"': return sj['offers']['price'] # Ulta
                elif t == '"Organization"': return sj['offers'][0]['price'] # Sephora
                # elif.... # can add more options
                # else.... # can add a default
            except: continue
    except: return None
#######################################

############ SHORTER VERSION ##########
def getPrice_fromScript(scriptTag):
    try:
        sj = json.loads(scriptTag.get_text())
        try: return sj['offers']['price'] # Ulta
        except: pass
        try: return sj['offers'][0]['price'] # Sephora
        except: pass 
        # try...except: pass # can try more options
    except: return None
#######################################
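The two lookups above differ only in whether offers is a single object or a list. A minimal sketch of the same idea with explicit type checks instead of nested try/except follows; price_from_ld_json is a hypothetical name, it takes the script text rather than a tag, and the two sample payloads are invented for illustration rather than copied from either site.

```python
import json


def price_from_ld_json(text):
    """Return the price from a JSON-LD string, or None if it cannot be found."""
    try:
        data = json.loads(text)
    except (TypeError, ValueError):  # json.JSONDecodeError is a ValueError
        return None
    if not isinstance(data, dict):
        return None
    offers = data.get("offers")
    if isinstance(offers, list):     # list-shaped offers: take the first entry
        offers = offers[0] if offers else None
    if isinstance(offers, dict):     # object-shaped offers
        return offers.get("price")
    return None


# Invented sample payloads covering both shapes.
SINGLE_OFFER = '{"@type": "Product", "offers": {"@type": "Offer", "price": "10.00"}}'
OFFER_LIST = '{"@type": "Product", "offers": [{"@type": "Offer", "price": "12.50"}]}'

if __name__ == "__main__":
    print(price_from_ld_json(SINGLE_OFFER))
    print(price_from_ld_json(OFFER_LIST))
```

With BeautifulSoup, the text passed in would be scriptTag.get_text() for each script[type="application/ld+json"] tag.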

You can use it with your BeautifulSoup code:

# from requests_html import HTMLSession # IF you use instead of requests

# def getPrice_fromScript....

for i in range(len(df)):
    response = requests.get(df['product_url'].iloc[i]) # takes too long [for me]
    # response = HTMLSession().get(df['product_url'].iloc[i]) # is faster [for me]

    ## error handling, just in case ##
    if response.status_code != 200:
        errorMsg = f'Failed to scrape [{response.status_code} {response.reason}] - '
        print(errorMsg, df['product_url'].iloc[i])
        continue # skip to next loop/url

    soup = BeautifulSoup(response.content, 'html.parser')
    
    pList = [p.strip() for p in [
        getPrice_fromScript(s) for s in soup.select('script[type="application/ld+json"]')[:5] # [1:2]
    ] if p and p.strip()]

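    # write through a single indexer; chained df['price'].iloc[i] = ... may not update df
    if pList: df.iloc[i, df.columns.get_loc('price')] = pList[0]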

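(The price should be in the second script tag with type="application/ld+json", but this searches the first 5 just in case....)

Note: when I tested this code, requests.get was very slow, especially for Sephora, so I ended up using HTMLSession().get instead.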
