Scraping using BeautifulSoup, value is not clean

I'm trying to scrape a nutrition label (http://smartlabel.generalmills.com/41196891218), and I'm having a hard time getting a clean gram value for each category.

For example, this is how it comes out for fat: ('fat': '\n 1 g\n ',)

Is there any way to get something like this instead: ("fat": 1g)?

I just started learning bs4 yesterday, so any help will be appreciated.

My code is:

def minenutrition1(link):
    driver = webdriver.Chrome()
    driver.get(link)
    # noticed there is an ad here, sleep til page fully loaded.
    time.sleep(1)
    soup = BeautifulSoup(driver.page_source)
    driver.quit()
    calories=soup.find_all("span",{"class":"header2"})[0].text
    fat=soup.find_all("span",{"class":"gram-value"})[0].text
    satfat=soup.find_all("span",{"class":"gram-value"})[1].text
    cholesterol=soup.find_all("span",{"class":"gram-value"})[3].text
    sodium=soup.find_all("span",{"class":"gram-value"})[4].text
    carb=soup.find_all("span",{"class":"gram-value"})[5].text
    Total_sugar=soup.find_all("span",{"class":"gram-value"})[7].text
    protein=soup.find_all("span",{"class":"gram-value"})[9].text
    name = soup.find_all('div',{'class': 'product-header-name header1'})[0].text
    upc=soup.find_all("div",{"class":"upc sub-header"})
    upc=upc[0].text

You get a normal string, "\n 1 g\n ", so you can use string methods to clean or change it.

Using "\n 1 g\n ".strip() you get "1 g".

So you can add .strip() at the end of this line:

fat = soup.find_all("span",{"class":"gram-value"})[0].text.strip()

or do it later:

fat = fat.strip()

BeautifulSoup also has the .get_text(strip=True) method, which you can use instead of .text:

fat = soup.find_all("span",{"class":"gram-value"})[0].get_text(strip=True)

Minimal working code.

I print fat between > and < to make any surrounding spaces, tabs, or newlines visible.

from selenium import webdriver
from bs4 import BeautifulSoup
import time

url = 'http://smartlabel.generalmills.com/41196891218'
driver = webdriver.Chrome()
#driver = webdriver.Firefox()
driver.get(url)

# noticed there is an ad here, sleep til page fully loaded.
time.sleep(1)

# name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()

items = soup.find_all("span", {"class": "gram-value"})

fat = items[0].text
print('>{}<'.format(fat))

fat = items[0].text.strip()
print('>{}<'.format(fat))

fat = items[0].get_text(strip=True)
print('>{}<'.format(fat))

Result:

>
                                    1 g
                                <
>1 g<
>1 g<
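
If you want the compact form from the question ("1g"), or the number separated from its unit, the stripped string can be split further. The following is only a sketch: parse_amount is a hypothetical helper name, and it assumes every value looks like a number followed by a unit, as in the values shown above.

```python
import re

def parse_amount(raw):
    """Hypothetical helper: split a scraped value such as '\n 1 g\n ' into (number, unit)."""
    text = " ".join(raw.split())  # collapse every run of whitespace -> '1 g'
    match = re.fullmatch(r"(\d+(?:\.\d+)?)\s*(\D+)", text)
    if match is None:
        return None  # unexpected format; let the caller decide what to do
    return float(match.group(1)), match.group(2).strip()

raw_fat = "\n                                    1 g\n                                "
value, unit = parse_amount(raw_fat)
compact = "{:g}{}".format(value, unit)  # number and unit without the space
print({"fat": compact})
```

Returning None for anything that does not match keeps the helper from silently producing wrong numbers if the page layout changes.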

For this, I would not use Selenium. Not that you can't, but the site is static, and you can get the HTML source straight away with requests. This is a bit of a stretch since you are just beginning with BeautifulSoup, but if you open Dev Tools (Ctrl-Shift-I) and reload the page, you will notice the requests listed in the right panel under Network -> XHR. There is a request to GetNutritionalDetails.

In there, you'll see the request URL, the request headers, and at the bottom the payload. You will also see it's a POST request (usually you'll use GET).
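
The payload in that panel is a set of key/value pairs. As a rough illustration, here is how such pairs look once URL-encoded, using only the standard library. The two fields (id and servingSize) are the ones used in the requests code later in this answer; whether the endpoint expects them in the query string or in the request body is not confirmed here.

```python
from urllib.parse import urlencode

# Field names and values copied from the requests example in this answer.
payload = {"id": "41196891218", "servingSize": "AS PACKAGED"}

# urlencode joins the pairs with '&' and turns spaces into '+'
encoded = urlencode(payload)
print(encoded)
```

With requests, passing the dict as params= appends this string to the URL, while data= would send it in the body of the POST.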


The data is within a list (<li> tags). So it's just a matter of getting all those tags, then iterating through each of them to pull out the other data.

You can append that data to lists, and then turn those lists into a table/dataframe with pandas.

Code:

import requests
from bs4 import BeautifulSoup
import pandas as pd


url = 'http://smartlabel.generalmills.com/GTIN/GetNutritionalDetails'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36'}

payload = {
'id': '41196891218',
'servingSize': 'AS PACKAGED'}

response = requests.post(url, headers=headers, params=payload)

soup = BeautifulSoup(response.text, 'html.parser')
listItems = soup.find_all('li')

labels = []
gramValues = []
percValues = []

for each in listItems:
    label = each.find('label').text.strip()
    if label == 'Includes':
        label += ' Added Sugar'
    gram = each.find('span', {'class':'gram-value'}).text.strip()
    if each.find('span', {'class':'dv-result'}):
        perc = each.find('span', {'class':'dv-result'}).text.strip()
    else:
        perc = ''

    labels.append(label)
    gramValues.append(gram)
    percValues.append(perc)


df = pd.DataFrame({
        'Label':labels,
        'Grams':gramValues,
        'Percent':percValues})

Output:

print (df)
                   Label   Grams Percent
0              Total Fat     1 g     1 %
1          Saturated Fat     0 g     0 %
2              Trans Fat     0 g        
3            Cholesterol    0 mg     0 %
4                 Sodium  810 mg    35 %
5     Total Carbohydrate    17 g     6 %
6          Dietary Fiber     2 g     6 %
7            Total Sugar     2 g        
8   Includes Added Sugar     2 g     3 %
9                Protein     4 g        
10             Vitamin D    0 µg     0 %
11               Calcium   60 mg     4 %
12                  Iron  1.2 mg     6 %
13             Potassium    0 mg     0 %
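
If you would rather have a plain dict, closer to what the question asked for, the same lists can be zipped together without pandas. A minimal sketch, using a few sample rows copied from the output above in place of the scraped lists (gram_values stands in for gramValues):

```python
# Sample rows copied from the output table above; in the real script these
# lists are filled by the loop over the <li> tags.
labels = ["Total Fat", "Saturated Fat", "Sodium", "Protein"]
gram_values = ["1 g", "0 g", "810 mg", "4 g"]

# pair each label with its value, in order
nutrition = dict(zip(labels, gram_values))
print(nutrition["Total Fat"])
```

This relies on the labels being unique; a repeated label would overwrite the earlier entry.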
