Data cleaning while web scraping using Beautiful Soup
import requests, re
from bs4 import BeautifulSoup
r = requests.get('https://www.nrtcfresh.com/products/whole/vegetables-whole', headers={'User-agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:61.0) Gecko/20100101 Firefox/61.0'})
c = r.content
soup=BeautifulSoup(c,"html.parser")
#print(soup.prettify())
all = soup.find_all("div",{"class":"col-sm-3 nrtc-p-10"})
all[1].find("h4").text
The output is provided below:
'\r\n Tomatoes\t\t\t\t (Turkey)\n'
To get "Turkey" as the output, I can use all[1].find('h4').find("span").text.replace(" ", "").replace("(","").replace(")","")
Is there a better way to write this code and, more importantly, how do I get just "Tomatoes" as the output?
<h4> " Tomatoes " <span>(Turkey)</span> </h4>
This is one way:
import requests
from bs4 import BeautifulSoup
countries = []
vegetables = []
remove = ['(', ')']
r = requests.get('https://www.nrtcfresh.com/products/whole/vegetables-whole', headers={'User-agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:61.0) Gecko/20100101 Firefox/61.0'})
c = r.content
soup=BeautifulSoup(c,"html.parser")
text = ''
all = soup.select("div.col-sm-3.nrtc-p-10 h4")
# Vegetables
print('Vegetables:\n')
for vegetable in all:
    # Take only the h4's own text node and skip the nested <span>.
    # (string= is the current name of the older text= argument.)
    name = vegetable.find(string=True, recursive=False).strip()
    print(name)
    vegetables.append(name)
# Countries:
print('\n\nCountries:\n')
for h4 in all:
    # Copy the span text one character at a time, skipping unwanted ones
    for t in h4.find('span').get_text(strip=True):
        if t not in remove:
            text += t
    print(text)
    countries.append(text)
    text = ''
# Vegetables and Countries
for v, c in zip(vegetables, countries):
    print(f'{v} - {c}')
This prints:
Vegetables:
Tapioca
Tomatoes
Rosemary
Beef Tomatoes
Red Cherry Tomatoes
Red Cherry Tomatoes (Vine)
Yellow Cherry Tomatoes
Plum Tomatoes
Plum Cherry Tomatoes
Vine Tomatoes
....
Countries:
Srilanka
Turkey
Kenya
Holland
Netherland
Netherland
Netherland
Netherland
Holland
Netherland
....
Tapioca - Srilanka
Tomatoes - Turkey
Rosemary - Kenya
Beef Tomatoes - Holland
Red Cherry Tomatoes - Netherland
Red Cherry Tomatoes (Vine) - Netherland
Yellow Cherry Tomatoes - Netherland
Plum Tomatoes - Netherland
Plum Cherry Tomatoes - Holland
Vine Tomatoes - Netherland
Turnip - Iran
Baby Turnip - South Africa
Yams (Suran) - India
Green Baby Zucchini - South Africa
....
Note: I have shortened the printed output here.
This method is especially useful when there are many different unwanted characters, since each one only needs to be added to the remove list.
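As a variation on the character-by-character loop, the same filtering can be done in one call with str.translate. This is only a minimal sketch: it works on hard-coded strings shaped like the "(Turkey)" span text instead of the live page, and the names REMOVE, DELETE_TABLE, clean_label and raw_labels are illustrative, not part of the original answer.

```python
# Characters to drop from the scraped country labels.
REMOVE = "()"

# str.maketrans("", "", chars) builds a table that deletes every char in `chars`.
DELETE_TABLE = str.maketrans("", "", REMOVE)


def clean_label(raw: str) -> str:
    """Delete the unwanted characters, then trim surrounding whitespace."""
    return raw.translate(DELETE_TABLE).strip()


# Hypothetical raw <span> texts, in the same shape as "(Turkey)" above.
raw_labels = ["(Turkey)", " (South Africa)\n", "(Iran)"]
countries = [clean_label(label) for label in raw_labels]
print(countries)  # ['Turkey', 'South Africa', 'Iran']
```

Unlike the chained .replace(" ", "") from the question, this keeps the space inside multi-word names such as "South Africa".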
My understanding is that you're looking for the vegetable name only, not the country. If you are happy to discard the country name, you can do the following:
# Delete the country spans
for span in soup.select("div.col-sm-3.nrtc-p-10 h4 span"):
    span.extract()
# Get a list of all the vegetables
veg_list = [h4.text.strip() for h4 in soup.select("div.col-sm-3.nrtc-p-10 h4")]
# Print one vegetable per line
print("\n".join(veg_list))
Tapioca
Tomatoes
Rosemary
Beef Tomatoes
Red Cherry Tomatoes
...
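If the country is still needed later, removing the spans may be undesirable because extract() modifies the parsed tree. A possible non-destructive alternative is Beautiful Soup's stripped_strings generator, sketched below on a hard-coded fragment modelled on the <h4> from the question. SAMPLE_HTML and pairs are illustrative names, and the sketch assumes every <h4> holds a name followed by one country span.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, modelled on the <h4> shown in the question.
SAMPLE_HTML = """
<div class="col-sm-3 nrtc-p-10">
  <h4>
    Tomatoes     <span>(Turkey)</span>
  </h4>
</div>
<div class="col-sm-3 nrtc-p-10">
  <h4>Red Cherry Tomatoes (Vine) <span>(Netherland)</span></h4>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

pairs = []
for h4 in soup.select("div.col-sm-3.nrtc-p-10 h4"):
    # stripped_strings yields each text node with surrounding whitespace
    # removed and skips whitespace-only nodes: first the name, then the span.
    # Assumption: each h4 contains at least these two strings.
    name, country = list(h4.stripped_strings)[:2]
    pairs.append((name, country.strip("()")))

# The tree is left untouched, so the <span> elements are still in the soup.
print(pairs)
```

Because only the span text is stripped of parentheses, a name such as "Red Cherry Tomatoes (Vine)" keeps its own brackets.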