Python BeautifulSoup accounting for missing data on website when writing to csv
I am practicing my web scraping skills on the following website: "http://web.californiacraftbeer.com/Brewery-Member"
The code I have so far is below. I'm able to grab the fields that I want and write the information to CSV, but the information in each row does not match the actual company details. For example, Company A has the contact name for Company D and the phone number for Company E in the same row.

Since some data does not exist for certain companies, how can I account for this when writing rows that should be separated per company to CSV? What is the best way to make sure that I am grabbing the correct information for the correct companies when writing to CSV?
"""
Grabs brewery name, contact person, phone number, website address, and email address
for each brewery listed.
"""
import requests, csv
from bs4 import BeautifulSoup
url = "http://web.californiacraftbeer.com/Brewery-Member"
res = requests.get(url)
soup = BeautifulSoup(res.content, "lxml")
company_name = soup.find_all(itemprop="name")
contact_name = soup.find_all("div", {"class": "ListingResults_Level3_MAINCONTACT"})
phone_number = soup.find_all("div", {"class": "ListingResults_Level3_PHONE1"})
website = soup.find_all("span", {"class": "ListingResults_Level3_VISITSITE"})
def scraper():
"""Grabs information and writes to CSV"""
print("Running...")
results = []
count = 0
for company, name, number, site in zip(company_name, contact_name, phone_number, website):
print("Grabbing {0} ({1})...".format(company.text, count))
count += 1
newrow = []
try:
newrow.append(company.text)
newrow.append(name.text)
newrow.append(number.text)
newrow.append(site.find('a')['href'])
except Exception as e:
error_msg = "Error on {0}-{1}".format(number.text,e)
newrow.append(error_msg)
results.append(newrow)
print("Done")
outFile = open("brewery.csv","w")
out = csv.writer(outFile, delimiter=',',quoting=csv.QUOTE_ALL, lineterminator='\n')
out.writerows(results)
outFile.close()
def main():
"""Runs web scraper"""
scraper()
if __name__ == '__main__':
main()
Any help is very much appreciated!
You need to use a single zip to iterate over all of these lists in parallel:

for company, name, number, site in zip(company_name, contact_name, phone_number, website):
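Note that zip pairs items purely by position and stops at the shortest list, so this only lines up correctly when every list has exactly one entry per company; a single missing field on the page still shifts every later pairing. A toy illustration with made-up data (not taken from the site):

```python
# Flat lists, as produced by separate find_all() calls; company "B" has
# no phone listed, so its slot is simply absent from `phones`.
companies = ["A", "B", "C"]
phones = ["111-1111", "333-3333"]

# zip pairs by position: "B" silently receives "C"'s phone number,
# and "C" is dropped entirely because zip stops at the shortest list.
pairs = list(zip(companies, phones))
print(pairs)
```

This is exactly the misalignment described in the question, which is why scoping the lookups per company (as in the accepted approach below) is the more robust fix.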
Thanks for the help. I realized that since the company details for each company are contained in the Div class "ListingResults_All_CONTAINER ListingResults_Level3_CONTAINER", I could write a nested for-loop that iterates through each of these Divs and then grabs the information I want within the Div.
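A sketch of that per-container approach, using the class names from the question but a small inline HTML snippet (made up to mimic the page structure) instead of a live request, so the missing-field case is easy to see. Because every lookup is scoped to one company's container, an absent field yields an empty cell rather than shifting later rows:

```python
import csv
from bs4 import BeautifulSoup

# Toy HTML mimicking the page structure; company "Beta" has no phone
# or website, which is the case that misaligned rows in the flat approach.
html = """
<div class="ListingResults_All_CONTAINER ListingResults_Level3_CONTAINER">
  <span itemprop="name">Alpha Brewing</span>
  <div class="ListingResults_Level3_MAINCONTACT">Alice</div>
  <div class="ListingResults_Level3_PHONE1">111-1111</div>
  <span class="ListingResults_Level3_VISITSITE"><a href="http://alpha.example">site</a></span>
</div>
<div class="ListingResults_All_CONTAINER ListingResults_Level3_CONTAINER">
  <span itemprop="name">Beta Brewing</span>
  <div class="ListingResults_Level3_MAINCONTACT">Bob</div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

def text_or_blank(tag):
    """Stripped text of a found tag, or '' when the field is absent."""
    return tag.get_text(strip=True) if tag else ""

rows = []
# Scope every lookup to one company's container so a missing field
# produces an empty cell instead of shifting the remaining columns.
for container in soup.select("div.ListingResults_Level3_CONTAINER"):
    site = container.find("span", class_="ListingResults_Level3_VISITSITE")
    link = site.find("a") if site else None
    rows.append([
        text_or_blank(container.find(itemprop="name")),
        text_or_blank(container.find("div", class_="ListingResults_Level3_MAINCONTACT")),
        text_or_blank(container.find("div", class_="ListingResults_Level3_PHONE1")),
        link["href"] if link and link.has_attr("href") else "",
    ])

with open("brewery.csv", "w", newline="") as f:
    csv.writer(f, quoting=csv.QUOTE_ALL).writerows(rows)

print(rows[1])  # Beta's row keeps blanks where data is missing
```

For the real page you would build `soup` from `requests.get(url).content` as in the question; the per-container scoping is what keeps each row self-consistent.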