Website scraping with Python 2.7 and BeautifulSoup 4

I am stuck at one point while scraping the website "http://www.queensbronxba.com/directory/" with BeautifulSoup. I'm almost done: the only thing left is the company name for each entry in the list, which sits in a paragraph tag. The problem is that each div contains several paragraph tags, and I only need the first one, since that is the one holding the company name. I need the first paragraph of every following div as well, not just of the first one. This is the code I used to scrape:

import requests
from bs4 import BeautifulSoup

page = requests.get("http://www.queensbronxba.com/directory/")
soup = BeautifulSoup(page.content, 'html.parser')  
company = soup.find(class_="boardMemberWrap")  
contact = company.find_all(class_="boardMember")  
info = contact[0]
print(info.prettify())

name_tags = company.select("h4")
names = [nt.get_text() for nt in name_tags]  # iterate over name_tags; company_tags is not defined yet at this point
names

company_tags = company.select("p")  #here I need help to get only first paragraphs of following div containers  
companies = [ct.get_text() for ct in company_tags]  
companies

phone_tags = company.select('a[href^="tel"]')  
phones = [pt.get_text() for pt in phone_tags]  
phones

email_tags = company.select('a[href^="mailto"]')  
emails = [et.get_text() for et in email_tags]  
emails

import requests
from bs4 import BeautifulSoup

page = requests.get("http://www.queensbronxba.com/directory/")
soup = BeautifulSoup(page.content, 'html.parser')  
company = soup.find(class_="boardMemberWrap")  
contact = company.find_all(class_="boardMemberInfo")
info = contact[0]
print(info.prettify())


name_tags = company.select("h4")
names = [nt.get_text() for nt in name_tags]
print(names)


# Visit each member container and print only its first paragraph (the company name)
for member in company.find_all(class_="boardMember"):
    for p in member.find_all('p')[:1]:
        print(p.text)


phone_tags = company.select('a[href^="tel"]')  
phones = [pt.get_text() for pt in phone_tags]  
print(phones)


email_tags = company.select('a[href^="mailto"]')  
emails = [et.get_text() for et in email_tags]  
print(emails)
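
The four separate lists above can drift out of step if one entry has no phone or email, so here is a rough sketch of the same idea that builds one record per container instead. It runs on a small inline sample rather than the live site: only the class names `boardMemberWrap` and `boardMember` come from the question, while the sample markup, `SAMPLE_HTML`, `text_or_none` and `parse_members` are made up for illustration, and the real page may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: only the class names come from the question;
# the structure of the real page may differ.
SAMPLE_HTML = """
<div class="boardMemberWrap">
  <div class="boardMember">
    <h4>Jane Doe</h4>
    <p>Acme Plumbing</p>
    <p>12 Main St</p>
    <a href="tel:5550100">555-0100</a>
    <a href="mailto:jane@example.com">jane@example.com</a>
  </div>
  <div class="boardMember">
    <h4>John Roe</h4>
    <p>Roe Electric</p>
    <a href="tel:5550101">555-0101</a>
  </div>
</div>
"""


def text_or_none(tag):
    """Return the stripped text of a tag, or None if the tag was not found."""
    return tag.get_text(strip=True) if tag is not None else None


def parse_members(html):
    """Build one dict per .boardMember container."""
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for card in soup.select(".boardMemberWrap .boardMember"):
        records.append({
            "name": text_or_none(card.find("h4")),
            # find() returns only the first matching <p> inside this container
            "company": text_or_none(card.find("p")),
            "phone": text_or_none(card.select_one('a[href^="tel"]')),
            "email": text_or_none(card.select_one('a[href^="mailto"]')),
        })
    return records


members = parse_members(SAMPLE_HTML)
for member in members:
    print(member)
```

Because `find("p")` is called on each container rather than on the whole wrapper, it picks up the first paragraph of every entry, and a missing field shows up as `None` instead of shifting the later values.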
