
Can't scrape certain fields with messy formatting from a webpage

I've written a Python script to get some items from a webpage. The problem is that the content I want to grab is not separated into its own tags, classes, or ids. I'm only interested in the address and the phone number, but all of the fields are stacked inside a single p tag. Given that, I tried to gather them in the following manner.

site address

I've tried with:

import re
import requests
from bs4 import BeautifulSoup

url = 'https://ams.contractpackaging.org/i4a/memberDirectory/?controller=memberDirectory&action=resultsDetail&directory_id=6&detail_lookup_id=90DB59F83AFA02C0'

res = requests.get(url,headers={'User-Agent':'Mozilla/5.0'})
soup = BeautifulSoup(res.text,'lxml')

address = soup.find(class_="memeberDirectory_details").find("p").text.split("Phone")[0].strip()
phone = soup.find(class_="memeberDirectory_details").find("p",text=re.compile("Phone:(.*)"))
print(address,phone)

This yields the following (the address includes the company name, which I don't want, and the phone comes back as None):

Assemblers Inc.

2850 West Columbus Ave.


Chicago IL 60652

UNITED STATES
None

Expected output:

2850 West Columbus Ave.
Chicago IL 60652
UNITED STATES

(773) 378-3000

You could try this code to extract address and phone:

import requests
from bs4 import BeautifulSoup
from itertools import takewhile

url = 'https://ams.contractpackaging.org/i4a/memberDirectory/?controller=memberDirectory&action=resultsDetail&directory_id=6&detail_lookup_id=90DB59F83AFA02C0'

soup = BeautifulSoup(requests.get(url).text, 'lxml')

address_soup = soup.select_one('.memeberDirectory_details > p')

# remove company name in <b> tag
for b in address_soup.select('b'):
    b.extract()

data = [val.strip() for val in address_soup.get_text(separator='|').split('|') if val.strip()]

address = [*takewhile(lambda k: 'Phone:' not in k, data)]
phone = [val.replace('Phone:', '').strip() for val in data if 'Phone:' in val]

print('Address:')
print('\n'.join(address))
print()

print('Phone:')
print('\n'.join(phone))

Prints:

Address:
2850 West Columbus Ave.
Chicago IL 60652
UNITED STATES

Phone:
(773) 378-3000

EDIT:

To find the text with a regular expression, search the text nodes directly instead of passing the tag name "p" (this needs import re, as in the question):

phone = soup.find(class_="memeberDirectory_details").find(text=re.compile("Phone:(.*)"))
print(phone)

Prints:

Phone: (773) 378-3000
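
A likely reason the original find("p", text=...) call returned None: when a tag name and a text pattern are given together, BeautifulSoup compares the pattern against the tag's .string, and .string is None for a tag with several children. The sketch below illustrates this on a hand-written HTML fragment. The fragment is an assumption about the page's structure (the real page may differ), and it uses the built-in html.parser and the newer string= keyword rather than lxml and text=.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical fragment imitating the page layout; not copied from the site.
html = """
<div class="memeberDirectory_details">
<p><b>Assemblers Inc.</b><br>
2850 West Columbus Ave.<br>
Chicago IL 60652<br>
UNITED STATES<br>
Phone: (773) 378-3000<br>
</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
details = soup.find(class_="memeberDirectory_details")

# Tag name + pattern: the pattern is tested against p.string,
# which is None because <p> has more than one child.
tag_match = details.find("p", string=re.compile(r"Phone:"))

# Pattern only: BeautifulSoup searches the individual text nodes.
node = details.find(string=re.compile(r"Phone:"))

# Pull just the number out of the matched text node.
match = re.search(r"Phone:\s*(.+)", node) if node else None
phone = match.group(1).strip() if match else None

print(tag_match)
print(phone)
```

The capturing group keeps only the part after the "Phone:" label, so no extra replace() step is needed.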

Instead of finding the <p> tag, splitting its text, and then locating each field individually, take the <p> tag and store all of its <br>-separated items in a list. If the list always has the same layout, you can simply pop off the first element (the company name). If you prefer to stay with your approach, you could split the address at the first occurrence of a digit, but that would break for company names that contain a number.
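
The list-based idea could look roughly like the sketch below. It runs on a hand-written HTML fragment that is only an assumption about the page's structure, and it assumes the company name is always the first item. Pages with extra fields (a fax line, for example) would need further filtering.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment imitating the page layout; not copied from the site.
html = """
<div class="memeberDirectory_details">
<p><b>Assemblers Inc.</b><br>
2850 West Columbus Ave.<br>
Chicago IL 60652<br>
UNITED STATES<br>
Phone: (773) 378-3000<br>
</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
p = soup.find(class_="memeberDirectory_details").find("p")

# stripped_strings yields each non-empty text node with whitespace trimmed,
# which gives one list item per <br>-separated line.
lines = list(p.stripped_strings)

# Assumption: the first item is the company name, so drop it.
company, *rest = lines

# Everything that is not the phone line is treated as the address.
address = [line for line in rest if not line.startswith("Phone:")]
phone = next(
    (line.split("Phone:", 1)[1].strip() for line in rest if line.startswith("Phone:")),
    None,
)

print("\n".join(address))
print(phone)
```

Dropping the first element by position avoids guessing where the name ends, which is why it does not trip over company names that contain digits.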
