I am working on an assignment that requires accessing a school web page and displaying all the phone numbers listed on it using Beautiful Soup.
The code below is what I have so far. It runs without errors, but the print call never prints any phone numbers. I think the problem may be with the class selectors and the arguments I pass for them, but I am not sure.
import requests, bs4, re
res = requests.get('http://catalog.tri-c.edu/about/important-phone-numbers/')
res.raise_for_status()
catalogPage = bs4.BeautifulSoup(res.text, 'html.parser')
selectors = ['.column1''.column2''.column3''.column4']
for selector in selectors:
    elements = catalogPage.select('.column1''.column2''.column3''.column4')
    for element in elements:
        phoneRegex = re.compile(r'([0-9])\d\d\W\S\d\d\W\S\d\d\d')
        match = element.getText(phoneRegex)
        if match == None:
            continue
        print("Phone Number: ", match)
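I found 3 errors, which I've fixed below:
1. selectors must be written as a list, with a comma separating each item from the next.
2. The regex pattern needed fixing so that it matches the phone number format.
3. The regex has to be applied with element.find_all(text = phoneRegex) rather than passed to getText().
Here's the corrected code: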
# This must be a list
selectors = ['.column1','.column2','.column3','.column4']
for selector in selectors:
    # This will return a list for each selector
    elements = catalogPage.select(selector)
    for element in elements:
        # Fix the regex pattern
        phoneRegex = re.compile(r'[0-9]{3}\-[0-9]{3}\-[0-9]{4}')
        match = element.find_all(text = phoneRegex)
        if not match:
            continue
        # Otherwise
        print(f"Phone Number: {match[0]}")
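As a quick offline check of the corrected pattern, here is a minimal sketch that uses only the standard library. The sample text and the labels in it are made up for illustration and are not the real page content. The dash is left unescaped here, which matches the same strings as `\-` outside a character class.

```python
import re

# Same shape as the corrected pattern: 3 digits, 3 digits, 4 digits,
# separated by dashes.
phone_regex = re.compile(r'[0-9]{3}-[0-9]{3}-[0-9]{4}')

# Hypothetical sample text standing in for the text of a few table cells.
sample_text = "Admissions 216-987-2226, Bookstore 216-987-2256, room 1234"

# findall returns every non-overlapping match as a string.
numbers = phone_regex.findall(sample_text)
print(numbers)
```

Running the pattern against a fixed string like this makes it easier to tell a regex problem apart from a selector problem before touching the live page.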
I updated your code and added some comments so you understand better what was wrong:
import requests, bs4, re
res = requests.get('http://catalog.tri-c.edu/about/important-phone-numbers/')
res.raise_for_status()
catalogPage = bs4.BeautifulSoup(res.text, 'html.parser')
# Your version of selectors (this is actually a
# list of only one concatenated string):
selectors = ['.column1''.column2''.column3''.column4']
# This prints: ['.column1.column2.column3.column4']
print(selectors)
# You missed the commas between the selectors,
# this will give you an actual list:
selectors = ['.column1','.column2','.column3','.column4']
# This prints instead ['.column1', '.column2', '.column3', '.column4']
print(selectors)
for selector in selectors:
    # You want to pass only one selector at a time:
    elements = catalogPage.select(selector)
    for element in elements:
        phoneRegex = re.compile(r'([0-9])\d\d\W\S\d\d\W\S\d\d\d')
        # The regex is not a valid argument to element.getText().
        # First, get the text from the element node.
        # Then, check whether that text matches your phone regex.
        match = phoneRegex.match(element.getText())
        # In Python, compare against None with the `is` operator.
        if match is None:
            continue
        print("Phone Number: ", match.group())
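One caveat worth keeping in mind, sketched below with made-up strings: `match` only succeeds when the pattern matches at the very start of the text, while `search` scans the whole string. Whether any cell on the real page has leading whitespace or a label before the number is an assumption; the `padded` value is purely hypothetical.

```python
import re

# Pattern copied from the question.
phone_regex = re.compile(r'([0-9])\d\d\W\S\d\d\W\S\d\d\d')

clean = "216-987-2226"          # text that starts with the number
padded = "  Tel: 216-987-2226"  # hypothetical text with a prefix

at_start = phone_regex.match(clean)   # matches: number is at position 0
missed = phone_regex.match(padded)    # None: text does not start with a digit
found = phone_regex.search(padded)    # matches: search looks past the prefix

print(at_start.group())
print(missed)
print(found.group())
```

If the cells contain nothing but the number, `match` is enough; `search` is the safer choice when the surrounding text is not under your control.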
import pandas as pd
df = pd.read_html("http://catalog.tri-c.edu/about/important-phone-numbers/")[0]
df.to_csv("data.csv", index=False)
Output: view-online
With pandas, each column can be accessed as a list:
print(df["Eastern Campus"].to_list())
print(df["Metropolitan Campus"].to_list())
print(df["Western Campus"].to_list())
print(df["Westshore Campus"].to_list())
Output:
['216-987-2226', '216-987-2256', '216-987-2070', '216-987-4325', '216-987-2567', '216-987-6000', '216-987-6000', '216-987-6000', '216-987-0595', '216-987-6000', '216-987-6000', '216-987-6000', '216-987-2045', '216-987-2230', '216-987-2343', '216-987-2013']
['216-987-4225', '216-987-4311', '216-987-4550', '216-987-4325', '216-987-4913', '216-987-6000', '216-987-6000', '216-987-6000', '216-987-4292', '216-987-6000', '216-987-6000', '216-987-6000', '216-987-4610', '216-987-4290', '216-987-4253', '216-987-6137']
['216-987-5227', '216-987-5256', '216-987-5550', '216-987-4325', '216-987-5575', '216-987-6000', '216-987-6000', '216-987-6000', '216-987-5656', '216-987-6000', '216-987-6000', '216-987-6000', '216-987-5428', '216-987-5079', '216-987-5683', '216-987-5204']
['216-987-5588', '216-987-3888', '216-987-3908', '216-987-4325', '216-987-2067', '216-987-6000', '216-987-6000', '216-987-6000', '216-987-3888', '216-987-6000', '216-987-6000', '216-987-6000', '216-987-5929', '216-987-5732', '216-987-5902', '216-987-3536']
With bs4, as far as I can see there's no reason to use regex at all:
from bs4 import BeautifulSoup
import requests
r = requests.get("http://catalog.tri-c.edu/about/important-phone-numbers/")
soup = BeautifulSoup(r.text, 'html.parser')
column1 = [item.text for item in soup.find_all("td", class_="column1")]
print(column1)
Output:
['216-987-2226', '216-987-2256', '216-987-2070', '216-987-4325', '216-987-2567', '216-987-6000', '216-987-6000', '216-987-6000', '216-987-0595', '216-987-6000', '216-987-6000', '216-987-6000', '216-987-2045', '216-987-2230', '216-987-2343', '216-987-2013', '216-987-3075', '216-987-3075', '216-987-3075']