My code using Beautiful Soup 4 does not display anything. What's wrong?

I am working on an assignment that requires fetching a page from my school's site and printing all the phone numbers listed on it, using Beautiful Soup.

The code below is what I have so far. It runs without errors, but the print call never prints any phone numbers. I suspect the problem is with the class selectors and the arguments I pass to them, but I am not sure.

import requests, bs4, re

res = requests.get('http://catalog.tri-c.edu/about/important-phone-numbers/')
res.raise_for_status()

catalogPage = bs4.BeautifulSoup(res.text, 'html.parser')

selectors = ['.column1''.column2''.column3''.column4']

for selector in selectors:

  elements = catalogPage.select('.column1''.column2''.column3''.column4')

  for element in elements:

    phoneRegex = re.compile(r'([0-9])\d\d\W\S\d\d\W\S\d\d\d')

    match = element.getText(phoneRegex)

    if match == None:
      continue

    print("Phone Number: ", match)

I found 3 errors, which I've fixed below:

  1. Your selectors must be written as a list, with a comma separating each item from the next.
  2. Your regex pattern was too loose: \W and \S match much more than a dash and a digit.
  3. To search text with a regex in bs4, pass the pattern as a parameter: element.find_all(text = phoneRegex)

Here's the corrected code:

# This must be a list
selectors = ['.column1','.column2','.column3','.column4']

for selector in selectors:

    # This will return a list for each selector
    elements = catalogPage.select(selector)


    for element in elements:

        # Fix the regex pattern
        phoneRegex = re.compile(r'[0-9]{3}\-[0-9]{3}\-[0-9]{4}')
        match = element.find_all(text = phoneRegex)

        if not match:
            continue

        # Otherwise
        print(f"Phone Number: {match[0]}")
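As a way to check the fixes above without hitting the network, here is a minimal sketch that runs against an inline HTML snippet. The snippet is a hypothetical stand-in for the catalog page (the real markup may differ, and the "Office A"/"Office B" labels are made up). It also shows an alternative to looping over a list: a single comma-separated CSS selector group, which matches cells that have any one of the listed classes.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical stand-in for the catalog page; the real markup may differ.
SAMPLE_HTML = """
<table>
  <tr><td class="column0">Office A</td>
      <td class="column1">216-987-2226</td>
      <td class="column2">216-987-4225</td></tr>
  <tr><td class="column0">Office B</td>
      <td class="column1">216-987-2256</td>
      <td class="column2">n/a</td></tr>
</table>
"""

PHONE_RE = re.compile(r"\d{3}-\d{3}-\d{4}")


def extract_phone_numbers(html):
    """Return every phone-like string found in the .column1-.column4 cells."""
    soup = BeautifulSoup(html, "html.parser")
    # "A, B" is a selector group: it matches elements with class A *or* B.
    # ".column1.column2" (no comma) would require both classes on one element.
    cells = soup.select(".column1, .column2, .column3, .column4")
    numbers = []
    for cell in cells:
        numbers.extend(PHONE_RE.findall(cell.get_text()))
    return numbers


numbers = extract_phone_numbers(SAMPLE_HTML)
print(numbers)
```

The concatenated selector from the question, '.column1.column2.column3.column4', selects nothing on markup like this, which would explain why the original loop body never ran and nothing was printed.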

I updated your code and added some comments so you can better understand what was wrong:

import requests, bs4, re

res = requests.get('http://catalog.tri-c.edu/about/important-phone-numbers/')
res.raise_for_status()

catalogPage = bs4.BeautifulSoup(res.text, 'html.parser')

# Your version of selectors (this is actually a 
# list of only one concatenated string):
selectors = ['.column1''.column2''.column3''.column4']
# This prints: ['.column1.column2.column3.column4']
print(selectors)

# You missed the commas between the selectors, 
# this will give you an actual list:
selectors = ['.column1','.column2','.column3','.column4']
# This prints instead ['.column1', '.column2', '.column3', '.column4']
print(selectors)

for selector in selectors:
    # You want to use only one selector at a time:
    elements = catalogPage.select(selector)

    for element in elements:
        phoneRegex = re.compile(r'([0-9])\d\d\W\S\d\d\W\S\d\d\d')

        # A regex is not a valid argument to element.getText().
        # First, get the text from the element node.
        # Then, check whether it matches your phone regex.
        match = phoneRegex.match(element.getText())

        # In Python, compare against None with the `is` operator.
        if match is None:
            continue

        print("Phone Number: ", match.group())
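One caveat worth illustrating: match() only tries the pattern at the start of the string, so it works here only if each cell's text begins with the number. The sketch below (standard library only) compares match() with search(), and the original pattern with a stricter one. The helper name first_phone and the sample strings are made up for illustration.

```python
import re

# Pattern from the question: \W and \S accept far more than "-" and digits.
loose = re.compile(r'([0-9])\d\d\W\S\d\d\W\S\d\d\d')
# Stricter alternative: exactly ddd-ddd-dddd.
strict = re.compile(r'\d{3}-\d{3}-\d{4}')


def first_phone(text, pattern=strict):
    """Return the first substring of text matching pattern, or None."""
    found = pattern.search(text)  # search() scans the whole string
    return found.group() if found is not None else None


print(strict.match("Call 216-987-2226"))   # None: match() only tries position 0
print(first_phone("Call 216-987-2226"))    # 216-987-2226
print(first_phone("216 x87.y226", loose))  # 216 x87.y226 (a false positive)
print(first_phone("216 x87.y226"))         # None
```

If a cell may contain text before the number, search() or findall() is probably the safer choice.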

Alternatively, pandas can read the HTML table directly:

import pandas as pd


df = pd.read_html("http://catalog.tri-c.edu/about/important-phone-numbers/")[0]

df.to_csv("data.csv", index=False)

Each DataFrame column can then be converted to a list:

print(df["Eastern Campus"].to_list())
print(df["Metropolitan Campus"].to_list())
print(df["Western Campus"].to_list())
print(df["Westshore Campus"].to_list())

Output:

['216-987-2226', '216-987-2256', '216-987-2070', '216-987-4325', '216-987-2567', '216-987-6000', '216-987-6000', '216-987-6000', '216-987-0595', '216-987-6000', '216-987-6000', '216-987-6000', '216-987-2045', '216-987-2230', '216-987-2343', '216-987-2013']     
['216-987-4225', '216-987-4311', '216-987-4550', '216-987-4325', '216-987-4913', '216-987-6000', '216-987-6000', '216-987-6000', '216-987-4292', '216-987-6000', '216-987-6000', '216-987-6000', '216-987-4610', '216-987-4290', '216-987-4253', '216-987-6137']     
['216-987-5227', '216-987-5256', '216-987-5550', '216-987-4325', '216-987-5575', '216-987-6000', '216-987-6000', '216-987-6000', '216-987-5656', '216-987-6000', '216-987-6000', '216-987-6000', '216-987-5428', '216-987-5079', '216-987-5683', '216-987-5204']     
['216-987-5588', '216-987-3888', '216-987-3908', '216-987-4325', '216-987-2067', '216-987-6000', '216-987-6000', '216-987-6000', '216-987-3888', '216-987-6000', '216-987-6000', '216-987-6000', '216-987-5929', '216-987-5732', '216-987-5902', '216-987-3536']  

With bs4, as far as I can see, there is no reason to use regex at all:

from bs4 import BeautifulSoup
import requests

r = requests.get("http://catalog.tri-c.edu/about/important-phone-numbers/")
soup = BeautifulSoup(r.text, 'html.parser')


column1 = [item.text for item in soup.findAll("td", class_="column1")]

print(column1)

Output:

['216-987-2226', '216-987-2256', '216-987-2070', '216-987-4325', '216-987-2567', '216-987-6000', '216-987-6000', '216-987-6000', '216-987-0595', '216-987-6000', '216-987-6000', '216-987-6000', '216-987-2045', '216-987-2230', '216-987-2343', '216-987-2013', '216-987-3075', '216-987-3075', '216-987-3075']
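The same idea can be extended to several columns at once. This is a minimal sketch, assuming the cells carry classes column1, column2, and so on, as the answers above suggest. It runs on a small inline snippet that is a hypothetical stand-in for the real page, and the "Office A"/"Office B" labels are made up.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the catalog page; the real markup may differ.
SAMPLE_HTML = """
<table>
  <tr><td class="column0">Office A</td>
      <td class="column1">216-987-2226</td>
      <td class="column2">216-987-4225</td></tr>
  <tr><td class="column0">Office B</td>
      <td class="column1">216-987-2256</td>
      <td class="column2">216-987-4311</td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Map each class name to the list of cell texts found under it.
columns = {
    name: [td.get_text(strip=True) for td in soup.find_all("td", class_=name)]
    for name in ("column1", "column2")
}

print(columns)
```

On the real page, the tuple of class names would presumably be extended to column3 and column4.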
