Web Scraping links

I am working on scraping links from a Christmas tree farm website. First, I used this tutorial method to get all the links. Then I noticed that the links I wanted were relative and did not start with the scheme and host (http://...), so I created a variable to prepend to them. Now I am trying to write an if statement that checks each link for any two characters followed by "xmastrees.php". If it matches, I want to add my concatenate variable to the front of it. If the link does not contain that text, it should be dropped. For example, NYxmastrees.php should become http://www.pickyourownchristmastree.org/NYxmastrees.php, and ../disclaimer.htm should be removed. I've tried multiple ways but can't seem to find the right one.

Here is what I currently have. I keep running into a syntax error at del. When I comment out that line, I get another error saying my string object has no attribute 're'. This confuses me because I thought I could use regex with strings.

source = requests.get('http://www.pickyourownchristmastree.org/').text
soup = BeautifulSoup(source, 'lxml')
concatenate = 'http://www.pickyourownchristmastree.org/'

find_state_group = soup.find('div', class_ = 'alert')
for link in find_state_group.find_all('a', href=True):
    if link['href'].re.search('^.\B.\$xmastrees'):
        states = concatenate + link
    else del link['href']
    print(link['href']

Error with else del link['href'] :

    else del link['href']
           ^
SyntaxError: invalid syntax

Error without else del link['href'] :

    if link['href'].re.search('^.\B.\$xmastrees'):
AttributeError: 'str' object has no attribute 're'

You can try using:

import requests
from bs4 import BeautifulSoup as bs

u = "http://www.pickyourownchristmastree.org/"
soup = bs(requests.get(u).text, 'html5lib')

find_state_group = soup.find('div', {"class": 'alert'})
for link in find_state_group.find_all('a', href=True):
    if "mastrees" in link['href']:
        states = u + link['href']
        print(states)

http://www.pickyourownchristmastree.org/ALxmastrees.php
http://www.pickyourownchristmastree.org/AZxmastrees.php
http://www.pickyourownchristmastree.org/AKxmastrees.php
...
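
If you do want the regex check from the question, here is a minimal sketch of one way to write it. The two errors come from the fact that `re` is a module you call as `re.search(pattern, string)` rather than an attribute of `str`, and that `else` needs a colon with its body on the next line. The names `BASE_URL`, `STATE_PAGE` and `state_links` are made up for this example, and the pattern assumes the wanted hrefs are exactly two characters followed by "xmastrees.php". It runs on a hard-coded list so it does not need network access.

```python
import re
from urllib.parse import urljoin

BASE_URL = "http://www.pickyourownchristmastree.org/"

# Any two characters followed by the literal text "xmastrees.php".
# The dot before "php" is escaped so it only matches a real dot.
STATE_PAGE = re.compile(r".{2}xmastrees\.php")


def state_links(hrefs, base=BASE_URL):
    """Return absolute URLs for hrefs that match STATE_PAGE; skip the rest."""
    # fullmatch requires the whole href to match, so "../disclaimer.htm"
    # and similar links are simply left out instead of being deleted.
    return [urljoin(base, href) for href in hrefs if STATE_PAGE.fullmatch(href)]


sample = ["NYxmastrees.php", "../disclaimer.htm", "ALxmastrees.php"]
for url in state_links(sample):
    print(url)
```

With the soup from the code above, you could pass the real hrefs in with something like `state_links(a['href'] for a in find_state_group.find_all('a', href=True))`. Building a new list of matches is usually simpler than deleting attributes from the tags while looping over them.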
