
isolate 'td a' tag based on class using beautiful soup

I'd like to write the URL links from this page into a file, but there are two 'td a' tags for each row of the table. I just want the one where class="pagelink" and href="/search...", etc.

I tried the following code, hoping to pick up only the ones with "class":"pagelink", but it produced an error:

AttributeError: 'Doctype' object has no attribute 'find_all'

Can anyone help please?

import requests
from bs4 import BeautifulSoup as soup
import csv

writer.writerow(['URL', 'Reference', 'Description', 'Address'])

url = "https://www.saa.gov.uk/search/?SEARCHED=1&ST=&SEARCH_TERM=city+of+edinburgh%2C+EDINBURGH&ASSESSOR_ID=&SEARCH_TABLE=valuation_roll_cpsplit&PAGE=0&DISPLAY_COUNT=1000&TYPE_FLAG=CP&ORDER_BY=PROPERTY_ADDRESS&H_ORDER_BY=SET+DESC&ORIGINAL_SEARCH_TERM=city+of+edinburgh&DRILL_SEARCH_TERM=BOSWALL+PARKWAY%2C+EDINBURGH&DD_TOWN=EDINBURGH&DD_STREET=BOSWALL+PARKWAY#results"

response = session.get(url)                 #not used until after the iteration begins
html = soup(response.text, 'lxml')

for link in html:
    prop_link = link.find_all("td a", {"class":"pagelink"})

    writer.writerow([prop_link])

Iterating over your html variable walks the top-level children of the document, and the first of those is a Doctype object, which has no find_all method (hence the AttributeError). You need to call find_all or select on the soup object itself to find the nodes that you want.
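
To see what is going wrong, here is a minimal illustration using the same html soup as in your snippet:

for child in html:
    print(type(child))   # the first iteration prints <class 'bs4.element.Doctype'>
    break                # Doctype is a string subclass, so it has no find_all method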

Example:

import requests
from bs4 import BeautifulSoup as soup
import csv

outputfilename = 'Ed_Streets2.csv'

#inputfilename = 'Edinburgh.txt'

baseurl = 'https://www.saa.gov.uk'

outputfile = open(outputfilename, 'w', newline='')   # text mode with newline='' for csv in Python 3
writer = csv.writer(outputfile)
writer.writerow(['URL', 'Reference', 'Description', 'Address'])

session = requests.session()

url = "https://www.saa.gov.uk/search/?SEARCHED=1&ST=&SEARCH_TERM=city+of+edinburgh%2C+EDINBURGH&ASSESSOR_ID=&SEARCH_TABLE=valuation_roll_cpsplit&PAGE=0&DISPLAY_COUNT=100&TYPE_FLAG=CP&ORDER_BY=PROPERTY_ADDRESS&H_ORDER_BY=SET+DESC&ORIGINAL_SEARCH_TERM=city+of+edinburgh&DRILL_SEARCH_TERM=BOSWALL+PARKWAY%2C+EDINBURGH&DD_TOWN=EDINBURGH&DD_STREET=BOSWALL+PARKWAY#results"

response = session.get(url)              
html = soup(response.text, 'lxml')

prop_link = html.find_all("a", class_="pagelink button small")

for link in prop_link:
    prop_url = baseurl+(link["href"])
    print(prop_url)
    writer.writerow([prop_url, "", "", ""])
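
As an aside, select with a CSS selector matches the 'td a' phrasing from the question directly. A minimal sketch using the same html soup, writer and baseurl as above (the selector is an assumption based on the class described in the question):

# a.pagelink matches any anchor whose class list contains "pagelink",
# including the "pagelink button small" anchors found above
for link in html.select('td a.pagelink'):
    writer.writerow([baseurl + link['href'], '', '', ''])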

Try this.
You need to look for the links before starting the loop.

import requests
from bs4 import BeautifulSoup as soup
import csv

outputfile = open('output.csv', 'w', newline='')   # the filename is an assumption; the original snippet never opened a file
writer = csv.writer(outputfile)
writer.writerow(['URL', 'Reference', 'Description', 'Address'])

url = "https://www.saa.gov.uk/search/?SEARCHED=1&ST=&SEARCH_TERM=city+of+edinburgh%2C+EDINBURGH&ASSESSOR_ID=&SEARCH_TABLE=valuation_roll_cpsplit&PAGE=0&DISPLAY_COUNT=1000&TYPE_FLAG=CP&ORDER_BY=PROPERTY_ADDRESS&H_ORDER_BY=SET+DESC&ORIGINAL_SEARCH_TERM=city+of+edinburgh&DRILL_SEARCH_TERM=BOSWALL+PARKWAY%2C+EDINBURGH&DD_TOWN=EDINBURGH&DD_STREET=BOSWALL+PARKWAY#results"

response = requests.get(url)
html = soup(response.text, 'lxml')

prop_link = html.find_all("a", {"class":"pagelink button small"})

for link in prop_link:
    if link.has_attr("href"):   # find_all only returns Tag objects, never None
        wr = link["href"]
        writer.writerow([wr])
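
Note that this writes the relative href values to the file. To write absolute URLs like the first answer does, each href can be joined against the site root; a minimal sketch, assuming the same writer and the https://www.saa.gov.uk base from the first answer:

from urllib.parse import urljoin

for link in prop_link:
    if link.has_attr("href"):
        # urljoin handles both relative paths and already-absolute URLs
        writer.writerow([urljoin("https://www.saa.gov.uk", link["href"])])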
