I'm sure I have written some fairly questionable code, but it seems to do the job. The issue is that it prints the data to a spreadsheet, and in the column where I am hoping to find the vehicle's year, if the first word in the advert isn't the year, it displays the first word instead, which could be the manufacturer.
Essentially, I want to set up if statements so that if the vehicle's year isn't the first word but is somewhere else in the string, the code still finds it and prints it to my .csv.
Also, I have been struggling for a while to parse multiple pages and was hoping that someone here could help with that too. The URL has page=2 etc. in it, but I am not able to get it to go through all the URLs and get the data on all pages. At the moment everything I have tried only does the first page. As you may have guessed, I am fairly new to Python.
    import csv
    import requests
    from bs4 import BeautifulSoup

    outfile = open('carandclassic-new.csv','w', newline='', encoding='utf-8')
    writer = csv.writer(outfile)
    writer.writerow(["Link", "Title", "Year", "Make", "Model", "Variant", "Image"])

    url = 'https://www.carandclassic.co.uk/cat/3/?page=2'
    get_url = requests.get(url)
    get_text = get_url.text
    soup = BeautifulSoup(get_text, 'html.parser')

    car_link = soup.find_all('div', 'titleAndText', 'image')

    for div in car_link:
        links = div.findAll('a')
        for a in links:
            link = ("https://www.carandclassic.co.uk" + a['href'])
            title = (a.text.strip())
            year = (title.split(' ', 1)[0])
            make = (title.split(' ', 2)[1])
            model = (title.split(' ', 3)[2])
            date = "\d"
            for line in title:
                yom = title.split()
                if yom[0] == "\d":
                    yom[0] = (title.split(' ', 1)[0])
                else:
                    yom = title.date
            writer.writerow([link, title, year, make, model])
            print(link, title, year, make, model)

    outfile.close()
Please could someone help me with this? I realise that the if statements at the bottom may be way off.
The code successfully gets the first word from the string; it is just a shame that, the way the data is structured, it isn't always the vehicle's year of manufacture (yom).
Comment: "1978 Full restored Datsun 280Z" becomes '1978' '1978' '280Z' rather than '1978' 'Datsun' '280z'.
To improve the year validation, change to use the re module:
    import re

    if not (len(year) == 4 and year.isdigit()):
        # Raw string so that \d is passed to the regex engine unchanged
        match = re.findall(r'\d{4}', title)
        if match:
            for item in match:
                if int(item) in range(1900, 2010):
                    # Assume year
                    year = item
                    break
The output becomes:
'1978 Full restored Datsun 280Z', '1978', 'Full', '280Z'
About the false result make='Full', you have two options.
Stop word list: Build a stop word list with terms like ['full', 'restored', etc.] and loop over the title_items to find the first item not in the stop word list.
Maker list: Build a maker list like ['Mercedes', 'Datsun', etc.] and loop over the title_items to find the first matching item.
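Both options can be sketched in a few lines. This is only an illustration: the contents of `STOP_WORDS` and `MAKERS` and the helper names `make_by_stop_words` and `make_by_maker_list` are made up for the example, and real adverts would need much longer lists.

```python
# Hypothetical word lists; extend them for real adverts.
STOP_WORDS = {'full', 'fully', 'restored', 'rare', 'superb'}
MAKERS = {'mercedes', 'datsun', 'ford', 'triumph'}


def make_by_stop_words(title_items, year):
    """Return the first item that is neither the year nor a stop word."""
    for item in title_items:
        if item != year and item.lower() not in STOP_WORDS:
            return item
    return None


def make_by_maker_list(title_items):
    """Return the first item that appears in the maker list."""
    for item in title_items:
        if item.lower() in MAKERS:
            return item
    return None


title_items = '1978 Full restored Datsun 280Z'.split()
print(make_by_stop_words(title_items, '1978'))  # Datsun
print(make_by_maker_list(title_items))          # Datsun
```

The maker list copes better with adjectives you did not anticipate, while the stop word list copes better with makers you did not anticipate. Neither version handles makers that span two words.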
Question: find the vehicle's year if the first word in the advert isn't the year.
Used built-in and module:
Sample titles used:
    # Simulating html Element
    class Element():
        def __init__(self, text):
            self.text = text

    for a in [Element('Mercedes Benz 280SL 1980 Cabriolet in beautiful condition'),
              Element('1964 Mercedes Benz 220SEb Saloon Manual RHD')]:
        # the steps below form the body of this loop
Get the title from the <a> element and split it on blanks.
        title = a.text.strip()
        title_items = title.split()
The defaults are title_items at index 0, 1, 2.
        # Default
        year = title_items[0]
        make = title_items[1]
        model = title_items[2]
Verify that year meets the condition of being 4 digits.
        # Verify 'year'
        if not (len(year) == 4 and year.isdigit()):
If it does not, loop over all items in title_items and break once the condition is met.
            # Test all items
            for item in title_items:
                if len(item) == 4 and item.isdigit():
                    # Assume year
                    year = item
                    break
Because the year was not the first word, change the assumption: title_items at index 0 and 1 are now make and model.
            make = title_items[0]
            model = title_items[1]
Check whether model starts with a digit.
Note: this will fail if a model does not meet this criterion!
        # Condition: Model has to start with a digit
        if not model[0].isdigit():
            for item in title_items:
                if item[0].isdigit() and not item == year:
                    model = item

        print('{}'.format([title, year, make, model]))
Output:
    ['Mercedes Benz 280SL 1980 Cabriolet in beautiful condition', '1980', 'Mercedes', '280SL']
    ['1964 Mercedes Benz 220SEb Saloon Manual RHD', '1964', 'Mercedes', '220SEb']
Tested with Python: 3.4.2
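For convenience, the steps above can be collected into one function. This is a sketch that follows the same logic; the name `parse_title` is not from the original code, and it assumes the title has at least three words.

```python
def parse_title(title):
    """Split an advert title into (year, make, model) using the steps above."""
    title_items = title.split()

    # Default: year, make and model are the first three words.
    year, make, model = title_items[0], title_items[1], title_items[2]

    # Verify 'year': it has to be exactly 4 digits.
    if not (len(year) == 4 and year.isdigit()):
        # Test all items and take the first one that looks like a year.
        # If none is found, 'year' keeps the first word.
        for item in title_items:
            if len(item) == 4 and item.isdigit():
                year = item
                break
        # The year was not the first word, so make and model move up.
        make, model = title_items[0], title_items[1]

    # Condition: the model has to start with a digit.
    # As in the steps above there is no break, so the last match wins.
    if not model[0].isdigit():
        for item in title_items:
            if item[0].isdigit() and item != year:
                model = item

    return year, make, model


for title in ['Mercedes Benz 280SL 1980 Cabriolet in beautiful condition',
              '1964 Mercedes Benz 220SEb Saloon Manual RHD']:
    print([title] + list(parse_title(title)))
```

Inside the scraping loop this would be called as `year, make, model = parse_title(title)` before the row is written.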
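On the second part of the question, parsing multiple pages: the URL in the question ends in `?page=2`, so one approach is to build the URL for each page number in a loop and stop when a page has no adverts. The sketch below assumes that the first page is reachable as `?page=1` and that pages past the last one contain no `titleAndText` blocks; neither has been checked against the site. The names `page_url`, `scrape_all`, `fetch_rows` and `max_pages` are made up here. The fetching step is passed in as a function so that the loop itself can be tried without network access.

```python
BASE_URL = 'https://www.carandclassic.co.uk/cat/3/?page={}'


def page_url(page):
    """Build the listing URL for one page number."""
    return BASE_URL.format(page)


def scrape_all(fetch_rows, max_pages=50):
    """Collect rows from page 1, 2, 3, ... until a page yields nothing.

    fetch_rows(url) must return a list of rows for that page; max_pages
    is a safety limit so that the loop always terminates.
    """
    rows = []
    for page in range(1, max_pages + 1):
        page_rows = fetch_rows(page_url(page))
        if not page_rows:
            break  # assumed to mean we are past the last page
        rows.extend(page_rows)
    return rows


def fetch_rows(url):
    """Download one page and return [link, title] rows (needs network)."""
    # Imported here so the rest of the sketch runs without these packages.
    import requests
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(requests.get(url).text, 'html.parser')
    rows = []
    for div in soup.find_all('div', 'titleAndText'):
        for a in div.find_all('a', href=True):
            rows.append(['https://www.carandclassic.co.uk' + a['href'],
                         a.text.strip()])
    return rows


# Usage: rows = scrape_all(fetch_rows), then writer.writerows(rows)
```

Keeping the page loop separate from the download makes it easy to swap in a different stop condition, for example a fixed page count, if the site turns out to behave differently.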