I am new to Python and still don't understand all of its functionality, but I am getting close to what I am trying to achieve.
Essentially, I have got the programme to scrape the data I want from the website, but when it prints selected words/items from the "specs" string it also prints characters such as [ ] and '' from the string.
For example, I am trying to get just the 'gearbox' type, 'fuel' type and 'mileage' from a list of li's, which I have converted to a string with the plan of then selecting the specific items from that string.
What I am getting with the current programme is this:
['Manual']['Petrol']['86,863 miles']
What I would like to achieve is a printed result like this:
Manual, Petrol, 86,863 miles
When exported to my .csv, each value should then show up in its own column under the appropriate heading.
I have tried using .text to extract just the text, but that fails with the error: 'list' object has no attribute 'text'.
import csv
import requests
from bs4 import BeautifulSoup

outfile = open('pistonheads.csv','w', newline='')
writer = csv.writer(outfile)
writer.writerow(["Link", "Make", "Model", "Price", "Image Link",
                 "Gearbox", "Fuel", "Mileage"])

url = 'https://www.pistonheads.com/classifieds?Category=used-cars&Page=1&ResultsPerPage=100'

get_url = requests.get(url)
get_text = get_url.text
soup = BeautifulSoup(get_text, 'html.parser')
car_link = soup.find_all('div', 'listing-headline', 'price')

for div in car_link:
    links = div.findAll('a')
    for a in links:
        link = ("https://www.pistonheads.com" + a['href'])
        make = (a['href'].split('/')[-4])
        model = (a['href'].split('/')[-3])
        price = a.find('span').text.rstrip()
        image_link = a.parent.parent.find('img')['src']
        image = ("https:") + image_link
        vehicle_details = a.parent.parent.find('ul', class_='specs')
        specs = list(vehicle_details.stripped_strings)
        gearbox = specs[3:]
        fuel = specs[1:2]
        mileage = specs[0:1]
        writer.writerow([link, make, model, price, image, gearbox, fuel, mileage])
        print(link, make, model, price, image, gearbox, fuel, mileage)

outfile.close()
Welcome to StackOverflow!
There's a lot that could be improved in your script, but you are getting there!
specs = list(vehicle_details.stripped_strings) turns the stripped_strings generator into a list, so you can access the items you want by index. For example, mileage can simply be specs[0].
The [ and ] are caused by your use of slicing, as in mileage = specs[0:1]. From the documentation: indexing returns an item, slicing returns a new list. See the lists introduction.
So index the list instead of slicing it:
mileage = specs[0]
Or unpack every item in one go:
mileage, fuel, _, gearbox = specs
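To make the difference between slicing and indexing concrete, here is a minimal sketch. The specs list is a hypothetical stand-in for what the scraper returns (the third value in particular is invented), and the CSV row is written to an in-memory buffer rather than to pistonheads.csv.

```python
import csv
import io

# Hypothetical stand-in for list(vehicle_details.stripped_strings);
# the third value is invented purely to fill the slot that `_` skips.
specs = ['86,863 miles', 'Petrol', '170 bhp', 'Manual']

sliced = specs[0:1]   # slicing returns a new list: ['86,863 miles']
indexed = specs[0]    # indexing returns the item: '86,863 miles'

# Unpacking assigns every item by position; `_` marks a value we ignore.
mileage, fuel, _, gearbox = specs

line = ', '.join([gearbox, fuel, mileage])
print(line)  # Manual, Petrol, 86,863 miles

# Write to an in-memory buffer instead of a real file for this demo.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow([gearbox, fuel, mileage])
print(buffer.getvalue())  # Manual,Petrol,"86,863 miles"
```

Note that csv.writer quotes the mileage because it contains a comma, so it still lands in a single column.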
While debugging, you can also drop into pdb to look at what specs actually contains:
import pdb; pdb.set_trace()  # temp set on one line so you can remove it easily after
# now you can interactively inspect your code
(Pdb) specs
Good luck! And enjoy Python!
If you want to get the string out of the list, you could do this:
gearbox = specs[3:][0] if specs[3:] else '-'
fuel = specs[1:2][0] if specs[1:2] else '-'
mileage = specs[0:1][0] if specs[0:1] else '-'
However, both this approach and aldnav's answer rely on the position of each item, so they can give wrong results, or even raise an error such as ValueError: not enough values to unpack.
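As an illustration of both failure modes, the sketch below uses two hypothetical specs lists (not taken from the real site): one that is too short to unpack, and one where a missing value shifts the remaining items. The pick helper is an invented name for this example only.

```python
def pick(specs, index, default='-'):
    """Return specs[index], or default when the list is too short."""
    return specs[index] if index < len(specs) else default

# Hypothetical listing that only has two of the expected four specs.
short_specs = ['86,863 miles', 'Petrol']

try:
    mileage, fuel, _, gearbox = short_specs
except ValueError as exc:
    error = str(exc)  # starts with "not enough values to unpack"

# Positional lookup with a default avoids the exception...
gearbox = pick(short_specs, 3)   # '-'
mileage = pick(short_specs, 0)   # '86,863 miles'

# ...but it cannot detect shifted positions. In this hypothetical listing
# the mileage is missing, so index 0 now holds the fuel type instead.
shifted_specs = ['Petrol', 'Manual']
wrong_mileage = pick(shifted_specs, 0)  # 'Petrol', a false result
```

Guarding the index only prevents the crash; picking each value by something other than its position is what avoids the false result.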
I usually select the parent container first, rather than selecting the child (a) and then navigating up to its parent.
# helper to get dynamic specs element
def getSpec(element, selector):
    spec = element.select_one(selector)
    return spec.nextSibling.string.strip() if spec else '-'

soup = BeautifulSoup(get_text, 'html.parser')
results = soup.find_all('div', class_="result-contain")

for car in results:
    a = car.find('a')
    if not a:
        continue
    link = ("https://www.pistonheads.com" + a['href'])
    make = (a['href'].split('/')[-4])
    model = (a['href'].split('/')[-3])
    price = a.find('span').text.rstrip()
    image_link = car.find('img')['src']
    image = ("https:") + image_link
    if not car.find('ul', class_='specs'):
        gearbox = fuel = mileage = '-'
    else:
        gearbox = getSpec(car, '.location-pin-4')
        fuel = getSpec(car, '.gas-1')
        mileage = getSpec(car, '.gauge-1')
    print(gearbox, fuel, mileage)
    writer.writerow([link, make, model, price, image, gearbox, fuel, mileage])
    #print(link, make, model, price, image, gearbox, fuel, mileage)

outfile.close()
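To see why looking specs up by selector is more robust than by position, here is a small offline sketch. The HTML is a hypothetical guess at the page markup, assuming each spec is an icon element followed directly by its text; only the class names are taken from the code above. get_spec is a slightly more defensive variant of the helper, not the original.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Hypothetical markup: the real page structure may differ.
HTML = """
<div class="result-contain">
  <ul class="specs">
    <li><i class="gauge-1"></i> 86,863 miles</li>
    <li><i class="gas-1"></i> Petrol</li>
    <li><i class="location-pin-4"></i> Manual</li>
  </ul>
</div>
<div class="result-contain">
  <ul class="specs">
    <li><i class="gauge-1"></i> 12,000 miles</li>
  </ul>
</div>
"""

def get_spec(element, selector):
    """Return the text that follows the icon matched by selector, or '-'."""
    icon = element.select_one(selector)
    sibling = icon.next_sibling if icon is not None else None
    # Only accept a plain text node directly after the icon.
    if isinstance(sibling, NavigableString) and sibling.strip():
        return sibling.strip()
    return '-'

soup = BeautifulSoup(HTML, 'html.parser')
rows = [
    (get_spec(car, '.location-pin-4'),
     get_spec(car, '.gas-1'),
     get_spec(car, '.gauge-1'))
    for car in soup.find_all('div', class_='result-contain')
]
for row in rows:
    print(', '.join(row))
```

The second listing has no gearbox or fuel entry, yet its mileage still ends up in the mileage slot because each value is found by its class rather than its index.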