
Remove unwanted characters from string with BeautifulSoup Python when selecting words from string

I am new to Python and still don't understand all of it and its functionality but I am getting close to what I am trying to achieve.

Essentially I have got the programme to scrape the data I want from the website but when it is printing selected words/items from the "specs" string it is also printing characters such as [ ] and '' from the string.

For example, I am trying to get just the 'gearbox' type, 'fuel' type and 'mileage' from a list of li elements, which I have converted to a list of strings with the plan to then select the specific items from it.

What I am getting with the current programme is this:

['Manual']['Petrol']['86,863 miles']

What I would like to achieve is a printed result like this:

Manual, Petrol, 86,863 miles

Which when exported to separate columns in my .csv should show up in their correct columns under the appropriate headings.

I have tried .text to extract just the text, but it fails with the error: 'list' object has no attribute 'text'.


import csv 

import requests
from bs4 import BeautifulSoup

outfile = open('pistonheads.csv','w', newline='')
writer = csv.writer(outfile)
writer.writerow(["Link", "Make", "Model", "Price", "Image Link", 
"Gearbox", "Fuel", "Mileage"])

url = 'https://www.pistonheads.com/classifieds?Category=used-cars&Page=1&ResultsPerPage=100'

get_url = requests.get(url)
get_text = get_url.text

soup = BeautifulSoup(get_text, 'html.parser')
car_link = soup.find_all('div', 'listing-headline', 'price')

for div in car_link:
    links = div.findAll('a')
    for a in links:
        link = ("https://www.pistonheads.com" + a['href'])
        make = (a['href'].split('/')[-4])
        model = (a['href'].split('/')[-3])
        price = a.find('span').text.rstrip()
        image_link = a.parent.parent.find('img')['src']
        image = ("https:") + image_link
        vehicle_details = a.parent.parent.find('ul', class_='specs')
        specs = list(vehicle_details.stripped_strings)
        gearbox = specs[3:]
        fuel = specs[1:2]
        mileage = specs[0:1]
        writer.writerow([link, make, model, price, image, gearbox, fuel, mileage])
        print(link, make, model, price, image, gearbox, fuel, mileage)

outfile.close()

Welcome to StackOverflow!

There's a fair amount that could be improved in your script, but you are getting there!

  • specs = list(vehicle_details.stripped_strings) turns a generator into a list, so you can access the items you want by index. For example, mileage can simply be specs[0].
  • The extra [ and ] come from slicing, as in mileage = specs[0:1]. Per the documentation, indexing returns an item, while slicing returns a new list. See the introduction to lists.
  • (Optional) Finally, to get all of that information in a single line, you can use multiple assignment from the specs list. See multiple assignments.
mileage, fuel, _, gearbox = specs
  • Bonus tip: when in doubt, use pdb.
mileage = specs[0]
import pdb; pdb.set_trace()  # temp set on one line so you can remove it easily after
# now you can interactively inspect your code
(Pdb) specs
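
To make the difference between indexing and slicing concrete, here is a minimal sketch. The specs list is hypothetical: it only mimics the shape of list(vehicle_details.stripped_strings), and the third item ('197 bhp') is an assumed placeholder, since the question does not show what sits at that position.

```python
# Hypothetical specs list, shaped like list(vehicle_details.stripped_strings).
# The third item is an assumed placeholder, not taken from the real page.
specs = ['86,863 miles', 'Petrol', '197 bhp', 'Manual']

# Slicing returns a new list, so printing it shows the brackets and quotes.
mileage_slice = specs[0:1]
print(mileage_slice)   # ['86,863 miles']

# Indexing returns the item itself, which is a plain string.
mileage = specs[0]
print(mileage)         # 86,863 miles

# Multiple assignment unpacks all four items at once; _ discards the third.
mileage, fuel, _, gearbox = specs
print(', '.join([gearbox, fuel, mileage]))   # Manual, Petrol, 86,863 miles
```

Because each variable is now a plain string, passing them to writer.writerow should put the bare values in their own CSV columns.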

Good luck! And enjoy Python!

If you want to get the string out of the list, you could do this:

gearbox = specs[3:][0] if specs[3:] else '-'
fuel = specs[1:2][0]  if specs[1:2] else '-'
mileage = specs[0:1][0]  if specs[0:1] else '-' 

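However, both this approach and aldnav's answer rely on position, so they can give wrong results when a listing is missing a spec, and the unpacking in aldnav's answer can even raise an error: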

ValueError: not enough values to unpack
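
The failure is easy to reproduce with a listing that has fewer specs than expected. The sketch below uses a hypothetical three-item list and a hypothetical helper, pad_specs, which is not part of either answer. Padding only avoids the exception when the missing spec is the last one; if a spec in the middle is absent, the remaining values still shift into the wrong columns.

```python
def pad_specs(specs, size=4, filler='-'):
    """Return the first `size` items of specs, padded with `filler` if shorter."""
    padded = list(specs[:size])
    padded += [filler] * (size - len(padded))
    return padded

# Hypothetical listing that is missing its last spec (the gearbox).
short_specs = ['86,863 miles', 'Petrol', '197 bhp']

try:
    mileage, fuel, _, gearbox = short_specs
except ValueError as exc:
    error_message = str(exc)
    print(error_message)   # not enough values to unpack (expected 4, got 3)

# With padding, the unpacking succeeds and the missing value becomes '-'.
mileage, fuel, _, gearbox = pad_specs(short_specs)
print(gearbox, fuel, mileage)   # - Petrol 86,863 miles
```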

I usually select the parent container first, rather than selecting the child (a) and then navigating back up to the parent.

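# helper: return the text that follows a spec icon, or '-' if the icon is missing
def getSpec(element, selector):
    spec = element.select_one(selector)
    # next_sibling is the current name for the older nextSibling alias
    return spec.next_sibling.string.strip() if spec else '-'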

soup = BeautifulSoup(get_text, 'html.parser')
results = soup.find_all('div', class_="result-contain")

for car in results:
    a = car.find('a')
    if not a:
        continue
    link = ("https://www.pistonheads.com" + a['href'])
    make = (a['href'].split('/')[-4])
    model = (a['href'].split('/')[-3])
    price = a.find('span').text.rstrip()
    image_link = car.find('img')['src']
    image = ("https:") + image_link

    if not car.find('ul', class_='specs'):
        gearbox = fuel = mileage = '-'
    else:
        gearbox = getSpec(car, '.location-pin-4')
        fuel = getSpec(car, '.gas-1')
        mileage = getSpec(car, '.gauge-1')
    print(gearbox, fuel, mileage)
    writer.writerow([link, make, model, price, image, gearbox, fuel, mileage])
    #print(link, make, model, price, image, gearbox, fuel, mileage)

outfile.close()
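
To try the icon-based lookup without hitting the site, here is a small self-contained sketch that runs against inline HTML. The markup is an assumption: it reuses the class names from the selectors above, but the real page structure may differ. get_spec is a hypothetical variant of getSpec that also guards against an icon with nothing after it. It needs the beautifulsoup4 package installed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: class names mirror the selectors used above,
# but the real page structure may differ.
html = """
<div class="result-contain">
  <ul class="specs">
    <li><i class="gauge-1"></i> 86,863 miles</li>
    <li><i class="gas-1"></i> Petrol</li>
  </ul>
</div>
"""

def get_spec(element, selector, default='-'):
    """Return the text right after the icon matched by selector, or default."""
    icon = element.select_one(selector)
    if icon is None or icon.next_sibling is None:
        return default
    return str(icon.next_sibling).strip() or default

car = BeautifulSoup(html, 'html.parser').find('div', class_='result-contain')
mileage = get_spec(car, '.gauge-1')
fuel = get_spec(car, '.gas-1')
gearbox = get_spec(car, '.location-pin-4')   # icon absent in this listing
print(gearbox, fuel, mileage)   # - Petrol 86,863 miles
```

Because each value is looked up by its icon class rather than by position, a missing spec only affects its own column.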
