How do I parse a string for specific words or digits and display them if found?

I have written some fairly questionable code, I'm sure, but it seems to do the job. The issue is that it writes the data to a spreadsheet, and the column where I expect the vehicle's year always shows the first word of the advert. If that word isn't the year, the column ends up holding something else, such as the manufacturer.

Essentially, I want to set up if statements so that when the vehicle year isn't the first word but appears somewhere else in the string, the code still finds it and writes it to my .csv.

Also, I have been struggling for a while to parse multiple pages and was hoping that someone here could help with that too. The URL has page=2 etc. in it, but I am not able to loop through all of the URLs and get the data from every page. At the moment, everything I have tried only processes the first page. As you may have guessed, I am fairly new to Python.

import csv
import requests

from bs4 import BeautifulSoup

outfile = open('carandclassic-new.csv','w', newline='', encoding='utf-8')
writer = csv.writer(outfile)
writer.writerow(["Link", "Title", "Year", "Make", "Model", "Variant", "Image"])

url = 'https://www.carandclassic.co.uk/cat/3/?page=2'

get_url = requests.get(url)

get_text = get_url.text

soup = BeautifulSoup(get_text, 'html.parser')


car_link = soup.find_all('div', 'titleAndText', 'image')


for div in car_link:
    links = div.findAll('a')
    for a in links:
        link = ("https://www.carandclassic.co.uk" + a['href'])
        title = (a.text.strip())
        year = (title.split(' ', 1)[0])
        make = (title.split(' ', 2)[1])
        model = (title.split(' ', 3)[2])
        date = "\d"
        for line in title:
        yom = title.split()
        if yom[0] == "\d":
            yom[0] = (title.split(' ', 1)[0])
        else:
            yom = title.date

        writer.writerow([link, title, year, make, model])
        print(link, title, year, make, model)



outfile.close()

Could someone please help me with this? I realise that the if statements at the bottom may be way off.

The code successfully gets the first word from the string; it is just a shame that, the way the data is structured, that word isn't always the vehicle's year of manufacture (yom).

Comment: "1978 Full restored Datsun 280Z" becomes '1978', '1978', '280Z' rather than '1978', 'Datsun', '280Z'.

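To improve the year validation, use the re module:

import re

# Only search the title if the first word is not already a 4-digit year
if not (len(year) == 4 and year.isdigit()):
    # Raw string so that \d is passed to the regex engine unchanged
    match = re.findall(r'\d{4}', title)
    if match:
        for item in match:
            if int(item) in range(1900, 2010):
                # Assume year
                year = item
                break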
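The same check can be wrapped in a small helper, which makes it easier to try against sample titles. This is only a sketch: the name find_year, the word-boundary pattern, and the earliest/latest bounds (chosen to mirror range(1900, 2010) above) are assumptions, not part of the original answer.

```python
import re

# Four digits starting with 19 or 20 that stand alone as a token,
# so the digits inside a model name such as "280SL" are not considered.
YEAR_RE = re.compile(r"\b(?:19|20)\d{2}\b")


def find_year(title, earliest=1900, latest=2009):
    """Return the first plausible year found in title as a string, or None."""
    for match in YEAR_RE.finditer(title):
        if earliest <= int(match.group()) <= latest:
            return match.group()
    return None
```

A model name that looks like a year is still a problem: for a title such as 'BMW 2002 tii 1974' this helper returns the model number, because it is the first match within the bounds.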

The output becomes:

 '1978 Full restored Datsun 280Z', '1978', 'Full', '280Z' 

For the false result make='Full', you have two options.

  1. Stop word list
    Build a stop word list with terms like ['full', 'restored', etc.] and loop over title_items to find the first item that is not in the stop word list.

  2. Maker list
    Build a maker list like ['Mercedes', 'Datsun', etc.] and loop over title_items to find the first matching item.
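Both options could be sketched roughly as follows. The function names and the contents of STOP_WORDS and KNOWN_MAKES are hypothetical and deliberately short; a real scraper would need far longer lists.

```python
# Hypothetical example lists, far from complete.
STOP_WORDS = {"full", "fully", "restored", "rare", "classic"}
KNOWN_MAKES = {"datsun", "mercedes", "ford"}


def find_make_by_stop_words(title_items, year=None):
    """Option 1: first item that is neither the year nor a stop word."""
    for item in title_items:
        if item != year and item.lower() not in STOP_WORDS:
            return item
    return None


def find_make_by_maker_list(title_items):
    """Option 2: first item that appears in the maker list."""
    for item in title_items:
        if item.lower() in KNOWN_MAKES:
            return item
    return None
```

The stop word approach returns whatever unknown word comes first, so it can still produce a wrong make. The maker list approach returns None for any manufacturer missing from the list.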


Question: find the vehicle's year if the first word in the advert isn't the year


  • Sample titles used:

      # Simulate the html <a> element with a minimal stand-in class
      class Element():
          def __init__(self, text):
              self.text = text

      for a in [Element('Mercedes Benz 280SL 1980 Cabriolet in beautiful condition'),
                Element('1964 Mercedes Benz 220SEb Saloon Manual RHD')]:
  • Get the title from the <a> element and split it on blanks.

          title = a.text.strip()
          title_items = title.split()
  • Defaults are title_items at index 0, 1, 2.

          # Default
          year = title_items[0]
          make = title_items[1]
          model = title_items[2]
  • Verify that year meets the condition of 4 digits.

          # Verify 'year'
          if not (len(year) == 4 and year.isdigit()):
  • Loop over all items in title_items and break once the condition is met.

              # Test all items
              for item in title_items:
                  if len(item) == 4 and item.isdigit():
                      # Assume year
                      year = item
                      break
  • Change the assumption: title_items at index 0, 1 are now make and model.

              make = title_items[0]
              model = title_items[1]
  • Check if model starts with a digit.

    Note: This will fail if a model does not meet this criterion!

          # Condition: model has to start with a digit
          if not model[0].isdigit():
              for item in title_items:
                  if item[0].isdigit() and not item == year:
                      model = item

          print('{}'.format([title, year, make, model]))

Output:

 ['Mercedes Benz 280SL 1980 Cabriolet in beautiful condition', '1980', 'Mercedes', '280SL']
 ['1964 Mercedes Benz 220SEb Saloon Manual RHD', '1964', 'Mercedes', '220SEb']

Tested with Python: 3.4.2
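The question also asks about reading more than the first page, which neither answer covers. One possible approach, sketched below, is to build the URL for each page number and stop when a page yields no adverts. The names page_url, scrape_all_pages, fetch_rows, and max_pages are hypothetical. Whether the site really returns an empty listing past its last page is an assumption that has not been checked.

```python
from urllib.parse import urlencode

# Listing URL taken from the question, without the page parameter.
BASE_URL = "https://www.carandclassic.co.uk/cat/3/"


def page_url(page):
    """Build the listing URL for one page number."""
    return BASE_URL + "?" + urlencode({"page": page})


def scrape_all_pages(fetch_rows, max_pages=50):
    """Collect rows page by page until a page comes back empty.

    fetch_rows is a caller-supplied function (hypothetical) that takes a URL,
    downloads and parses that page, and returns a list of rows for it.
    max_pages is a safety limit so the loop always terminates.
    """
    rows = []
    for page in range(1, max_pages + 1):
        page_rows = fetch_rows(page_url(page))
        if not page_rows:
            break  # assumed to mean we are past the last page
        rows.extend(page_rows)
    return rows
```

In the question's script, fetch_rows would wrap the requests.get and BeautifulSoup steps and return one [link, title, year, make, model] list per advert, which can then be passed to writer.writerows.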
