Removing unwanted characters from a string with BeautifulSoup (Python) when selecting words from the string

Question

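I'm new to Python and still don't understand everything it can do, but I'm close to what I'm trying to achieve.

Essentially, my program scrapes the data I want from a website, but when it prints the selected words/items from the 'specs' string, it also prints characters such as [] and '' from that string.

In this example I'm trying to get the 'gearbox' type, 'fuel' type and 'mileage' from a list of li elements, which I converted to a string with the plan of then selecting specific items from it.

What my current program gives me is: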

['Manual'] ['Petrol'] ['86,863 miles']

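What I want to achieve is a printout like this:

Manual, Petrol, 86,863 miles

so that when it is exported to the .csv, each value appears in its own column under the corresponding header.

I tried .text to keep only the text, but that fails with a 'list' object has no attribute 'text' error.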


import csv 

import requests
from bs4 import BeautifulSoup

outfile = open('pistonheads.csv','w', newline='')
writer = csv.writer(outfile)
writer.writerow(["Link", "Make", "Model", "Price", "Image Link", 
"Gearbox", "Fuel", "Mileage"])

url = 'https://www.pistonheads.com/classifieds?Category=used-cars&Page=1&ResultsPerPage=100'

get_url = requests.get(url)
get_text = get_url.text

soup = BeautifulSoup(get_text, 'html.parser')
car_link = soup.find_all('div', 'listing-headline', 'price')

for div in car_link:
    links = div.findAll('a')
    for a in links:
        link = ("https://www.pistonheads.com" + a['href'])
        make = (a['href'].split('/')[-4])
        model = (a['href'].split('/')[-3])
        price = a.find('span').text.rstrip()
        image_link = a.parent.parent.find('img')['src']
        image = ("https:") + image_link
        vehicle_details = a.parent.parent.find('ul', class_='specs')
        specs = list(vehicle_details.stripped_strings)
        gearbox = specs[3:]
        fuel = specs[1:2]
        mileage = specs[0:1]
        writer.writerow([link, make, model, price, image, gearbox, fuel, mileage])
        print(link, make, model, price, image, gearbox, fuel, mileage)

outfile.close()

Answer 1

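Welcome to StackOverflow!

There is some room for improvement in your script, but you're nearly there!

specs = list(vehicle_details.stripped_strings) takes a generator and turns it into a list, so you can get what you need by index. For example, mileage can simply be specs[0].
The extra [ and ] appear because you use slicing: mileage = specs[0:1]. Per the documentation, indexing returns an item, while slicing returns a new list. See the introduction to lists in the Python tutorial.
(Optional) Finally, to get all of these values in one line, you can use multiple assignment from the specs list. See multiple assignment.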

mileage, fuel, _, gearbox = specs
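
To make the difference concrete, here is a small sketch that uses a hard-coded specs list in place of the scraped data. The values, in particular the third item, are assumptions for illustration only. It compares slicing with indexing and shows what ends up in a CSV row after unpacking.

```python
import csv
import io

# Stand-in for list(vehicle_details.stripped_strings); the values are made up.
specs = ['86,863 miles', 'Petrol', '148 bhp', 'Manual']

sliced = specs[0:1]   # a slice always returns a list
indexed = specs[0]    # an index returns the item itself

print(sliced)   # ['86,863 miles']
print(indexed)  # 86,863 miles

# Unpack all four items at once; `_` discards the one that is not needed.
mileage, fuel, _, gearbox = specs

# Write one row to an in-memory buffer to see what lands in the CSV.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow([gearbox, fuel, mileage])
csv_line = buffer.getvalue().strip()
print(csv_line)  # Manual,Petrol,"86,863 miles"
```

The mileage is quoted in the CSV output only because it contains a comma; it still occupies a single column.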

Bonus tip: when in doubt, use pdb.

mileage = specs[0]
import pdb; pdb.set_trace()  # temp set on one line so you can remove it easily after
# now you can interactively inspect your code
(Pdb) specs

Good luck! Enjoy Python!

Answer 2

If you want to get the string out of the list, you could do this:

gearbox = specs[3:][0] if specs[3:] else '-'
fuel = specs[1:2][0] if specs[1:2] else '-'
mileage = specs[0:1][0] if specs[0:1] else '-'

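But this approach, like aldnav's answer, can give wrong results or even raise an error, for example when a listing has fewer spec items than expected:

ValueError: not enough values to unpack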
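
The failure is easy to reproduce without scraping anything. The sketch below uses a made-up two-item list and a hypothetical helper, pick_specs, that pads the list so unpacking cannot fail. Neither the helper nor the sample data comes from the original answers.

```python
def pick_specs(specs, expected=4, default='-'):
    """Pad or trim specs to exactly `expected` items so unpacking is safe."""
    return list(specs[:expected]) + [default] * (expected - len(specs))

# Hypothetical listing that is missing two of its details.
short_specs = ['86,863 miles', 'Petrol']

error_message = None
try:
    mileage, fuel, _, gearbox = short_specs
except ValueError as exc:
    error_message = str(exc)
    print(error_message)

# With padding, the same unpacking succeeds and missing values become '-'.
mileage, fuel, _, gearbox = pick_specs(short_specs)
print(mileage, fuel, gearbox)
```

Padding only prevents the exception. If an item is missing from the middle of the list, the remaining values still shift into the wrong positions, so position-based access can silently mislabel data.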

Usually I extract the parent container first, rather than selecting a child (a) and then walking back up to its parent.

# helper: look up one spec by the CSS class of its icon
def getSpec(element, selector):
    spec = element.select_one(selector)
    # the value is the text node right after the icon; '-' if the icon is missing
    return spec.next_sibling.string.strip() if spec else '-'

soup = BeautifulSoup(get_text, 'html.parser')
results = soup.find_all('div', class_="result-contain")

for car in results:
    a = car.find('a')
    if not a:
        continue
    link = ("https://www.pistonheads.com" + a['href'])
    make = (a['href'].split('/')[-4])
    model = (a['href'].split('/')[-3])
    price = a.find('span').text.rstrip()
    image_link = car.find('img')['src']
    image = ("https:") + image_link

    if not car.find('ul', class_='specs'):
        gearbox = fuel = mileage = '-'
    else:
        gearbox = getSpec(car, '.location-pin-4')
        fuel = getSpec(car, '.gas-1')
        mileage = getSpec(car, '.gauge-1')
    print(gearbox, fuel, mileage)
    writer.writerow([link, make, model, price, image, gearbox, fuel, mileage])
    #print(link, make, model, price, image, gearbox, fuel, mileage)

outfile.close()
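
The parent-first approach can be tried offline against a small piece of markup. The HTML below is invented to imitate the structure the code above relies on (an icon element followed by a text node), and the real page may differ. The helper get_spec is a slightly more defensive variant written for this sketch, not the answer's exact function.

```python
from bs4 import BeautifulSoup

# Made-up markup imitating two listings; the second lacks most specs.
html = """
<div class="result-contain">
  <ul class="specs">
    <li><i class="gauge-1"></i> 86,863 miles</li>
    <li><i class="gas-1"></i> Petrol</li>
    <li><i class="location-pin-4"></i> Manual</li>
  </ul>
</div>
<div class="result-contain">
  <ul class="specs">
    <li><i class="gas-1"></i> Diesel</li>
  </ul>
</div>
"""

def get_spec(element, selector, default='-'):
    """Return the text that follows the icon matched by selector."""
    icon = element.select_one(selector)
    if icon is None or icon.next_sibling is None:
        return default
    return str(icon.next_sibling).strip() or default

soup = BeautifulSoup(html, 'html.parser')
rows = []
for car in soup.find_all('div', class_='result-contain'):
    # Each value is looked up by class, so a missing spec cannot shift the others.
    rows.append([
        get_spec(car, '.location-pin-4'),  # gearbox
        get_spec(car, '.gas-1'),           # fuel
        get_spec(car, '.gauge-1'),         # mileage
    ])
print(rows)
```

Because every value is selected by its own class, the second listing yields '-' for gearbox and mileage while the fuel type still lands in the fuel column.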


Answer 1
2 2019-01-30 09:57:11

Answer 2
0 2019-01-30 15:29:34
