Remove unwanted characters from string with BeautifulSoup Python when selecting words from string
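I'm new to Python and still don't understand everything and how it works, but I'm getting close to what I'm trying to achieve.
Essentially, my program scrapes the data I want from the website, but when it prints the selected words/items from the "specs" string, it also prints characters such as [ ] and '' from the string.
In this example I'm trying to get the "gearbox" type, "fuel" type and "mileage" from a list of li elements, which I converted to strings so that I can then pick specific items out of them.
What I get with the current program is: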
['Manual'] ['Petrol'] ['86,863 miles']
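What I want to achieve is a printed result like this:
Manual, Petrol, 86,863 miles
When exported to the .csv, each value should land in its own column under the matching header.
I tried .text to keep only the text, but that fails with the error: 'list' object has no attribute 'text'.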
import csv
import requests
from bs4 import BeautifulSoup
outfile = open('pistonheads.csv', 'w', newline='')
writer = csv.writer(outfile)
writer.writerow(["Link", "Make", "Model", "Price", "Image Link",
                 "Gearbox", "Fuel", "Mileage"])
url = 'https://www.pistonheads.com/classifieds?Category=used-cars&Page=1&ResultsPerPage=100'
get_url = requests.get(url)
get_text = get_url.text
soup = BeautifulSoup(get_text, 'html.parser')
car_link = soup.find_all('div', 'listing-headline', 'price')
for div in car_link:
    links = div.findAll('a')
    for a in links:
        link = ("https://www.pistonheads.com" + a['href'])
        make = (a['href'].split('/')[-4])
        model = (a['href'].split('/')[-3])
        price = a.find('span').text.rstrip()
        image_link = a.parent.parent.find('img')['src']
        image = ("https:") + image_link
        vehicle_details = a.parent.parent.find('ul', class_='specs')
        specs = list(vehicle_details.stripped_strings)
        gearbox = specs[3:]
        fuel = specs[1:2]
        mileage = specs[0:1]
        writer.writerow([link, make, model, price, image, gearbox, fuel, mileage])
        print(link, make, model, price, image, gearbox, fuel, mileage)
outfile.close()
Welcome to StackOverflow!
There is a lot that could be improved in your script, but you are getting there!
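specs = list(vehicle_details.stripped_strings)
Here vehicle_details.stripped_strings is a generator, which list() resolves into a list. That means you can access what you want by index. For example, mileage can simply be specs[0]. The [ and ] come from your use of slicing: mileage = specs[0:1]. As the documentation says, indexing returns an item, while slicing returns a new list. See the introduction to lists in the Python documentation.
Assuming specs always holds four items, you can either unpack them all at once or index them one at a time:
mileage, fuel, _, gearbox = specs
mileage = specs[0]
To check what specs actually contains, you can drop into the debugger:
import pdb; pdb.set_trace()  # temp set on one line so you can remove it easily after
# now you can interactively inspect your code
(Pdb) specs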
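To make the difference between indexing and slicing concrete, here is a minimal sketch that uses only the standard library. The sample specs list is hypothetical (in particular, the third item is an assumed placeholder for whatever the site puts there); it only stands in for what list(vehicle_details.stripped_strings) might return.

```python
import csv
import io

# Hypothetical sample; the real list comes from vehicle_details.stripped_strings.
specs = ["86,863 miles", "Petrol", "197 bhp", "Manual"]

sliced = specs[0:1]   # slicing returns a new one-element list
indexed = specs[0]    # indexing returns the string itself

# Unpacking works only when the list has exactly four items.
mileage, fuel, _, gearbox = specs

# Write to an in-memory buffer to see what ends up in each CSV cell.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow([sliced, indexed])          # the list is written via str(), brackets included
writer.writerow([gearbox, fuel, mileage])   # plain strings, one per column
csv_text = buffer.getvalue()
print(csv_text)
```

The first row shows why the brackets appear in the CSV: csv.writer converts the one-element list to its string form, while the second row contains the bare values.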
Good luck! Enjoy Python!
If you want to get a string out of the list, you could do it like this:
gearbox = specs[3:][0] if specs[3:] else '-'
fuel = specs[1:2][0] if specs[1:2] else '-'
mileage = specs[0:1][0] if specs[0:1] else '-'
However, this approach, like aldnav's answer, can give wrong results or even raise an error:
ValueError: not enough values to unpack
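As a rough illustration of that failure, the sketch below unpacks a listing with fewer specs than expected and then falls back to a small helper. The helper name pick and the sample data are hypothetical, not part of either answer.

```python
def pick(items, index, default="-"):
    """Return items[index], or default when the list is too short."""
    return items[index] if -len(items) <= index < len(items) else default

# Hypothetical listing that only provides two of the four expected specs.
short_specs = ["86,863 miles", "Petrol"]

try:
    mileage, fuel, _, gearbox = short_specs
except ValueError as exc:
    error = str(exc)  # unpacking needs exactly four items
    print(error)

mileage = pick(short_specs, 0)
fuel = pick(short_specs, 1)
gearbox = pick(short_specs, 3)  # falls back to '-'
print(gearbox, fuel, mileage, sep=", ")
```

This only guards against lists that are too short. If a listing omits a spec in the middle, the remaining items shift position and a purely positional lookup will still return the wrong value.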
Usually I extract the parent container first, rather than selecting the child (a) and then navigating back up to the parent.
# helper to get dynamic specs element
def getSpec(element, selector):
    spec = element.select_one(selector)
    return spec.next_sibling.string.strip() if spec else '-'
soup = BeautifulSoup(get_text, 'html.parser')
results = soup.find_all('div', class_="result-contain")
for car in results:
    a = car.find('a')
    if not a:
        continue
    link = ("https://www.pistonheads.com" + a['href'])
    make = (a['href'].split('/')[-4])
    model = (a['href'].split('/')[-3])
    price = a.find('span').text.rstrip()
    image_link = car.find('img')['src']
    image = ("https:") + image_link
    if not car.find('ul', class_='specs'):
        gearbox = fuel = mileage = '-'
    else:
        gearbox = getSpec(car, '.location-pin-4')
        fuel = getSpec(car, '.gas-1')
        mileage = getSpec(car, '.gauge-1')
    print(gearbox, fuel, mileage)
    writer.writerow([link, make, model, price, image, gearbox, fuel, mileage])
    #print(link, make, model, price, image, gearbox, fuel, mileage)
outfile.close()
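To try the icon-based lookup without hitting the site, here is a small self-contained sketch. The HTML below is an assumption made for illustration: the class names are taken from the selectors above, but the real page structure may differ. The helper get_spec is a hypothetical variant of getSpec that also tolerates a missing sibling.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: class names match the selectors used above,
# the surrounding structure is assumed.
SAMPLE_HTML = """
<div class="result-contain">
  <ul class="specs">
    <li><i class="gauge-1"></i> 86,863 miles</li>
    <li><i class="gas-1"></i> Petrol</li>
  </ul>
</div>
"""

def get_spec(element, selector, default="-"):
    """Return the text following the icon matched by selector, or default."""
    icon = element.select_one(selector)
    if icon is None or icon.next_sibling is None:
        return default
    return str(icon.next_sibling).strip() or default

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
car = soup.find("div", class_="result-contain")

mileage = get_spec(car, ".gauge-1")
fuel = get_spec(car, ".gas-1")
gearbox = get_spec(car, ".location-pin-4")  # not present in the sample
print(gearbox, fuel, mileage, sep=", ")
```

Because each value is looked up by its icon class rather than by position, a missing spec yields '-' for that column only and does not shift the others.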