Having some issues with Python exceptions in my script
I'm trying to scrape data from a few websites for a proof-of-concept project, currently using Python 3 and BS4 to gather the required data. I have a dictionary of URLs from three sites. Each site needs a different method to collect the data because their HTML differs. I've been using a "try, except, else" stack, but I keep running into problems — if you could look over my code and help me fix it, that would be great!
As I add more sites to scrape, I won't be able to keep cycling through methods with "try, except, else" to find the one that works. How can I future-proof this code so that I can add as many sites as I like and scrape data from the various elements they contain?
# Scraping Script Here:
import re

import requests
from bs4 import BeautifulSoup
from lxml import etree


def job():
    prices = {
        # LIVEPRICES
        "LIVEAUOZ": {"url": "https://www.gold.co.uk/",
                     "trader": "Gold.co.uk",
                     "metal": "Gold",
                     "type": "LiveAUOz"},
        # GOLD
        "GLDAU_BRITANNIA": {"url": "https://www.gold.co.uk/gold-coins/gold-britannia-coins/britannia-one-ounce-gold-coin-2020/",
                            "trader": "Gold.co.uk",
                            "metal": "Gold",
                            "type": "Britannia"},
        "GLDAU_PHILHARMONIC": {"url": "https://www.gold.co.uk/gold-coins/austrian-gold-philharmoinc-coins/austrian-gold-philharmonic-coin/",
                               "trader": "Gold.co.uk",
                               "metal": "Gold",
                               "type": "Philharmonic"},
        "GLDAU_MAPLE": {"url": "https://www.gold.co.uk/gold-coins/canadian-gold-maple-coins/canadian-gold-maple-coin/",
                        "trader": "Gold.co.uk",
                        "metal": "Gold",
                        "type": "Maple"},
        # SILVER
        "GLDAG_BRITANNIA": {"url": "https://www.gold.co.uk/silver-coins/silver-britannia-coins/britannia-one-ounce-silver-coin-2020/",
                            "trader": "Gold.co.uk",
                            "metal": "Silver",
                            "type": "Britannia"},
        "GLDAG_PHILHARMONIC": {"url": "https://www.gold.co.uk/silver-coins/austrian-silver-philharmonic-coins/silver-philharmonic-2020/",
                               "trader": "Gold.co.uk",
                               "metal": "Silver",
                               "type": "Philharmonic"}
    }

    response = requests.get('https://www.gold.co.uk/silver-price/')
    soup = BeautifulSoup(response.text, 'html.parser')
    AG_GRAM_SPOT = soup.find(
        'span', {'name': 'current_price_field'}).get_text()
    # Convert to float
    AG_GRAM_SPOT = float(re.sub(r"[^0-9\.]", "", AG_GRAM_SPOT))
    # No need for another lookup
    AG_OUNCE_SPOT = AG_GRAM_SPOT * 31.1035  # grams per troy ounce

    for coin in prices:
        response = requests.get(prices[coin]["url"])
        soup = BeautifulSoup(response.text, 'html.parser')
        try:
            text_price = soup.find(
                'td', {'id': 'total-price-inc-vat-1'}).get_text()  # <-- Method 1
        except:
            text_price = soup.find(
                'td', {'id': 'total-price-inc-vat-1'}).get_text()  # <-- Method 2
        else:
            text_price = soup.find(
                'td', {'class': 'gold-price-per-ounce'}).get_text()
        # Grab the number
        prices[coin]["price"] = float(re.sub(r"[^0-9\.]", "", text_price))

    # ========================================================================
    root = etree.Element("root")
    for coin in prices:
        coinx = etree.Element("coin")
        etree.SubElement(coinx, "trader", {
            'variable': coin}).text = prices[coin]["trader"]
        etree.SubElement(coinx, "metal").text = prices[coin]["metal"]
        etree.SubElement(coinx, "type").text = prices[coin]["type"]
        etree.SubElement(coinx, "price").text = "£" + str(prices[coin]["price"])
        root.append(coinx)

    fName = './templates/data.xml'
    with open(fName, 'wb') as f:
        f.write(etree.tostring(root, xml_declaration=True,
                               encoding="utf-8", pretty_print=True))
Add a config for each scrape, where each entry looks like this:
prices = {
    "LIVEAUOZ": {
        "url": "https://www.gold.co.uk/",
        "trader": "Gold.co.uk",
        "metal": "Gold",
        "type": "LiveAUOz",
        "price": {
            "selector": '#id > div > table > tr',
            "parser": lambda x: float(re.sub(r"[^0-9\.]", "", x))
        }
    }
}
Use the selector part of price to grab the relevant part of the HTML, then run it through the parser function.
For example:
for key, config in prices.items():
    response = requests.get(config['url'])
    soup = BeautifulSoup(response.text, 'html.parser')
    # The selector is a CSS selector, so use select_one(), not find()
    price_element = soup.select_one(config['price']['selector'])
    if price_element:
        AG_GRAM_SPOT = price_element.get_text()
        # Convert to float
        AG_GRAM_SPOT = config['price']['parser'](AG_GRAM_SPOT)
        # etc.
You can modify the config object as needed, but it will probably look very similar for most sites. For example, the text parsing will most likely always be the same, so create a named function with def instead of a lambda.
def textParser(text):
    return float(re.sub(r"[^0-9\.]", "", text))
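For instance, textParser handles both plain and comma-separated price strings (the sample values below are made up for illustration):

```python
import re

def textParser(text):
    # Drop everything except digits and the decimal point
    return float(re.sub(r"[^0-9\.]", "", text))

print(textParser("£23.45"))     # 23.45
print(textParser("£1,234.56"))  # 1234.56
```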
Then add a reference to textParser in the config.
prices = {
    "LIVEAUOZ": {
        "url": "https://www.gold.co.uk/",
        "trader": "Gold.co.uk",
        "metal": "Gold",
        "type": "LiveAUOz",
        "price": {
            "selector": '#id > div > table > tr',
            "parser": textParser
        }
    }
}
These steps will let you write generic code and spare you all those try/excepts.
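Putting the pieces together, here is a minimal runnable sketch of the config-driven loop. It parses a hard-coded HTML snippet instead of fetching a live page, and the span.price selector and the snippet itself are made-up stand-ins, not the real markup of gold.co.uk:

```python
import re
from bs4 import BeautifulSoup

def textParser(text):
    # Shared parser: keep only digits and the decimal point
    return float(re.sub(r"[^0-9\.]", "", text))

# Hypothetical config entry; selector and HTML below are illustrative only
prices = {
    "LIVEAUOZ": {
        "url": "https://www.gold.co.uk/",
        "trader": "Gold.co.uk",
        "metal": "Gold",
        "type": "LiveAUOz",
        "price": {"selector": "span.price", "parser": textParser},
    },
}

# Stand-in for requests.get(config['url']).text
html = '<html><body><span class="price">£23.45</span></body></html>'

for key, config in prices.items():
    soup = BeautifulSoup(html, "html.parser")
    element = soup.select_one(config["price"]["selector"])
    if element:
        prices[key]["value"] = config["price"]["parser"](element.get_text())

print(prices["LIVEAUOZ"]["value"])  # 23.45
```

Adding a new site is then just a new dictionary entry with its own selector and, if needed, its own parser function — no new try/except branch.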