
Extracting HTML using Python Beautiful Soup is not working

I want to scrape information that is organized by state and city.

This is the Python script I am using:

import requests
import html5lib
from bs4 import BeautifulSoup
from selenium.webdriver.common.keys import Keys
from selenium import webdriver
import urllib3
import pyrebase
import numpy as np
import yagmail
import time
import math
import colorama
import sys
from algoliasearch import algoliasearch

from datetime import datetime, timedelta

def getVendors():
    # NOTE: `driver` is a Selenium WebDriver, presumably created elsewhere in the full script (not shown here).
    req = requests.Session()

    defaultlink = 'https://www.collierreporting.com/'

    driver.get(defaultlink)

    vendorsoup = BeautifulSoup(driver.page_source, "html5lib")

    # GET STATES
    statecontainer = vendorsoup.find_all("li")

    for state in statecontainer:

        stateref = state.find('a')['href']
        statename = state.find('a').contents[0]

        driver.get(stateref)
        statesoup = BeautifulSoup(driver.page_source, "html5lib")

        # GET CITIES
        citycontainer = statesoup.find_all("p")

        for city in citycontainer:
            cityref = city.find('a')['href']  # this is the line that raises the TypeError
            cityname = city.find('a')

            print(cityref, cityname)

        print(statename)

    print('Get vendors')

getVendors()

I am able to scrape the states from this HTML:

<div class="content">
  <div class="column_1">
    <ul>
      <li><a href="https://www.collierreporting.com/state/al">Alabama</a></li>
      <li><a href="https://www.collierreporting.com/state/ak">Alaska</a></li>
      <li><a href="https://www.collierreporting.com/state/az">Arizona</a></li>
      <li><a href="https://www.collierreporting.com/state/ak">Arkansas</a></li>
      <li><a href="https://www.collierreporting.com/state/ca">California</a></li>
    </ul>
  </div>
</div>

But when I try to scrape the cities from this HTML, it does not work:

<div class="content">
  <div class="column_1">
    <ul>
      <div style="margin-left: 20px;">
        <span style="font-style: italic;">Select a city to view dossiers.</span>
        <p style="margin-bottom: 7px; margin-top: 10px;"><a href="https://www.collierreporting.com/city/alabaster-al">Alabaster</a></p>
        <p style="margin-bottom: 7px; margin-top: 10px;"><a href="https://www.collierreporting.com/city/alexander-city-al">Alexander City</a></p>
        <p style="margin-bottom: 7px; margin-top: 10px;"><a href="https://www.collierreporting.com/city/alexandria-al">Alexandria</a></p>
        <p style="margin-bottom: 7px; margin-top: 10px;"><a href="https://www.collierreporting.com/city/aliceville-al">Aliceville</a></p>
        <p style="margin-bottom: 7px; margin-top: 10px;"><a href="https://www.collierreporting.com/city/andalusia-al">Andalusia</a></p>
        <p style="margin-bottom: 7px; margin-top: 10px;"><a href="https://www.collierreporting.com/city/anniston-al">Anniston</a></p>
        <p style="margin-bottom: 7px; margin-top: 10px;"><a href="https://www.collierreporting.com/city/arab-al">Arab</a></p>
        <p style="margin-bottom: 7px; margin-top: 10px;"><a href="https://www.collierreporting.com/city/ardmore-al">Ardmore</a></p>
        <p style="margin-bottom: 7px; margin-top: 10px;"><a href="https://www.collierreporting.com/city/ashford-al">Ashford</a></p>
      </div>
    </ul>
  </div>
</div>

This is the error I get, and I cannot figure out the cause:

Traceback (most recent call last):
  File "vendors.py", line 120, in <module>
    getVendors()
  File "vendors.py", line 101, in getVendors
    cityref = city.find('a')['href']
TypeError: 'NoneType' object is not subscriptable

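I don't know why this is not working. I have tried multiple variations of getting the href and the city name, but all I get is the same "object is not subscriptable" error.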
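A likely cause, which the HTML excerpt above cannot confirm, is that the full state page contains at least one <p> element with no <a> inside it. For such an element find('a') returns None, and subscripting None raises exactly this TypeError. The sketch below reproduces the situation on a small hypothetical snippet (the link-less first <p> and the function name extract_city_links are assumptions for illustration) and skips paragraphs without a link. It uses the built-in "html.parser" so it runs without html5lib or Selenium.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the first <p> has no link, as an intro or footer paragraph might.
SAMPLE_HTML = """
<div class="content">
  <p>Select a city to view dossiers.</p>
  <p><a href="https://www.collierreporting.com/city/alabaster-al">Alabaster</a></p>
  <p><a href="https://www.collierreporting.com/city/arab-al">Arab</a></p>
</div>
"""


def extract_city_links(html):
    """Return (href, name) pairs for every <p> that contains a link."""
    soup = BeautifulSoup(html, "html.parser")
    cities = []
    for paragraph in soup.find_all("p"):
        link = paragraph.find("a")
        if link is None:
            # find() returns None when nothing matches; skip instead of subscripting it.
            continue
        cities.append((link.get("href"), link.get_text(strip=True)))
    return cities


city_links = extract_city_links(SAMPLE_HTML)
for href, name in city_links:
    print(href, name)
```

The same guard would apply to the state loop, since any <li> without a link would fail in the same way.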

I changed citycontainer to find all <a> tags instead, and was able to get the following to work:

citycontainer = statesoup.find_all("a")

for city in citycontainer:
    cityref = city['href']
    cityname = city.contents[0]

I don't know why this behaves differently, but it works.
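One plausible reason for the difference: find_all("a") only ever yields <a> elements, so there is never a None to subscript. The trade-off is that it also returns every other link on the page. A minimal sketch of a narrower query follows, assuming the city links always sit directly inside a <p> and have "/city/" in their href, as in the excerpt above. The sample markup and the name select_city_links are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical markup mixing a state link, two city links, and a link-less paragraph.
SAMPLE_HTML = """
<ul>
  <li><a href="https://www.collierreporting.com/state/al">Alabama</a></li>
</ul>
<p><a href="https://www.collierreporting.com/city/alabaster-al">Alabaster</a></p>
<p><a href="https://www.collierreporting.com/city/alexander-city-al">Alexander City</a></p>
<p>No link here</p>
"""


def select_city_links(html):
    """Return (href, name) pairs for <a> tags that are direct children of a <p>
    and whose href contains "/city/"."""
    soup = BeautifulSoup(html, "html.parser")
    # The CSS selector matches only anchors, so no None check is needed.
    anchors = soup.select('p > a[href*="/city/"]')
    return [(a["href"], a.get_text(strip=True)) for a in anchors]


selected = select_city_links(SAMPLE_HTML)
for href, name in selected:
    print(href, name)
```

The selector ignores both the state link and the paragraph without a link, so the loop only sees city entries.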
