Extracting HTML with Python Beautiful Soup is not working

I want to scrape information that is organized by state and city.

This is the Python script I am using:

import requests
import html5lib
from bs4 import BeautifulSoup
from selenium.webdriver.common.keys import Keys
from selenium import webdriver
import urllib3
import pyrebase
import numpy as np
import yagmail
import time
import math
import colorama
import sys
from algoliasearch import algoliasearch

from datetime import datetime, timedelta

def getVendors():
    req = requests.Session()

    defaultlink = 'https://www.collierreporting.com/'

    driver.get(defaultlink)

    vendorsoup = BeautifulSoup(driver.page_source,"html5lib");

    statecontainer = vendorsoup.find_all("li")

    for state in statecontainer:

        stateref = state.find('a')['href']
        statename = state.find('a').contents[0]

        driver.get(stateref)
        statesoup = BeautifulSoup(driver.page_source,"html5lib");

        #GET CITIES
        citycontainer = statesoup.find_all("p")

        for city in citycontainer:
            cityref = city.find('a')['href']
            cityname = city.find('a')

            print( cityref, cityname)

        print(statename)

    print('Get vendors')

getVendors()

I am able to scrape the states from this HTML:

<div class="content">
  <div class="column_1">
    <ul>
      <li><a href="https://www.collierreporting.com/state/al">Alabama</a></li>
      <li><a href="https://www.collierreporting.com/state/ak">Alaska</a></li>
      <li><a href="https://www.collierreporting.com/state/az">Arizona</a></li>
      <li><a href="https://www.collierreporting.com/state/ak">Arkansas</a></li>
      <li><a href="https://www.collierreporting.com/state/ca">California</a></li>
    </ul>
  </div>
</div>

But when I try to scrape the cities from this HTML, it does not work:

<div class="content">
  <div class="column_1">
    <ul>
      <div style="margin-left: 20px;">
        <span style="font-style: italic;">Select a city to view dossiers.</span>
        <p style="margin-bottom: 7px; margin-top: 10px;"><a href="https://www.collierreporting.com/city/alabaster-al">Alabaster</a></p>
        <p style="margin-bottom: 7px; margin-top: 10px;"><a href="https://www.collierreporting.com/city/alexander-city-al">Alexander City</a></p>
        <p style="margin-bottom: 7px; margin-top: 10px;"><a href="https://www.collierreporting.com/city/alexandria-al">Alexandria</a></p>
        <p style="margin-bottom: 7px; margin-top: 10px;"><a href="https://www.collierreporting.com/city/aliceville-al">Aliceville</a></p>
        <p style="margin-bottom: 7px; margin-top: 10px;"><a href="https://www.collierreporting.com/city/andalusia-al">Andalusia</a></p>
        <p style="margin-bottom: 7px; margin-top: 10px;"><a href="https://www.collierreporting.com/city/anniston-al">Anniston</a></p>
        <p style="margin-bottom: 7px; margin-top: 10px;"><a href="https://www.collierreporting.com/city/arab-al">Arab</a></p>
        <p style="margin-bottom: 7px; margin-top: 10px;"><a href="https://www.collierreporting.com/city/ardmore-al">Ardmore</a></p>
        <p style="margin-bottom: 7px; margin-top: 10px;"><a href="https://www.collierreporting.com/city/ashford-al">Ashford</a></p>
      </div>
    </ul>
  </div>
</div>

This is the error I get, and I cannot figure out the cause:

Traceback (most recent call last):
  File "vendors.py", line 120, in <module>
    getVendors()
  File "vendors.py", line 101, in getVendors
    cityref = city.find('a')['href']
TypeError: 'NoneType' object is not subscriptable

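I don't know why this isn't working. I tried several variations for getting the href and the city name, but every one of them fails with the same "'NoneType' object is not subscriptable" error.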
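A likely cause, assuming the live state page contains at least one <p> with no link inside it (the excerpt above shows only paragraphs that do have one): find_all("p") returns every paragraph on the page, find('a') returns None for a paragraph without an anchor, and indexing None with ['href'] raises the TypeError. The sketch below reproduces that on a made-up fragment and skips such paragraphs. The sample HTML and the helper name extract_city_links are hypothetical, and html.parser is used instead of html5lib only to avoid an extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: the second <p> has no <a>, which is assumed
# to be what happens somewhere on the real state page.
SAMPLE_HTML = """
<div class="content">
  <p><a href="https://www.collierreporting.com/city/alabaster-al">Alabaster</a></p>
  <p>Paragraph without a link</p>
  <p><a href="https://www.collierreporting.com/city/arab-al">Arab</a></p>
</div>
"""


def extract_city_links(html):
    """Return (href, name) pairs for every <p> that contains a link."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for paragraph in soup.find_all("p"):
        anchor = paragraph.find("a")
        # find() returns None when nothing matches; subscripting that
        # None is what raises "'NoneType' object is not subscriptable".
        if anchor is None or not anchor.has_attr("href"):
            continue
        links.append((anchor["href"], anchor.get_text(strip=True)))
    return links


for href, name in extract_city_links(SAMPLE_HTML):
    print(href, name)
```

Printing the paragraphs for which find('a') returns None on the real page would confirm or rule out this explanation.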

I changed citycontainer to find all <a> tags instead, and with that I was able to get the following to work:

citycontainer = statesoup.find_all("a")

for city in citycontainer:
    cityref = city['href']
    cityname = city.contents[0]

I don't know why this behaves differently, but it works.
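One possible explanation for the difference: with find_all("a"), each item in the loop is already the <a> tag, so there is no nested find() call that can return None. The trade-off is that it collects every link on the page, not only the city links. A narrower sketch, assuming city links are always direct children of a <p> as in the excerpt above, uses a CSS selector. The sample fragment and the function name city_links are made up for illustration, and html.parser stands in for html5lib.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: one navigation link outside any <p>, plus
# city paragraphs like the ones in the question.
SAMPLE_HTML = """
<a href="https://www.collierreporting.com/">Home</a>
<div class="content">
  <p><a href="https://www.collierreporting.com/city/anniston-al">Anniston</a></p>
  <p>No link here</p>
  <p><a href="https://www.collierreporting.com/city/ashford-al">Ashford</a></p>
</div>
"""


def city_links(html):
    """Return (href, name) pairs for links that sit directly inside a <p>."""
    soup = BeautifulSoup(html, "html.parser")
    # "p > a[href]" matches only <a> tags that carry an href attribute
    # and are direct children of a <p>, so no None check is needed.
    return [(a["href"], a.get_text(strip=True)) for a in soup.select("p > a[href]")]


for href, name in city_links(SAMPLE_HTML):
    print(href, name)
```

If the real page also wraps unrelated links in <p> tags, filtering on the href (for example, keeping only URLs that contain "/city/") would be a further refinement.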
