
Beautiful Soup nested div (Adding extra function)

I am trying to extract the company name, address, and zip code from www.quicktransportsolutions.com. I have written the following code to crawl the site and return the information I need.

import requests
from bs4 import BeautifulSoup

def trade_spider(max_pages):
    page = 1
    while page <= max_pages:
        url = 'http://www.quicktransportsolutions.com/carrier/missouri/adrian.php'
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, 'html.parser')
        for link in soup.findAll('div', {'class': 'well well-sm'}):
            title = link.string
            print(link)
        page += 1  # advance the counter so the while loop can end
trade_spider(1)

After running the code, I see the information that I want, but I am confused about how to print it without all of the non-pertinent information.

Above the print(link) line, I thought that I could have link.string pull the company names, but that failed. Any suggestions?
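For context, here is a minimal sketch of why link.string comes back empty. In Beautiful Soup, .string is only set when a tag has exactly one child, and each "well well-sm" div has many. The HTML below is a trimmed-down copy of one company block from the output, and the variable names are just for illustration.

```python
from bs4 import BeautifulSoup

# Trimmed-down version of one company block (an assumption for this demo,
# not the full markup the site returns).
html = (
    '<div class="well well-sm">'
    '<b>2 OLD BOYS TRUCKING LLC</b><br>'
    '<span itemprop="name"><b>2 OLD BOYS TRUCKING</b></span>'
    '</div>'
)

soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", {"class": "well well-sm"})

# The div has several children, so .string cannot pick one and gives None.
string_value = div.string

# A tag with a single text child does have a usable .string.
first_bold = div.find("b").string

# get_text() works on any tag and concatenates the text it contains.
name = div.find("span", itemprop="name").get_text(strip=True)

print(string_value, first_bold, name)
```

So the usual approach is to narrow down to a specific child tag first and then read its text.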

Output:

<div class="well well-sm">
<b>2 OLD BOYS TRUCKING LLC</b><br><a href="/truckingcompany/missouri/2-old-boys-trucking-usdot-2474795.php" itemprop="url" target="_blank" title="Missouri Trucking Company 2 OLD BOYS TRUCKING ADRIAN"><u><span itemprop="name"><b>2 OLD BOYS TRUCKING</b></span></u></a><br> <span itemprop="address" itemscope="" itemtype="http://schema.org/PostalAddress"><a href="http://maps.google.com/maps?q=227+E+2ND,ADRIAN,MO+64720&amp;ie=UTF8&amp;z=8&amp;iwloc=addr" target="_blank"><span itemprop="streetAddress">227 E 2ND</span></a>
<br>
<span itemprop="addressLocality">Adrian</span>, <span itemprop="addressRegion">MO</span> <span itemprop="postalCode">64720</span></br></span><br>
                Trucks: 2       Drivers: 2<br>
<abbr class="initialism" title="Unique Number to identify Companies operating commercial vehicles to transport passengers or haul cargo in interstate commerce">USDOT</abbr> 2474795                <br><span class="glyphicon glyphicon-phone"></span><b itemprop="telephone"> 417-955-0651</b>
<br><a href="/inspectionreports/2-old-boys-trucking-usdot-2474795.php" itemprop="url" target="_blank" title="Trucking Company 2 OLD BOYS TRUCKING Inspection Reports">

Everyone,

Thanks for the help so far... I'm trying to add an extra function to my little crawler. I have written the following code:

def Crawl_State_Page(max_pages):
    url = 'http://www.quicktransportsolutions.com/carrier/alabama/trucking-companies.php'
    while i <= len(url):
        response = requests.get(url)
        soup = BeautifulSoup(response.content)
        table = soup.find("table", {"class" : "table table-condensed table-striped table-hover table-bordered"})
        for link in table.find_all(href=True):
            print(link['href'])

Output: 

    abbeville.php
    adamsville.php
    addison.php
    adger.php
    akron.php
    alabaster.php
    alberta.php
    albertville.php
    alexander-city.php
    alexandria.php
    aliceville.php
    alpine.php

... # goes all the way to Z; I cut the output short to save space.

What I'm trying to accomplish here is to pull every href of the form city.php and write it to a file. But right now I am stuck in an infinite loop where it keeps cycling through the same URL. Any tips on how to increment it? My end goal is to create another function that feeds www.site.com/state/city.php URLs back into my trade_spider and then loops through all 50 states. Something to the effect of

while i < len(states,cities):
    url = "http://www.quicktransportsolutions.com/carrier" + states + cities[i] +" 

And then this would loop into my trade_spider function, pulling all of the information that I needed.
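A rough sketch of that URL-building step, assuming the city pages follow the carrier/<state>/<city>.php pattern seen in the two URLs above. The names BASE, build_city_urls, and cities_by_state are made up for this example, and the sample dictionary is placeholder data rather than a real crawl result.

```python
from urllib.parse import urljoin

# Assumed base path, taken from the URLs used elsewhere in this question.
BASE = "http://www.quicktransportsolutions.com/carrier/"

def build_city_urls(cities_by_state):
    """Return one full URL per (state, city page) pair."""
    urls = []
    for state, city_pages in cities_by_state.items():
        for city_page in city_pages:
            # urljoin appends the relative path to the base directory.
            urls.append(urljoin(BASE, state + "/" + city_page))
    return urls

# Placeholder input: state name -> hrefs collected from that state's page.
sample = {
    "alabama": ["abbeville.php", "adamsville.php"],
    "missouri": ["adrian.php"],
}
urls = build_city_urls(sample)
for u in urls:
    print(u)
```

Iterating with for over the collections avoids keeping a manual counter, so there is no index to forget to increment.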

But, before I get to that part, I need a bit of help getting out of my infinite loop. Any suggestions? Or foreseeable issues that I am going to run into?
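One possible way out, sketched under some assumptions: nothing in the body of that while loop changes i or url, so the condition can never become false. A single state page only needs one request, so the loop can probably be dropped entirely. The helper names extract_city_links and save_links are hypothetical, and the sample HTML is a tiny stand-in for the real table.

```python
from bs4 import BeautifulSoup

TABLE_CLASS = "table table-condensed table-striped table-hover table-bordered"

def extract_city_links(html):
    """Return every href found inside the city table of one state page."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", {"class": TABLE_CLASS})
    if table is None:
        return []  # page layout differs from what we expect
    return [a["href"] for a in table.find_all("a", href=True)]

def save_links(links, path):
    """Write one link per line to the given file."""
    with open(path, "w") as f:
        for link in links:
            f.write(link + "\n")

# Stand-in markup (an assumption); the real page would come from requests.get(url).text.
sample_html = """
<table class="table table-condensed table-striped table-hover table-bordered">
<tr><td><a href="abbeville.php">Abbeville</a></td>
<td><a href="adamsville.php">Adamsville</a></td></tr>
</table>
"""
links = extract_city_links(sample_html)
print(links)
```

To cover several states, a for loop over a list of state page URLs would call extract_city_links once per page, which terminates on its own when the list is exhausted.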

I tried to create a crawler that would cycle through every link on the page, and then if it found content on the page that trade_spider could crawl, it would write it to a file. However, that was a bit out of my skill set for now, so I'm trying this method.

I would rely on the itemprop attributes of the different tags for each company. They are conveniently set for name, url, address, etc.:

import requests
from bs4 import BeautifulSoup

def trade_spider(max_pages):
    page = 1
    while page <= max_pages:
        # The URL is fixed, so every iteration fetches the same page.
        url = 'http://www.quicktransportsolutions.com/carrier/missouri/adrian.php'
        response = requests.get(url)
        soup = BeautifulSoup(response.content, 'html.parser')
        for company in soup.find_all('div', {'class': 'well well-sm'}):
            # Each field is marked up with its own itemprop attribute.
            link = company.find('a', itemprop='url').get('href').strip()
            name = company.find('span', itemprop='name').text.strip()
            address = company.find('span', itemprop='address').text.strip()

            print(name, link, address)
            print("----")
        page += 1  # without this the while loop never terminates

trade_spider(1)

Prints:

2 OLD BOYS TRUCKING /truckingcompany/missouri/2-old-boys-trucking-usdot-2474795.php 227 E 2ND

Adrian, MO 64720
----
HILLTOP SERVICE & EQUIPMENT /truckingcompany/missouri/hilltop-service-equipment-usdot-1047604.php ROUTE 2 BOX 453

Adrian, MO 64720
----
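If the zip code is needed on its own, the nested itemprop spans inside the address block (streetAddress, addressLocality, addressRegion, postalCode, all visible in the output posted in the question) can probably be read the same way. This is only a sketch against a trimmed copy of that markup; itemprop_text and parse_company are hypothetical helper names.

```python
from bs4 import BeautifulSoup

# Trimmed copy of one company block from the question's output (an assumption
# that the live page still looks like this).
html = """<div class="well well-sm">
<b>2 OLD BOYS TRUCKING LLC</b><br>
<a href="/truckingcompany/missouri/2-old-boys-trucking-usdot-2474795.php" itemprop="url"><span itemprop="name"><b>2 OLD BOYS TRUCKING</b></span></a><br>
<span itemprop="address" itemscope="" itemtype="http://schema.org/PostalAddress">
<span itemprop="streetAddress">227 E 2ND</span><br>
<span itemprop="addressLocality">Adrian</span>, <span itemprop="addressRegion">MO</span> <span itemprop="postalCode">64720</span></span>
</div>"""

def itemprop_text(tag, prop):
    """Text of the first descendant with the given itemprop, or None if absent."""
    found = tag.find(attrs={"itemprop": prop})
    return found.get_text(strip=True) if found else None

def parse_company(div):
    """Collect the individual fields of one company block into a dict."""
    return {
        "name": itemprop_text(div, "name"),
        "street": itemprop_text(div, "streetAddress"),
        "city": itemprop_text(div, "addressLocality"),
        "state": itemprop_text(div, "addressRegion"),
        "zipcode": itemprop_text(div, "postalCode"),
    }

soup = BeautifulSoup(html, "html.parser")
company = parse_company(soup.find("div", {"class": "well well-sm"}))
print(company)
```

Returning None for a missing field keeps one incomplete listing from raising an AttributeError and stopping the whole crawl.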
