I am trying to extract the company name, address, and zip code from [www.quicktransportsolutions.com][1]. I have written the following code to crawl the site and return the information I need.
import requests
from bs4 import BeautifulSoup

def trade_spider(max_pages):
    page = 1
    while page <= max_pages:
        url = 'http://www.quicktransportsolutions.com/carrier/missouri/adrian.php'
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, 'html.parser')
        for link in soup.findAll('div', {'class': 'well well-sm'}):
            title = link.string
            print(link)
        page += 1

trade_spider(1)
After running the code, I see the information that I want, but I am confused about how to get it to print without all of the non-pertinent information. Above the print(link) line, I thought that link.string would pull the company names, but that failed. Any suggestions?
Output:
<div class="well well-sm">
<b>2 OLD BOYS TRUCKING LLC</b><br><a href="/truckingcompany/missouri/2-old-boys-trucking-usdot-2474795.php" itemprop="url" target="_blank" title="Missouri Trucking Company 2 OLD BOYS TRUCKING ADRIAN"><u><span itemprop="name"><b>2 OLD BOYS TRUCKING</b></span></u></a><br> <span itemprop="address" itemscope="" itemtype="http://schema.org/PostalAddress"><a href="http://maps.google.com/maps?q=227+E+2ND,ADRIAN,MO+64720&ie=UTF8&z=8&iwloc=addr" target="_blank"><span itemprop="streetAddress">227 E 2ND</span></a>
<br>
<span itemprop="addressLocality">Adrian</span>, <span itemprop="addressRegion">MO</span> <span itemprop="postalCode">64720</span></br></span><br>
Trucks: 2 Drivers: 2<br>
<abbr class="initialism" title="Unique Number to identify Companies operating commercial vehicles to transport passengers or haul cargo in interstate commerce">USDOT</abbr> 2474795 <br><span class="glyphicon glyphicon-phone"></span><b itemprop="telephone"> 417-955-0651</b>
<br><a href="/inspectionreports/2-old-boys-trucking-usdot-2474795.php" itemprop="url" target="_blank" title="Trucking Company 2 OLD BOYS TRUCKING Inspection Reports">
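As an aside on why link.string failed: BeautifulSoup's .string returns None whenever a tag has more than one child, which each of these divs does; .get_text(), or drilling into a specific child tag, works instead. A minimal illustration with a stripped-down fragment of the div above:

```python
from bs4 import BeautifulSoup

# A tag with several children: .string is None, but a child tag's text is reachable.
soup = BeautifulSoup(
    '<div><b>2 OLD BOYS TRUCKING LLC</b><br/><span>Adrian</span></div>',
    'html.parser')
div = soup.div
print(div.string)        # None -- the div has more than one child
print(div.b.get_text())  # 2 OLD BOYS TRUCKING LLC
```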
Everyone,
Thanks for the help so far... I'm trying to add an extra function to my little crawler. I have written the following code:
def Crawl_State_Page(max_pages):
    url = 'http://www.quicktransportsolutions.com/carrier/alabama/trucking-companies.php'
    while i <= len(url):
        response = requests.get(url)
        soup = BeautifulSoup(response.content)
        table = soup.find("table", {"class": "table table-condensed table-striped table-hover table-bordered"})
        for link in table.find_all(href=True):
            print(link['href'])
Output:
abbeville.php
adamsville.php
addison.php
adger.php
akron.php
alabaster.php
alberta.php
albertville.php
alexander-city.php
alexandria.php
aliceville.php
alpine.php
... # goes all the way to Z I cut the output short for spacing..
What I'm trying to accomplish here is to pull all of the hrefs ending in city.php and write them to a file. But right now, I am stuck in an infinite loop where it keeps cycling through the same URL. Any tips on how to increment it? My end goal is to create another function that feeds back into my trade_spider with www.site.com/state/city.php and then loops through all 50 states... Something to the effect of

while i < len(cities):
    url = "http://www.quicktransportsolutions.com/carrier/" + state + "/" + cities[i]

And then this would loop into my trade_spider function, pulling all of the information that I needed.
But, before I get to that part, I need a bit of help getting out of my infinite loop. Any suggestions? Or foreseeable issues that I am going to run into?
I tried to create a crawler that would cycle through every link on the page and then, if it found content that trade_spider could crawl, write it to a file... However, that was a bit out of my skill set for now, so I'm trying this method.
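For what it's worth, one way out of the loop: i in the condition while i <= len(url) is never initialized or incremented (as written it actually raises a NameError), and since there is only one state URL there is nothing to count anyway. Splitting the parsing into a plain function and iterating with for removes the counter entirely; extract_city_links is an assumed helper name, not anything from the site:

```python
from bs4 import BeautifulSoup

def extract_city_links(html):
    """Return every .php href found in the state page's company table."""
    soup = BeautifulSoup(html, 'html.parser')
    table = soup.find(
        'table',
        {'class': 'table table-condensed table-striped table-hover table-bordered'})
    # Keep only the city pages (abbeville.php, adamsville.php, ...).
    return [a['href'] for a in table.find_all('a', href=True)
            if a['href'].endswith('.php')]
```

Then `links = extract_city_links(requests.get(url).text)` fetches and parses one state page, and writing the list out is a short `with open(...)` block afterward; no while loop and no counter are needed.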
I would rely on the itemprop attributes of the different tags for each company. They are conveniently set for name, url, address, etc.:
import requests
from bs4 import BeautifulSoup

def trade_spider(max_pages):
    page = 1
    while page <= max_pages:
        url = 'http://www.quicktransportsolutions.com/carrier/missouri/adrian.php'
        response = requests.get(url)
        soup = BeautifulSoup(response.content, 'html.parser')
        for company in soup.find_all('div', {'class': 'well well-sm'}):
            link = company.find('a', itemprop='url').get('href').strip()
            name = company.find('span', itemprop='name').text.strip()
            address = company.find('span', itemprop='address').text.strip()
            print(name, link, address)
            print("----")
        page += 1

trade_spider(1)
Prints:
2 OLD BOYS TRUCKING /truckingcompany/missouri/2-old-boys-trucking-usdot-2474795.php 227 E 2ND
Adrian, MO 64720
----
HILLTOP SERVICE & EQUIPMENT /truckingcompany/missouri/hilltop-service-equipment-usdot-1047604.php ROUTE 2 BOX 453
Adrian, MO 64720
----
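On the follow-up question (feeding every state/city page back into trade_spider): the URLs can be built up front from a state-to-cities mapping and then looped over with a plain for. build_city_urls and the sample dictionary below are illustrative assumptions, not part of the site:

```python
BASE = 'http://www.quicktransportsolutions.com/carrier'

def build_city_urls(state_to_cities):
    """Expand {'alabama': ['abbeville.php', ...]} into full carrier page URLs."""
    return [f'{BASE}/{state}/{city}'
            for state, cities in sorted(state_to_cities.items())
            for city in cities]

urls = build_city_urls({'alabama': ['abbeville.php', 'adamsville.php']})
```

Each resulting URL can then be passed to a version of trade_spider that takes the page URL as a parameter instead of hard-coding it.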