TypeError: string indices must be integers when trying to print a href

I'm trying to scrape the detail from inside the 25 links of this site: https://beta.companieshouse.gov.uk/search/companies?q=SW181Db&page=1

'/company/08569390' is the value of an href attribute in the underlying HTML, so essentially I'm trying to concatenate the base_url ('https://beta.companieshouse.gov.uk/') and the text in the href so that my loop can traverse the 25 pages.

The code I have (below) gives me the message TypeError: string indices must be integers.

Would someone kindly explain where I'm going wrong here? Do I need to convert the contents of the href to an integer, even though it also contains some text (/company/)?

import requests
from bs4 import BeautifulSoup
import csv
base_url = 'https://beta.companieshouse.gov.uk/'

header={'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
'Accept-Encoding':'gzip, deflate, sdch, br',
'Accept-Language':'en-US,en;q=0.8,fr;q=0.6',
'Connection':'keep-alive',
'Cookie':'mdtp=y4Ts2Vvql5V9MMZNjqB9T+7S/vkQKPqjHHMIq5jk0J1l5l131dU0YXsq7Rr15GDyghKHrS/qcD2vdsMCVtzKByJEDZFI+roS6tN9FN5IS70q8PkCCBjgFPDZjlR1A3H9FJ/zCWXMNJbaXqF8MgqE+nhR3/lji+eK4mm/GP9b8oxlVdupo9KN9SKanxu/JFEyNXutjyN+BsxRztNem1Z+ExSQCojyxflI/tc70+bXAu3/ppdP7fIXixfEOAWezmOh3ywchn9DV7Af8wH45t8u4+Y=; mdtpdi=mdtpdi#f523cd04-e09e-48bc-9977-73f974d50cea#1484041095424_zXDAuNhEkKdpRUsfXt+/1g==; seen_cookie_message=yes; _ga=GA1.4.666959744.1484041122; _gat=1',
'Host':'https://beta.companieshouse.gov.uk/',
#'Referer':'https://beta.companieshouse.gov.uk/',
'Upgrade-Insecure-Requests':'1',
'User-Agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.51 Safari/537.36'
}

session = requests.session()
url = 'https://beta.companieshouse.gov.uk/search/companies?q=SW181Db&page=1'
response = session.get(url, headers=header)
soup = BeautifulSoup(response.content,"lxml")  

rslt_table = soup.find("article")

for elem in rslt_table:
    det_url = base_url+elem['href']
    print det_url

I played around with your code for a bit and ended up solving your problem. The changes I made are:

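links = []
# Each search result title is an <h3> that wraps the company link
headings = soup.find_all("h3")
for heading in headings:  # named 'heading' so it does not overwrite the 'header' dict above
    det_url = base_url + heading.find('a')['href']
    links.append(det_url)
    print det_url

print links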

The output I get is:

 ['https://beta.companieshouse.gov.uk//company/08569390', 'https://beta.companieshouse.gov.uk//company/09947251', 'https://beta.companieshouse.gov.uk//company/07352770', 'https://beta.companieshouse.gov.uk//company/07908180', 'https://beta.companieshouse.gov.uk//company/04576887', 'https://beta.companieshouse.gov.uk//company/08760943', 'https://beta.companieshouse.gov.uk//company/08265394', 'https://beta.companieshouse.gov.uk//company/03893510', 'https://beta.companieshouse.gov.uk//company/07422059', 'https://beta.companieshouse.gov.uk//company/08819027', 'https://beta.companieshouse.gov.uk//company/08325123', 'https://beta.companieshouse.gov.uk//company/09669365', 'https://beta.companieshouse.gov.uk//company/08641990', 'https://beta.companieshouse.gov.uk//company/06318392', 'https://beta.companieshouse.gov.uk//company/09400775', 'https://beta.companieshouse.gov.uk//company/01930797', 'https://beta.companieshouse.gov.uk//company/09398542', 'https://beta.companieshouse.gov.uk//company/07784981', 'https://beta.companieshouse.gov.uk//company/07480763', 'https://beta.companieshouse.gov.uk//company/06971238']
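The URLs above contain a double slash (`//company/...`) because base_url ends with `/` and each href starts with `/`. One way to avoid that, sketched here under the assumption that the hrefs are always site-relative paths like the ones shown, is to let the standard library join them instead of concatenating strings. The two hrefs are taken from the output above; in Python 2, which the code in this thread uses, the same function lives in the `urlparse` module rather than `urllib.parse`.

```python
from urllib.parse import urljoin

base_url = "https://beta.companieshouse.gov.uk/"

# Two of the hrefs from the output above, used here as sample input
hrefs = ["/company/08569390", "/company/09947251"]

# urljoin resolves each path against the base, so no doubled slash appears
links = [urljoin(base_url, href) for href in hrefs]

for link in links:
    print(link)
```

This also works unchanged if base_url is written without the trailing slash.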

soup.find("article") is not how you locate all those company tags; try using find_all instead:

base_url = 'https://beta.companieshouse.gov.uk'

companies = soup.find_all('a', {'title': 'View company'}) # to get all company <a> tags

for company in companies:
    det_url = base_url + company['href']  # use the loop variable 'company', not 'elem'
    print det_url

This line:

rslt_table = soup.find("article")

returns you one article element. When you do this:

for elem in rslt_table:

you're iterating over the direct children of the article element, and those children include plain text nodes (such as the whitespace between tags) as well as tags. Thus elem is a string and cannot be indexed by another string, as you're trying to do with elem["href"]. What you want to do is get the a elements, not the strings, inside rslt_table:

for elem in rslt_table.find_all("a"):

Changing this line will give you what you want.
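To make the difference concrete, here is a small self-contained sketch. It assumes a simplified version of the page: the HTML snippet and the company names in it are made up for illustration (only the two company numbers come from this thread), and it uses the built-in `html.parser` instead of `lxml` so that it runs without extra dependencies beyond Beautiful Soup.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for the real search results page
html = """
<article>
  <h3><a href="/company/08569390" title="View company">EXAMPLE ONE LTD</a></h3>
  <h3><a href="/company/09947251" title="View company">EXAMPLE TWO LTD</a></h3>
</article>
"""

soup = BeautifulSoup(html, "html.parser")
rslt_table = soup.find("article")

# Iterating over a tag walks its direct children: text nodes and tags alike
child_types = [type(child).__name__ for child in rslt_table]

# The first child is the whitespace before the first <h3>, i.e. a string
first_child = next(iter(rslt_table))
try:
    first_child["href"]
    error_name = None
except TypeError as exc:
    error_name = type(exc).__name__

# find_all("a") returns only <a> tags, which do support ["href"]
hrefs = [a["href"] for a in rslt_table.find_all("a")]

print(child_types)
print(error_name)
print(hrefs)
```

Note that even the tag children here are h3 elements without an href of their own, so searching for the a elements is needed either way.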

If you're looking for companies in particular postcodes, you may prefer to download this dataset rather than scraping: http://download.companieshouse.gov.uk/en_output.html

Companies House also offer an API which you might find useful: https://developer.companieshouse.gov.uk/api/docs/

