
Fetching information from the different links on a web page and writing it to a .xls file using pandas and bs4 in Python

I am a beginner at Python programming. I am practicing web scraping using the bs4 module in Python.

I have extracted some fields from a web page, but it extracts only 13 items even though the page has more than 13. I cannot understand why the rest of the items are not extracted.

Another thing: I want to extract the contact number and email address of each item on the web page, but they are only available on each item's individual page. I am a beginner and, frankly speaking, I am stuck on how to access and scrape each item's individual page from the listing page. Kindly tell me where I am going wrong and, if possible, suggest what should be done.

import requests
from bs4 import BeautifulSoup as bs
import pandas as pd


res = requests.post('https://www.nelsonalexander.com.au/real-estate-agents/?office=&agent=A')
soup = bs(res.content, 'lxml')

data = soup.find_all("div",{"class":"agent-card large large-3 medium-4 small-12 columns text-center end"})

records = []

for item in data:
    name = item.find('h2').text.strip()
    position = item.find('h3').text.strip()
    records.append({'Names': name, 'Position': position})

df = pd.DataFrame(records, columns=['Names', 'Position'])
df = df.drop_duplicates()
df.to_excel(r'C:\Users\laptop\Desktop\NelsonAlexander.xls', sheet_name='MyData2', index=False, header=True)

The above code extracts just the name and position of each item, but it only scrapes 13 records even though there are more on the web page. I could not write any code for extracting the contact number and email address of each record, because they sit inside each item's individual page and that is where I got stuck.
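As a general pattern for the "follow each item's link" part of the question, a listing card can be followed to its detail page and the contact links read there. This is only a sketch: the `tel:`/`mailto:` anchor convention and the card's link structure are assumptions, not verified against this particular site.

```python
from bs4 import BeautifulSoup as bs

BASE = 'https://www.nelsonalexander.com.au'

def extract_contacts(detail_html):
    """Pull phone/email out of a detail page.

    Relies on the common tel:/mailto: anchor convention; whether this site
    actually exposes an email this way is an assumption, not verified.
    """
    detail = bs(detail_html, 'html.parser')
    phone = detail.select_one('a[href^="tel:"]')
    email = detail.select_one('a[href^="mailto:"]')
    return {
        'Phone': phone['href'].replace('tel:', '', 1) if phone else None,
        'Email': email['href'].replace('mailto:', '', 1) if email else None,
    }

def follow_card(card, session):
    """Given one item card from the listing soup, fetch and parse its
    detail page. `session` is e.g. a requests.Session."""
    link = card.find('a', href=True)
    if link is None:
        return {'Phone': None, 'Email': None}
    url = link['href']
    if url.startswith('/'):            # make relative URLs absolute
        url = BASE + url
    return extract_contacts(session.get(url).content)
```

`follow_card` would be called once per `item` inside the existing loop, which also means one extra HTTP request per record.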

The Excel sheet looks like this:

[screenshot of the Excel sheet]

That website loads the list dynamically as you scroll; however, you can trace the AJAX request and parse the data directly:

import requests
from bs4 import BeautifulSoup as bs
import pandas as pd

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0', 'Referer': 'https://www.nelsonalexander.com.au/real-estate-agents/?office=&agent=A'}
records = []

with requests.Session() as s:

    # the listing is split across 21 pages (at the time of writing)
    for i in range(1, 22):

        res = s.get(f'https://www.nelsonalexander.com.au/real-estate-agents/page/{i}/?ajax=1&agent=A', headers=headers)
        soup = bs(res.content, 'lxml')

        data = soup.find_all("div", {"class": "agent-card large large-3 medium-4 small-12 columns text-center end"})

        for item in data:
            name = item.find('h2').text.strip()
            position = item.find('h3').text.strip()
            # the phone number is stored in the card's tel: link
            phone = item.find("div", {"class": "small-6 columns text-left"}).find("a").get('href').replace("tel:", "")
            records.append({'Names': name, 'Position': position, 'Phone': phone})

df = pd.DataFrame(records, columns=['Names', 'Position', 'Phone'])
df = df.drop_duplicates()
df.to_excel(r'C:\Users\laptop\Desktop\NelsonAlexander.xls', sheet_name='MyData2', index=False, header=True)
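One caveat with the final line: recent pandas versions removed the `xlwt` engine that wrote legacy `.xls` files, so `to_excel` with an `.xls` path may fail depending on your pandas version. Writing `.xlsx` through `openpyxl` works on current pandas; a minimal sketch with a placeholder record:

```python
import pandas as pd

# placeholder record standing in for the scraped data
records = [{'Names': 'Jane Example', 'Position': 'Agent', 'Phone': '0312345678'}]

df = pd.DataFrame(records, columns=['Names', 'Position', 'Phone']).drop_duplicates()
# .xlsx is written through openpyxl, which current pandas supports
df.to_excel('NelsonAlexander.xlsx', sheet_name='MyData2', index=False)
```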

I am convinced that the emails are nowhere in the DOM. I made some modifications to @drec4s's code to instead keep going until there are no entries (dynamically).

import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
import itertools

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0', 'Referer': 'https://www.nelsonalexander.com.au/real-estate-agents/?office=&agent=A'}
records = []

with requests.Session() as s:

    for i in itertools.count(1):  # pages are numbered from 1; bare count() would start at 0

        res = s.get('https://www.nelsonalexander.com.au/real-estate-agents/page/{}/?ajax=1&agent=A'.format(i), headers=headers)
        soup = bs(res.content, 'lxml')

        data = soup.find_all("div", {"class": "agent-card large large-3 medium-4 small-12 columns text-center end"})
        if data:
            for item in data:
                name = item.find('h2').text.strip()
                position = item.find('h3').text.strip()
                phone = item.find("div", {"class": "small-6 columns text-left"}).find("a").get('href').replace("tel:", "")
                records.append({'Names': name, 'Position': position, 'Phone': phone})
                print({'Names': name, 'Position': position, 'Phone': phone})
        else:
            # an empty page means we have run past the last page of results
            break
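
The stop-when-empty idea can be isolated with a toy page source; `PAGES` and `fetch_page` below are placeholders standing in for the real AJAX request, not part of the scraper itself.

```python
import itertools

# Toy paged source standing in for the AJAX endpoint: three pages of
# results, then empty pages forever after.
PAGES = {1: ['a', 'b'], 2: ['c'], 3: ['d', 'e']}

def fetch_page(i):
    return PAGES.get(i, [])

records = []
for i in itertools.count(1):     # pages are numbered from 1
    data = fetch_page(i)
    if not data:                 # an empty page marks the end of the listing
        break
    records.extend(data)

print(records)
```

This avoids hard-coding the page count (`range(1, 22)`), so the scraper keeps working if the site adds or removes pages.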
