
Writing items extracted from a website to an .xls sheet with lists of different lengths using the Pandas module in Python

I am a beginner in Python programming and am practicing scraping different values from websites. I have extracted the items from a particular website and now want to write them to an .xls file.

The whole web page has 714 records, including duplicates, but the Excel sheet shows only 707 records because zip() stops as soon as the smallest list is exhausted. Here the smallest list is the email list, so it runs out first and the iteration stops, which is just how zip() works. I have even added a check inside an if condition for records that have no email address, so that "No email address" is displayed instead, but the result is still the same, 704 rows with duplicate records. Kindly tell me where I am going wrong and, if possible, suggest how to remove the duplicate records and display "No email address" wherever there is no email.
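For illustration, on a couple of toy lists (not the scraped data), zip() truncates to the shortest list, while itertools.zip_longest pads the shorter one with a fill value:

from itertools import zip_longest

names  = ['A', 'B', 'C']
emails = ['a@x.com']          # shorter list, e.g. some agents have no email

print(list(zip(names, emails)))
# [('A', 'a@x.com')]  -> the rows for 'B' and 'C' are silently dropped

print(list(zip_longest(names, emails, fillvalue='No Email address')))
# [('A', 'a@x.com'), ('B', 'No Email address'), ('C', 'No Email address')]

Note that padding only fixes the row count; if the email list is shorter because some agents have no email, the remaining emails can still line up against the wrong names.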

import requests
from bs4 import BeautifulSoup as bs
import pandas as pd

res = requests.get('https://www.raywhite.com/contact/?type=People&target=people&suburb=Sydney%2C+NSW+2000&radius=50%27%27&firstname=&lastname=&_so=contact', headers = {'User-agent': 'Super Bot 9000'})
soup = bs(res.content, 'lxml')

names=[]
positions=[]
phone=[]
emails=[]
links=[l1['href'] for l1 in soup.select('.agent-name a')]

nlist = soup.find_all('li', class_='agent-name')
plist= soup.find_all('li',class_='agent-role')
phlist = soup.find_all('li', class_='agent-officenum')
elist = soup.find_all('a',class_='val withicon')

for n1 in nlist:
    names.append(n1.text)
for p1 in plist:
    positions.append(p1.text)
for ph1 in phlist:
    phone.append(ph1.text)
for e1 in elist:
    emails.append(e1.get('href') if e1.get('href') is not None else 'No Email address')


df = pd.DataFrame(list(zip(names,positions,phone,emails,links)),columns=['Names','Position','Phone','Email','Link'])
df.to_excel(r'C:\Users\laptop\Desktop\RayWhite.xls', sheet_name='MyData2', index = False, header=True)

The Excel sheet looks like this, where we can see that the last record's name and its email address do not match:

Ray White Excel Sheet


It looks like you are doing many find_all calls and then stitching the results together. My advice would be to do one find_all and then iterate through that. It makes it a lot easier to build out the columns of your DataFrame when all your data is in one place.

I have updated the code below to extract the links without error. With any code there are a number of ways to perform the same task. This one may not be the most elegant, but it gets the job done.

import requests
from bs4 import BeautifulSoup
import pandas as pd

r    = requests.get('https://www.raywhite.com/contact/?type=People&target=people&suburb=Sydney%2C+NSW+2000&radius=50%27%27&firstname=&lastname=&_so=contact', headers = {'User-agent': 'Super Bot 9000'})
soup = BeautifulSoup(r.text, 'html.parser')

# one find_all on the card that wraps each agent, so every field comes
# from the same card and the columns always stay aligned
get_cards = soup.find_all("div", {"class": "card horizontal-split vcard"})

agent_list = []

for item in get_cards:
    name      = item.find('li', class_='agent-name').text
    position  = item.find('li', class_='agent-role').text
    phone     = item.find('li', class_='agent-officenum').text
    link      = item.find('li', class_='agent-name').a['href']

    try:
        email = item.find('a', class_='val withicon')['href'].replace('mailto:', '')
    except TypeError:
        # find() returned None because this card has no email link
        email = 'No Email address'
    agent_list.append({'name': name, 'position': position, 'phone': phone, 'email': email, 'link': link})

df = pd.DataFrame(agent_list)

Above is some sample code I have put together to create the DataFrame. The key here is to do a single find_all on the "card horizontal-split vcard" class.
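If you still end up with duplicate rows, a minimal follow-up sketch (reusing the output path from your question; the drop assumes the duplicates are exact copies across all columns) would be:

# drop rows that are exact duplicates across all columns, then write the sheet
df = df.drop_duplicates()
df.to_excel(r'C:\Users\laptop\Desktop\RayWhite.xls', sheet_name='MyData2', index=False, header=True)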

Hope that has been some help.

Cheers, Adam
