Python Web Scraping Using Selenium

Website to scrape: https://idaman.kpkt.gov.my/idv5xe/98_eHome/maklumatProjek.cfm?pmju_kod=8898&proj_kod_Fasa=1

Items to scrape are in bold - Part 1 (HTML below)

 <form onsubmit="return lucee_form_c9u.check();" name="myForm" enctype="multipart/form-data" action="mPPTProjek3.cfm?mn=BPPT" method="post"> <div align="center" style="background-color: white; border: 1px solid grey;"> <br /> <table class="MainContent" width="100%" align="center"> <tbody> <tr style="font-weight: bold;"> <td class="column" width="30%">Nama Pemaju</td> <td>: <a style="color: blue;" href="maklumatPemaju.cfm?pmju_Kod=8877">**RAPID UNITY SDN. BHD.**</a> <font color="red">* Klik Untuk Melihat Maklumat</font> </td> </tr> <tr> <td class="column">Kod Pemaju</td> <td>: **8877**</td> </tr> <tr> <td class="column">Kod Fasa</td> <td>: **1**</td> </tr> <tr> <td class="column">Nama Pemajuan</td> <td>: **TAMAN UNITY**</td> </tr> </tbody> </table> </div> </form>

Items to scrape are in bold - Part 2


This part needs the Selenium driver to click on the row

  1. <tr align="center" onclick="change3('15536',this)" style="cursor:pointer" bgcolor="DAEEF3" onmouseover="this.bgColor='#FF9900'" onmouseout="this.bgColor='DAEEF3'">

Only then does the 95% after "name:myForm" appear.

  1. <tr align="center" onclick="change3('15536',this)" style="cursor:pointer" bgcolor="DAEEF3" onmouseover="this.bgColor='#FF9900'" onmouseout="this.bgColor='DAEEF3'">

After clicking this row, the 95% changes to another amount.

(HTML Below)

 <fieldset title="Maklumat Pemajuan Projek" style="border: 1px solid grey; font-weight: bold; color: black;"> <legend>Maklumat Pemajuan Projek</legend> <table class="MainContent" width="100%" align="center"> <thead> <tr class="column"> <th>Bil</th> <th>Bil Unit</th> <th> Jenis<br /> Rumah </th> <th> Kategori<br /> Rumah </th> <th>Tingkat</th> <th> Harga<br /> Min (RM) </th> <th> Harga<br /> Max (RM) </th> </tr> </thead> <tbody> <tr align="center" onclick="change2('15535',this)" style="cursor: pointer;" bgcolor="white" onmouseover="this.bgColor='#FF9900'" onmouseout="this.bgColor='white'"> <td>**1**</td> <td>**2**</td> <td align="left"> **RUMAH BERKEMBAR** </td> <td> **HARGA TINGGI** </td> <td>**1**</td> <td align="right">**370,000.00**</td> <td align="right">**394,900.00**</td> </tr> <tr align="center" onclick="change3('15536',this)" style="cursor: pointer;" bgcolor="DAEEF3" onmouseover="this.bgColor='#FF9900'" onmouseout="this.bgColor='DAEEF3'"> <td>**2**</td> <td>**18**</td> <td align="left"> **RUMAH TERES** </td> <td> **HARGA TINGGI** </td> <td>**1**</td> <td align="right">**190,000.00**</td> <td align="right">**290,550.00**</td> </tr> </tbody> </table> <br /> <input name="rekid3" id="rekid3" type="hidden" value="15535" /> <div id="pemajuan"> <script language="JavaScript" type="text/javascript" src="/lucee/formtag-form.cfm"></script> <script language="JavaScript" type="text/javascript"> function _CF_checkmyForm() { return lucee_form_czz.check(); } </script> <table class="MainContent" width="100%" align="center"> <tbody> <tr> <td class="column" width="30%">Jenis Rumah</td> <td>: RUMAH BERKEMBAR</td> </tr> <tr> <td class="column">Kategori Rumah</td> <td>: HARGA TINGGI</td> </tr> <tr> <td class="column">Bil Tingkat</td> <td>: 1</td> </tr> <tr> <td class="column">Bil Unit</td> <td>: 2</td> </tr> <tr> <td class="column">Harga Minimum</td> <td>: 370,000.00</td> </tr> <tr> <td class="column">Harga Maximum</td> <td>: 394,900.00</td> </tr> <tr> <td 
class="column">Peratusan Pemajuan</td> <td>: **95%**</td> </tr> </tbody> </table> <!-- name:myForm --> <script> lucee_form_czz = new LuceeForms("myForm", null); </script> </div> </fieldset>

Below is my code. Believe me, it is all I have managed to write after weeks. Please help, as I do not know how to:

  1. Scrape part 1 then
  2. Click the 1st link in part 2 then
  3. Scrape part 2 then
  4. Click the 2nd link in part 2 then
  5. Scrape part 2 and append.

from selenium import webdriver
from selenium.webdriver.common.by import By
    
url = "https://idaman.kpkt.gov.my/idv5xe/98_eHome/maklumatProjek.cfm?pmju_kod=8898&proj_kod_Fasa=1"
    
driver = webdriver.Chrome(executable_path='/Users/freddielee/Downloads/chromedriver')

driver.find_element(By.NAME="need help here")

I think what you want can be obtained using just requests and BeautifulSoup, as follows:

import requests
from bs4 import BeautifulSoup

# One session for both requests, so any cookies from the first carry over to the second
s = requests.Session()

params = {"pmju_Kod": 8877, "proj_Kod_Fasa": 1}
r = s.get("https://idaman.kpkt.gov.my/idv5xe/98_eHome/maklumatProjek.cfm", params=params)
soup = BeautifulSoup(r.content, "html.parser")

tables = soup.find_all('table', class_="MainContent")

items = []

# Part 1: the developer name is the link text in the first table
items.append(tables[0].a.text)

# Remaining rows of the first table; each value cell starts with ": "
data = [[td.text for td in tr.find_all('td')] for tr in tables[0].find_all('tr')]
items.append(data[1][1].strip(': '))
items.append(data[2][1].strip(': '))
items.append(data[3][1].strip(': '))

# Part 2: the unit table; data[0] is empty because the header row only has <th> cells
data = [[td.text for td in tr.find_all('td')] for tr in tables[3].find_all('tr')]

# First unit row: house type, category, storeys, min price, max price
items.append(data[1][2].strip())
items.append(data[1][3].strip())
items.append(data[1][4])
items.append(data[1][5])
items.append(data[1][6])

# Second unit row, same columns
items.append(data[2][2].strip())
items.append(data[2][3].strip())
items.append(data[2][4])
items.append(data[2][5])
items.append(data[2][6])

# Pemajuan table
params['rekid'] = 419975503
r2 = s.get('https://idaman.kpkt.gov.my/idv5xe/98_eHome/template/pemajuan.cfm', params=params)
soup2 = BeautifulSoup(r2.content, "html.parser")
table = soup2.find('table', class_="MainContent")
data = [[td.text for td in tr.find_all('td')] for tr in table.find_all('tr')]

# The last row holds "Peratusan Pemajuan" (the progress percentage)
items.append(data[-1][1].strip(': '))

print(items)

This would give you the following items:

['RAPID UNITY SDN. BHD.', '8877', '1', 'TAMAN UNITY', 'RUMAH BERKEMBAR', 'HARGA TINGGI', '1', '370,000.00', '394,900.00', 'RUMAH TERES', 'HARGA TINGGI', '1', '190,000.00', '290,550.00', '0%']
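
To repeat the last step for every row rather than one hard-coded rekid, the ids embedded in each row's onclick handler (change2('15535',this), change3('15536',this) in the question's HTML) could be pulled out first. The following is only a rough sketch: it uses the standard library's html.parser instead of BeautifulSoup so that it runs without extra packages, and the names ONCLICK_ID, RowIdParser and extract_row_ids are made up for illustration. The sample string is cut down from the HTML in the question.

```python
import re
from html.parser import HTMLParser

# Matches onclick handlers such as change2('15535',this) and captures the id.
ONCLICK_ID = re.compile(r"change\d+\('(\d+)'")


class RowIdParser(HTMLParser):
    """Collect the numeric id from every <tr onclick="changeN('id',this)">."""

    def __init__(self):
        super().__init__()
        self.row_ids = []

    def handle_starttag(self, tag, attrs):
        if tag != "tr":
            return
        # attrs is a list of (name, value) pairs; rows without onclick are skipped
        onclick = dict(attrs).get("onclick") or ""
        match = ONCLICK_ID.search(onclick)
        if match:
            self.row_ids.append(match.group(1))


def extract_row_ids(html):
    """Return the ids of all clickable rows, in document order."""
    parser = RowIdParser()
    parser.feed(html)
    return parser.row_ids


# Cut-down sample based on the rows shown in the question
sample = """
<tr class="column"><th>Bil</th></tr>
<tr align="center" onclick="change2('15535',this)" bgcolor="white"><td>1</td></tr>
<tr align="center" onclick="change3('15536',this)" bgcolor="DAEEF3"><td>2</td></tr>
"""

print(extract_row_ids(sample))  # ['15535', '15536']
```

Each id could then be tried as the rekid parameter in the pemajuan.cfm request above, appending one percentage per row. That the endpoint accepts these ids is an assumption I have not verified; it is suggested only by the hidden rekid3 input in the question's HTML holding 15535, the same id as the first row.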
