How to loop click with selenium and scrape each table with bs4?

I'm trying to scrape some hidden tables (15 per page) that expand when an arrow is clicked. (Screenshots attached: unexpanded tables, expanded tables.)

I'm attaching the HTML, too (sorry, it's a bit long):

<table class="footable table toggle-arrow-tiny default breakpoint footable-loaded" transparenturl="Images/arrow_none.gif" ascendingurl="Images/arrow_up.gif" customsortdirection="Ascending" custompageindex="0" customsortfield="fullname" custompagealphaindex="A" custompagemode="ABC" custompagealpharelative="A" descendingurl="Images/arrow_down.gif" customvirtualcount="1605" id="MainContent_gw_partners" style="border-collapse:collapse;" cellspacing="0">
    <thead>
        <tr>
            <th data-toggle="true" scope="col" class="footable-visible footable-first-column"> &nbsp;&nbsp;</th><th data-ignore="true" data-hide="phone, tablet" scope="col" class="footable-visible"> &nbsp;&nbsp;</th><th data-ignore="true" data-hide="phone, tablet" scope="col" class="footable-visible">Titolo&nbsp;&nbsp;</th><th scope="col" class="footable-visible">Cognome&nbsp;&nbsp;</th><th data-ignore="true" data-hide="phone, tablet" scope="col" class="footable-visible">NPA&nbsp;&nbsp;</th><th data-ignore="true" data-hide="phone" scope="col" class="footable-visible">Luogo&nbsp;&nbsp;</th><th data-ignore="true" data-hide="phone" scope="col" class="footable-visible footable-last-column">Cantone&nbsp;&nbsp;</th><th data-hide="all" scope="col" style="display: none;">Discipline(s) thérapeutique(s)&nbsp;&nbsp;</th><th data-hide="all" scope="col" style="display: none;">Società&nbsp;&nbsp;</th><th data-hide="all" scope="col" style="display: none;">Cognome&nbsp;&nbsp;</th><th data-hide="all" scope="col" style="display: none;">C/O&nbsp;&nbsp;</th><th data-hide="all" scope="col" style="display: none;">Via&nbsp;&nbsp;</th><th data-hide="all" scope="col" style="display: none;">NPA&nbsp;&nbsp;</th><th data-hide="all" scope="col" style="display: none;">Luogo&nbsp;&nbsp;</th><th data-hide="all" scope="col" style="display: none;">Tel / Cellulare&nbsp;&nbsp;</th><th data-hide="all" scope="col" style="display: none;">Cellulare  &nbsp;&nbsp;</th><th data-hide="all" scope="col" style="display: none;">Fax&nbsp;&nbsp;</th><th data-hide="all" scope="col" style="display: none;">e-mail&nbsp;&nbsp;</th><th data-hide="all" scope="col" style="display: none;">Sito WEB&nbsp;&nbsp;</th><th data-hide="all" scope="col" style="display: none;">Altri luoghi di lavoro&nbsp;&nbsp;</th><th data-hide="all" scope="col" style="display: none;">Discipline(s) thérapeutique(s)&nbsp;&nbsp;</th>
        </tr>
    </thead><tbody>
        <tr class="row_white footable-detail-show">
            <td class="footable-visible footable-first-column"><span class="footable-toggle"></span>&nbsp;</td><td class="footable-visible">

                    </td><td class="footable-visible">&nbsp;</td><td class="footable-visible">

                        ABBONDANZIERI Katia
                    </td><td class="footable-visible">
                        1204
                        <br>

                    </td><td class="footable-visible">
                        Genève
                        <br>

                    </td><td class="footable-visible footable-last-column">
                        GE
                        <br>

                    </td><td style="display: none;">
                        197.&nbsp;Omeopatia, 202.&nbsp;Linfodrenaggio&nbsp;manuale, 205.&nbsp;Massaggio&nbsp;classico, 664.&nbsp;Riflessoterapia&nbsp;generale
                    </td><td style="display: none;">

                    </td><td style="display: none;">
                        ABBONDANZIERI Katia
                    </td><td style="display: none;">


                    </td><td style="display: none;">
                        Place du Cirque, 2
                    </td><td style="display: none;">
                        1204
                    </td><td style="display: none;">
                        Genève
                    </td><td style="display: none;">
                        022 328 23 44 
                    </td><td style="display: none;">
                        079 601 92 75 
                    </td><td style="display: none;">

                    </td><td style="display: none;">

                    </td><td style="display: none;">

                    </td><td style="display: none;">

                    </td><td style="display: none;">
                        <div class="thZone"><div class="zCat">METHODES DE MASSAGE</div><div class="zThr">Linfodrenaggio manuale</div><div class="zThr">Massaggio classico</div><div class="zCat">METHODES PRESCRIPTIVES</div><div class="zThr">Omeopatia</div><div class="zCat">METHODES REFLEXES</div><div class="zThr">Riflessoterapia generale</div></div>
                    </td>
        </tr><tr class="footable-row-detail" style="display: table-row;"><td class="footable-row-detail-cell" colspan="7"><div class="footable-row-detail-inner"><div class="footable-row-detail-row"><div class="footable-row-detail-name">Discipline(s) thérapeutique(s):</div><div class="footable-row-detail-value">197.&nbsp;Omeopatia, 202.&nbsp;Linfodrenaggio&nbsp;manuale, 205.&nbsp;Massaggio&nbsp;classico, 664.&nbsp;Riflessoterapia&nbsp;generale</div></div><div class="footable-row-detail-row"><div class="footable-row-detail-name">Cognome:</div><div class="footable-row-detail-value">ABBONDANZIERI Katia</div></div><div class="footable-row-detail-row"><div class="footable-row-detail-name">Via:</div><div class="footable-row-detail-value">Place du Cirque, 2</div></div><div class="footable-row-detail-row"><div class="footable-row-detail-name">NPA:</div><div class="footable-row-detail-value">1204</div></div><div class="footable-row-detail-row"><div class="footable-row-detail-name">Luogo:</div><div class="footable-row-detail-value">Genève</div></div><div class="footable-row-detail-row"><div class="footable-row-detail-name">Tel / Cellulare:</div><div class="footable-row-detail-value">022 328 23 44</div></div><div class="footable-row-detail-row"><div class="footable-row-detail-name">Cellulare:</div><div class="footable-row-detail-value">079 601 92 75</div></div><div class="footable-row-detail-row"><div class="footable-row-detail-name">Discipline(s) thérapeutique(s):</div><div class="footable-row-detail-value"><div class="thZone"><div class="zCat">METHODES DE MASSAGE</div><div class="zThr">Linfodrenaggio manuale</div><div class="zThr">Massaggio classico</div><div class="zCat">METHODES PRESCRIPTIVES</div><div class="zThr">Omeopatia</div><div class="zCat">METHODES REFLEXES</div><div class="zThr">Riflessoterapia generale</div></div></div></div></div></td></tr><tr class="row_grey footable-detail-show">
            <td class="footable-visible footable-first-column"><span class="footable-toggle"></span>&nbsp;</td><td class="footable-visible">

                            <a href="http://www.kinesiopourtous.ch" target="_blank">
                                <img title="Link internet" alt="" style="MARGIN-RIGHT: 7px" src="Images/pictoSiteInternet.jpg" width="12" height="12" border="0">
                            </a>

                    </td><td class="footable-visible">&nbsp;</td><td class="footable-visible">
                        <img id="MainContent_gw_partners_img1_1" src="Images/multi.gif">
                        ABEGG Sophie
                    </td><td class="footable-visible">
                        1212
                        <br>
                        1875<br>
                    </td><td class="footable-visible">
                        Grand-Lancy
                        <br>
                        <nobr>Morgins</nobr><nobr><br>
                    </nobr></td><td class="footable-visible footable-last-column">
                        GE
                        <br>
                        VS<br>
                    </td><td style="display: none;">
                        199.&nbsp;Kinesiologia
                    </td><td style="display: none;">
                        Kinéso pour tous
                    </td><td style="display: none;">
                        ABEGG Sophie
                    </td><td style="display: none;">


                    </td><td style="display: none;">
                        Rue du Bachet 8
                    </td><td style="display: none;">
                        1212
                    </td><td style="display: none;">
                        Grand-Lancy
                    </td><td style="display: none;">

                    </td><td style="display: none;">
                        076 365 63 86
                    </td><td style="display: none;">

                    </td><td style="display: none;">

                            <a href="mailto:sophie@kinesiopourtous.ch">sophie[at]kinesiopourtous.ch
                            </a>

                    </td><td style="display: none;">

                            <a href="http://www.kinesiopourtous.ch" target="_blank">
                                www.kinesiopourtous.ch
                            </a>

                    </td><td style="display: none;">
                        Résidence Bellevue, Rte de France 22, 1875 Morgins, CH<br>
                    </td><td style="display: none;">
                        <div class="thZone"><div class="zCat">METHODES ENERGETIQUES MANUELLES</div><div class="zThr">Kinesiologia</div></div>
                    </td>
        </tr><tr class="footable-row-detail"><td class="footable-row-detail-cell" colspan="7"><div class="footable-row-detail-inner"><div class="footable-row-detail-row"><div class="footable-row-detail-name">Discipline(s) thérapeutique(s):</div><div class="footable-row-detail-value">199.&nbsp;Kinesiologia</div></div><div class="footable-row-detail-row"><div class="footable-row-detail-name">Società:</div><div class="footable-row-detail-value">Kinéso pour tous</div></div><div class="footable-row-detail-row"><div class="footable-row-detail-name">Cognome:</div><div class="footable-row-detail-value">ABEGG Sophie</div></div><div class="footable-row-detail-row"><div class="footable-row-detail-name">Via:</div><div class="footable-row-detail-value">Rue du Bachet 8</div></div><div class="footable-row-detail-row"><div class="footable-row-detail-name">NPA:</div><div class="footable-row-detail-value">1212</div></div><div class="footable-row-detail-row"><div class="footable-row-detail-name">Luogo:</div><div class="footable-row-detail-value">Grand-Lancy</div></div><div class="footable-row-detail-row"><div class="footable-row-detail-name">Cellulare:</div><div class="footable-row-detail-value">076 365 63 86</div></div><div class="footable-row-detail-row"><div class="footable-row-detail-name">e-mail:</div><div class="footable-row-detail-value"><a href="mailto:sophie@kinesiopourtous.ch">sophie[at]kinesiopourtous.ch
                            </a></div></div><div class="footable-row-detail-row"><div class="footable-row-detail-name">Sito WEB:</div><div class="footable-row-detail-value"><a href="http://www.kinesiopourtous.ch" target="_blank">
                                www.kinesiopourtous.ch
                            </a></div></div><div class="footable-row-detail-row"><div class="footable-row-detail-name">Altri luoghi di lavoro:</div><div class="footable-row-detail-value">Résidence Bellevue, Rte de France 22, 1875 Morgins, CH<br></div></div><div class="footable-row-detail-row"><div class="footable-row-detail-name">Discipline(s) thérapeutique(s):</div><div class="footable-row-detail-value"><div class="thZone"><div class="zCat">METHODES ENERGETIQUES MANUELLES</div><div class="zThr">Kinesiologia</div></div></div></div></div></td></tr><tr class="row_white">
            <td class="footable-visible footable-first-column"><span class="footable-toggle"></span>&nbsp;</td><td class="footable-visible">

So I'm using Selenium to click and BeautifulSoup 4 to scrape tables.

I would like to create a loop that clicks each arrow (15 arrows per page) and scrapes the data from each table (13 rows per table; if a value is missing, the cell should be blank in the exported Excel file).

Any help, please?

If you inspect the network traffic, you can see the request method is POST, so I used a different approach: sending the request directly instead of clicking through the page.

If you'd prefer to stick with Selenium, just let me know and I can try to work that out too.

You will need to grab the Form Data and copy it into the payload dictionary. I did not include the whole thing because it is too long, but I included a snippet of it in the code so you can see the format.
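Copying the form data by hand is error-prone, since ASP.NET WebForms pages usually carry long hidden fields (such as `__VIEWSTATE`) that change between sessions. Below is a minimal sketch of reading them from the page instead, assuming the fields are ordinary named `<input>` elements. `collect_form_fields` is a hypothetical helper, not part of requests or bs4, and the sample HTML is made up for illustration.

```python
from bs4 import BeautifulSoup


def collect_form_fields(html):
    """Return {name: value} for every named <input> found in the page."""
    soup = BeautifulSoup(html, "html.parser")
    fields = {}
    for tag in soup.find_all("input"):
        name = tag.get("name")
        if name:  # inputs without a name are never submitted
            fields[name] = tag.get("value", "")
    return fields


# Made-up sample standing in for the real page source
sample_html = """
<form>
  <input type="hidden" name="__VIEWSTATE" value="abc123">
  <input type="hidden" name="__EVENTVALIDATION" value="xyz789">
  <input type="text" name="ctl00$MainContent$ddl_cantons">
  <input type="button" id="no_name">
</form>
"""

payload = collect_form_fields(sample_html)
payload["ctl00$MainContent$ddl_cantons"] = "GE"  # override with the search value
print(payload)
```

Against the real site, the HTML would come from a first GET request made with the same `requests.Session` that later sends the POST. Fields that the page's JavaScript adds at submit time would still have to be set by hand.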


Then I just used pandas to grab the table with the data.

import requests
import bs4
import pandas as pd


url = 'http://www.asca.ch/Partners.aspx?lang=it'
headers = {'Accept': '*/*',
'Accept-Encoding': 'gzip, deflate',
'Accept-Language': 'en-US,en;q=0.9',
'Cache-Control': 'no-cache',
'Connection': 'keep-alive',
'Content-Length': '55755',
'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
'Cookie': '_ga=GA1.2.1140629371.1547917375; _gid=GA1.2.1588639047.1547917375; ASP.NET_SessionId=fmxjh5jxwuq10awmqch1ztjz; __AntiXsrfToken=1d9c575ab1494ab29d2e796e2853eaac; _gat=1',
'Host': 'www.asca.ch',
'Origin': 'http://www.asca.ch',
'Referer': 'http://www.asca.ch/Partners.aspx?lang=it',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36',
'X-MicrosoftAjax': 'Delta=true',
'X-Requested-With': 'XMLHttpRequest'}


payload = {
'ctl00$RadScriptManagerMaster': 'ctl00$RadScriptManagerMaster|ctl00$MainContent$btn_submit',
'RadStyleSheetManager1_TSSM': ';|636398747139118389:c7e0c438;|636304438089400012:39e38b4c;|636304438089880540:19119943;|636304438090200892:b81c9af7;|636304438090180870:bb009068;|636304438089390001:e78ed9b3;|636325253237635520:dedafabf;|636304438089530155:5961cfc1;|636304438090290991:d08fa23c;|636304438089530155:7fafd27a',
'RadScriptManagerMaster_TSM': ';;System.Web.Extensions, Version=4.0.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35:en-US:af7dd01d-1544-48f6-a85d-1285ae370050:ea597d4b:b25378d2;||:460a097d:7a38c288:ace9a216;Telerik.Web.UI, Version=2014.1.403.40, Culture=neutral, PublicKeyToken=121fae78165ba3d4:en-US:ca584452-327f-4858-bf00-fb22c6f6fd75:16e4e7cd:ed16cbdc:f7645509:24ee1bba:f46195d3:2003d0b8:88144a7a:1e771326:aa288e2d:258f1c72:7165f74;',
'ctl00$MainContent$ddl_partners':'' ,
'ctl00_MainContent_ddl_partners_ClientState':'' ,
'ctl00$MainContent$ddl_countries': 'Suisse',
'ctl00_MainContent_ddl_countries_ClientState': '',
'ctl00$MainContent$ddl_cantons': 'GE',

...
...

'__ASYNCPOST': 'true',
'RadAJAXControlID': 'ctl00_MainContent_RadAjaxManager1'
}


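# Send the form data as a POST request, as the browser does
r = requests.post(url, headers=headers, data=payload)

# pandas parses every <table> in the response; the first one holds the results
tables = pd.read_html(r.text)
data = tables[0]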

Output:

print (data)
    Unnamed: 0                        ...                                           Discipline(s) thérapeutique(s).1
0          NaN                        ...                          METHODES DE MASSAGELinfodrenaggio manualeMassa...
1          NaN                        ...                                METHODES ENERGETIQUES MANUELLESKinesiologia
2          NaN                        ...                                      METHODES DE MASSAGEMassaggio classico
3          NaN                        ...                          METHODES AYURVEDIQUESHatha YogaMETHODES PSYCHO...
4          NaN                        ...                          METHODES DE MASSAGEMassaggio classicoMETHODES ...
5          NaN                        ...                                            METHODES PRESCRIPTIVESOmeopatia
6          NaN                        ...                          METHODES ENERGETIQUES MANUELLESReikiMETHODES O...
7          NaN                        ...                          METHODES DE MASSAGEMassaggio tradizionale thai...
8          NaN                        ...                          METHODES DE MASSAGEMassaggio classicoMassaggio...
9          NaN                        ...                                      METHODES DE MASSAGEMassaggio empirico
10         NaN                        ...                          METHODES PSYCHOLOGIQUES COMPLEMENTAIRESConsigl...
11         NaN                        ...                          METHODES PRESCRIPTIVESConsigli dietetici (MCO)...
12         NaN                        ...                          METHODES DE MASSAGEMassaggio classicoMassaggio...
13         NaN                        ...                                   METHODES DE MASSAGEMassaggio terapeutico
14         NaN                        ...                          METHODES DE MASSAGELinfodrenaggio manualeMETHO...

[15 rows x 21 columns]
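In the output above, the last column runs category and therapy names together (for example "METHODES DE MASSAGELinfodrenaggio manuale...") because the cell's nested `div` texts are concatenated. One way to keep them apart is to parse that cell with BeautifulSoup. This is a sketch that relies on the `zCat` and `zThr` classes from the HTML in the question; `split_disciplines` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup


def split_disciplines(cell_html):
    """Map each category (div.zCat) to the therapies (div.zThr) that follow it."""
    soup = BeautifulSoup(cell_html, "html.parser")
    grouped = {}
    current = None
    # find_all returns matches in document order, so each zThr
    # belongs to the most recent zCat seen before it
    for div in soup.find_all("div", class_=["zCat", "zThr"]):
        text = div.get_text(strip=True)
        if "zCat" in div.get("class", []):
            current = text
            grouped[current] = []
        elif current is not None:
            grouped[current].append(text)
    return grouped


# Shortened copy of one cell from the HTML in the question
cell = ('<div class="thZone"><div class="zCat">METHODES DE MASSAGE</div>'
        '<div class="zThr">Linfodrenaggio manuale</div>'
        '<div class="zThr">Massaggio classico</div>'
        '<div class="zCat">METHODES PRESCRIPTIVES</div>'
        '<div class="zThr">Omeopatia</div></div>')

print(split_disciplines(cell))
```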

Here is the Selenium way to expand those tables. There are better ways to handle the time the page takes to load, but I wanted to get this to you quickly, so I just went with time.sleep.

from selenium import webdriver
from selenium.webdriver.common.by import By
import time


url = 'http://www.asca.ch/Partners.aspx?lang=it'

driver = webdriver.Chrome()
driver.get(url)

# Click the dropdown, select GE, click Confermo, click Ricerca
# (the find_element_by_* helpers were removed in Selenium 4.3, so use find_element(By.XPATH, ...))
driver.find_element(By.XPATH, '//*[@id="ctl00_MainContent_ddl_cantons_Arrow"]').click()
time.sleep(2)

driver.find_element(By.XPATH, '//*[@id="ctl00_MainContent_ddl_cantons_DropDown"]/div/ul/li[9]').click()
driver.find_element(By.XPATH, '//*[@id="MainContent__chkDisclaimer"]').click()
driver.find_element(By.XPATH, '//*[@id="MainContent_btn_submit"]').click()
time.sleep(5)

# Function to expand the tables by clicking every row of the results table
def expand_tables():
    rows = driver.find_elements(By.XPATH, '//*[@id="MainContent_gw_partners"]/tbody/tr')
    for row in rows:
        row.click()

# Function to click the next-page button
def click_next_page():
    driver.find_element(By.XPATH, '//*[@id="MainContent_btnNextPackId"]').click()



page = 1
has_next_page = True
while has_next_page:
    print('Page: %s' % page)
    expand_tables()

    ## Your code to parse the tables ##

    try:
        click_next_page()
        page += 1
    except Exception:
        # The next-page button could not be clicked, so stop looping
        print('You are at the end')
        has_next_page = False

    time.sleep(2)

# When finished
driver.close()

Sorry, I couldn't fit my code in the comments, so I'm posting it as an answer.

This is my code for parsing tables:

# To find the first table with the class 'footable'
table = soup.find('table', {'class': 'footable'})

# To get all rows in that table
rows = table.find_all('tr')

# A function to process each row
def processRow(row):
    #All rows with hidden data
    dataFields = row.find_all('td', {'style': True}
    output = {}
    #Fixed index numbers are not ideal but in this case will work
    output['Discipline'] = dataFields[0].text
    output['Cognome'] = dataFields[2].text
    output['Cellulare'] = dataFields[8].text
    output['email'] = dataFields[10].text
    return output

# Declaring a list to store all results
results = []

# Iterating over all the rows and storing the processed result in a list
for row in rows:
    results.append(processRow(row))

print(results)


    click_next_page()
    time.sleep(3)
    count += 1

I think something is wrong. I get a "SyntaxError: invalid syntax" at the line "output = {}", just below the comment "# A function to process each row".
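The error is reported at `output = {}` because the line before it, `dataFields = row.find_all('td', {'style': True}`, is missing its closing parenthesis, and Python only notices once it reaches the next line. Beyond that fix, reading the fields by label rather than by fixed index leaves a blank wherever a value is absent. The sketch below assumes the detail rows use the `footable-row-detail-name` and `footable-row-detail-value` classes shown in the question's HTML. `parse_detail_rows` and the `COLUMNS` selection are hypothetical and should be adjusted to the fields actually needed.

```python
from bs4 import BeautifulSoup

# Labels to export, taken from the sample HTML; adjust as needed
COLUMNS = ["Società", "Cognome", "Via", "NPA", "Luogo",
           "Tel / Cellulare", "Cellulare", "Fax", "e-mail", "Sito WEB"]


def parse_detail_rows(html):
    """Return one dict per detail row, with '' for any missing field."""
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for detail in soup.find_all("tr", class_="footable-row-detail"):
        record = dict.fromkeys(COLUMNS, "")  # start with every column blank
        for pair in detail.find_all("div", class_="footable-row-detail-row"):
            name = pair.find("div", class_="footable-row-detail-name")
            value = pair.find("div", class_="footable-row-detail-value")
            if name is None or value is None:
                continue
            label = name.get_text(strip=True).rstrip(":")
            if label in record:
                record[label] = value.get_text(" ", strip=True)
        records.append(record)
    return records


# Shortened sample modelled on the HTML in the question
sample = """
<table><tbody>
<tr class="footable-row-detail"><td class="footable-row-detail-cell">
<div class="footable-row-detail-inner">
<div class="footable-row-detail-row"><div class="footable-row-detail-name">Cognome:</div><div class="footable-row-detail-value">ABBONDANZIERI Katia</div></div>
<div class="footable-row-detail-row"><div class="footable-row-detail-name">Cellulare:</div><div class="footable-row-detail-value">079 601 92 75</div></div>
</div></td></tr>
</tbody></table>
"""

records = parse_detail_rows(sample)
print(records)
```

With Selenium, the input would be `driver.page_source` on each page. The collected records could then go into a `pandas.DataFrame` and be written with `to_excel`, which needs an Excel writer package such as openpyxl installed.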
