Parsing HTML source code using AppleScript

I'm trying to parse an HTML file which I have converted to a TXT file inside of Automator.

I previously downloaded the HTML file from a website using Automator, and I am now struggling to parse the source code.

Ideally, I want to extract just the table, and I need to repeat this for 1,800 different HTML files.

Here is an example of the source code:

</head>
<body>
<div id="header">
    <div class="wrapper">
        <span class="access">
        <div id="fb-root"></div>


    <span class="access">
     Gold Account: <a class="upgrade" title="Account Details" href="http://www.hedge-professionals.com/account-details.html" >Active </a>       Logged in as Edward&nbsp;&nbsp; | &nbsp;&nbsp;<a href="javascript:void(0);" onclick='logout()' class="logout">Sign Out</a>

    </span>
                                    </span>
    </div><!-- /wrapper -->
</div><!-- /header -->

<div id="masthead">
    <div class="wrapper">   
        <a href="http://www.hedge-professionals.com" ><img src="http://www.hedge-professionals.com/images/hedgep_logo_white.png" alt="Hedge Professionals Database" width="333" height="46" class="logo" border="0" /></a>
        <div id="navigation">
            <ul>
<li ><a href='http://www.hedge-professionals.com/dashboard.html' >Dashboard</a></li>    <li ><a href='http://www.hedge-professionals.com/people.html'class='current' >People</a></li><li ><a href='http://www.hedge-professionals.com/watchlists.html' >My Watchlists</a></li><li ><a href='http://www.hedge-professionals.com/my-searches.html' >My Searches</a></li><li ><a href='http://www.hedge-professionals.com/my-profile.html' >My Profile</a></li></ul>               
        </div><!-- /navigation -->

    </div><!-- /wrapper -->     
</div><!-- /masthead -->


<div id="content">
    <div class="wrapper">
        <div id="main-content">

 <!-- per Project stuff -->
    <span class="section">
                <img src="http://www.hedge-professionals.com/images/people/noimage_53x53.jpg" alt="Christian Sieling" width="52" height="53" class="profile-pic" id="profile-pic-104947"/>
                <h1><span id="profile-name-104947" >Christian Sieling</span></h1>
                                    <ul class="gbutton-group right">
                    <li><a class="gbutton bold pill" href="http://www.hedge-professionals.com/people.html">&laquo; Back </a></li>
                    <li><a class="gbutton bold pill boxy on-click" href="http://www.hedge-professionals.com/addtoWatchlist.php?usr=114752"  id="row-104947" title='Add to Watchlist' >Add to Watchlist</a></li>
                </ul>

                <div style="float:right;padding:3px 3px;text-align:center;margin-top:5px;" >
                <span id="profile-updated-date" >Updated On: 4 Aug, 2010</span><br/>
                <a class="gbutton bold pill" href="http://www.hedge-professionals.com/profile/suggest/people/104947/Christian-Sieling" style="margin:5px;" title='Report Inaccurate Data' >Report Inaccurate Data</a>
                </div>
                                    <h2><span id="profile-details-104947" > at <a href="http://www.hedge-professionals.com/quicksearch/search/Lumix+Capital+Management+Ltd." ><span title='Lumix Capital Management Ltd.' >Lumix Capital Management Ltd.</span></a></span><input type="hidden" name="sub-id" id="sub-id" value="114752"></h2>

            </span>

            <table width="100%" border="0" cellspacing="0" cellpadding="0" id="profile-table">
                                                        <tr>
                    <th>Role</th>
                    <td>
                    <p>Other</p>                            </td>
                </tr>
                <tr>  
                    <th>Organisation Type</th>
                    <td>
                    <p>Asset Manager</p>                        </td>
                </tr>
                <tr>
                    <th>Email</th>
                    <td><a href="mailto:cs@lumixcapital.com" title="cs@lumixcapital.com" >cs@lumixcapital.com</a></td>
                </tr>
                <tr>
                    <th>Website</th>
                    <td><a href="http://www.lumixcapital.com/" target="_new" title="http://www.lumixcapital.com/" >http://www.lumixcapital.com/</a></td>
                </tr>
                <tr>
                    <th>Phone</th>
                    <td>41 78 616 7334</td>
                </tr>
                <tr>
                    <th>Fax</th>
                    <td></td> 
                </tr>
                <tr>
                    <th>Mailing Address</th>
                    <td>Birrenstrasse 30</td>
                </tr>
                <tr>
                    <th>City</th>
                    <td>Schindellegi</td>
                </tr>
                <tr>
                    <th>State</th>
                    <td>CH</td>
                </tr>
                <tr>
                    <th>Country</th>
                    <td>Switzerland</td>
                </tr>
                <tr>
                    <th class="lastrow" >Zip/ Postal Code</th>
                    <td class="lastrow" >8834</td>
                </tr>
        </table>
                </div><!-- /main-content -->
                    <div id="sidebar"  >
                    </div>

            <div id="similar_sidebar" class="similar_refine" >



            </div>
                            </div><!-- /wrapper -->
</div><!-- /content -->

<div id="footer">

</div>

Here is my AppleScript attempt, which uses text item delimiters to extract the table:

set p to input
set ex to extractBetween(p, "<table>", "</table>") -- extract the URL
to extractBetween(SearchText, startText, endText)
set tid to AppleScript's text item delimiters
set AppleScript's text item delimiters to startText
set endItems to text of text item -1 of SearchText
set AppleScript's text item delimiters to endText
set beginningToEnd to text of text item 1 of endItems
set AppleScript's text item delimiters to tid
return beginningToEnd
end extractBetween

How can I parse the table from the HTML file?

You're really close. The problem is your startText variable. The literal string "<table>" never appears in the HTML, because the opening table tag carries attributes, so it can't be found. The line that starts the table is actually...

<table width="100%" border="0" cellspacing="0" cellpadding="0" id="profile-table">

So I modified your code to look for that tag in 2 steps. First...

<table

And then this separately...

>

This way we can ignore the attributes inside the table tag (width, border, etc.), which I assume will vary between the files. What remains is only the contents of the table. Try this...

set p to input
set ex to extractBetween(p, "<table", ">", "</table>")

to extractBetween(SearchText, startText1, startText2, endText)
    set tid to AppleScript's text item delimiters
    -- keep everything after the last occurrence of startText1 ("<table")
    set AppleScript's text item delimiters to startText1
    set endItems to text item -1 of SearchText
    -- cut that off at the first occurrence of endText ("</table>")
    set AppleScript's text item delimiters to endText
    set beginningToEnd to text item 1 of endItems
    -- drop the rest of the opening tag, up to its first startText2 (">")
    set AppleScript's text item delimiters to startText2
    set finalText to (text items 2 thru -1 of beginningToEnd) as text
    -- restore the original delimiters
    set AppleScript's text item delimiters to tid
    return finalText
end extractBetween
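
For readers who want to check the splitting logic outside Automator, here is a rough Python sketch of the same three-step idea. The function name extract_between and its parameter names are only illustrative mirrors of the handler above, not part of any library. Like the AppleScript version, it assumes the start and end markers are actually present; if "<table" is missing it simply works on the whole text.

```python
def extract_between(search_text, start1, start2, end):
    """Mirror of the three-step text-item-delimiter approach (illustrative only)."""
    # Step 1: keep everything after the last occurrence of start1 ("<table").
    after_start = search_text.split(start1)[-1]
    # Step 2: cut that off at the first occurrence of end ("</table>").
    up_to_end = after_start.split(end)[0]
    # Step 3: drop the rest of the opening tag (up to the first ">"),
    # then re-join the remaining pieces with the same delimiter.
    return start2.join(up_to_end.split(start2)[1:])


if __name__ == "__main__":
    page = '<body><table width="100%" id="profile-table"><tr><td>x</td></tr></table></body>'
    print(extract_between(page, "<table", ">", "</table>"))
```

Re-joining with ">" in step 3 matters: the table body itself contains many ">" characters, so only the first piece (the tag attributes) should be discarded.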

Rather than writing your own HTML parser, you can use the HTML parser in Safari via the do JavaScript command, provided the page is open in a Safari document. JavaScript has built-in functionality for working with HTML elements and data.

This script gets the HTML for just the first table in a page:

tell application "Safari"
    tell document 1
        set theFirstTableHTML to do JavaScript "document.getElementsByTagName('table')[0].innerHTML"
    end tell
end tell

You can use this technique to apply basic DOM scripting to any page and pull out whatever data you need, for example just the values of the table cells.
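
If Safari is not an option, the same "let a real parser do the work" idea can be sketched with Python's standard html.parser module. This is a minimal sketch under some assumptions: the class name ProfileTableParser and the function parse_profile_table are hypothetical names chosen for illustration, and the code assumes each row of the first table holds one th header followed by one td value, as in the sample page from the question.

```python
from html.parser import HTMLParser


class ProfileTableParser(HTMLParser):
    """Collect header -> value pairs from the th/td cells of the first table."""

    def __init__(self):
        super().__init__()
        self.rows = {}          # header text -> cell text
        self._in_table = False
        self._done = False      # True once the first table has been closed
        self._cell = None       # "th" or "td" while inside such a cell
        self._header = None     # most recent th text awaiting its td
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if self._done:
            return
        if tag == "table":
            self._in_table = True
        elif self._in_table and tag in ("th", "td"):
            self._cell = tag
            self._buffer = []

    def handle_data(self, data):
        # Text inside nested tags such as <p> or <a> is collected as well.
        if self._cell is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "table" and self._in_table:
            self._in_table = False
            self._done = True
        elif tag == self._cell:
            # Collapse runs of whitespace and newlines into single spaces.
            text = " ".join("".join(self._buffer).split())
            if tag == "th":
                self._header = text
            elif self._header is not None:
                self.rows[self._header] = text
                self._header = None
            self._cell = None


def parse_profile_table(html):
    """Return a dict of header -> value for the first table in html."""
    parser = ProfileTableParser()
    parser.feed(html)
    parser.close()
    return parser.rows
```

To cover many saved pages, one could read each file's text in a loop and call parse_profile_table on it, writing the resulting dictionaries out as rows of a CSV file.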

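Try reading the file and filtering it in the shell. Note that grep matches one line at a time, so a pattern such as <table.*table> cannot match a table that spans several lines. A sed line range handles that case:

set xxx to read alias "Mac OS X:Users:paolo:Desktop:paolo.html"
-- print every line from the one containing "<table" through the one containing "</table>"
set yyy to do shell script "echo " & quoted form of xxx & " | sed -n '/<table/,/<\\/table>/p'"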

A one-liner that works on the page currently open in Safari:

tell application "Safari" to set sourceCode to text (offset of "<table" in (source of document 1 as string)) thru ((offset of "/table" in (source of document 1 as string)) + (count of "/table")) of (source of document 1 as string)
