
Extracting href links from within website source with Python

I've asked this question before to no avail. I am trying to figure out how to use bs4 to grab the download links from within a website's source. The problem I can't solve is that the links sit inside a dynamic content library. I've removed the earlier HTML snippet; see the examples below.

We've been able to grab the links with this script only after manually grabbing the source code from the website:

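import re

# SourceCode.txt holds the page source that was pasted in by hand.
links = []
with open("SourceCode.txt") as source:
    for line in source:
        line = line.rstrip()
        # Capture every href value that ends in "tif".
        links.extend(re.findall(r'href=[\'"]?([^\'" >]+tif)', line))

# The output file name is an assumption; the original snippet does not show it.
with open("Download.html", "w") as result:
    result.write("<html>\n<body>\n")
    for link in links:
        result.write('<a href="%s">link</a><br>\n' % link)
    # len() returns an int, so convert it before concatenating.
    result.write("There are " + str(len(links)) + " links\n")
    result.write("</body>\n</html>\n")

print("Download HTML page created.")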

But this only works after going to the website and doing view source -> Ctrl+A (select all) -> copy -> paste into SourceCode.txt. I would like to remove the manual labor from all this.

I'd greatly appreciate any information/tips/advice!

EDIT

I wanted to add some more information about the website we are using. The Library content only shows up once it has been manually expanded. Otherwise, the content (i.e., the download links/href *.tif) is not visible. Here's an example of what we see:

Source Code of site without opening the library element.

<html><body>

Source Code after opening library element.

<html><body>
<h3>Library</h3>
<div id="libraryModalBody">

    <div><table><tbody>

    <tr>
    <td>Tile12</td>
    <td><a href="http://www.website.com/path/Tile12.zip">Button</a></td>
    </tr>

    </tbody></table></div>

</div> 

Source code after expanding all download options.

<html><body>
<h3>Library</h3>
<div id="libraryModalBody">
    <div><table><tbody>
    <tr>
    <td>Tile12</td>
    <td><a href="http://www.website.com/path/Tile12.zip">Button</a></td>
    </tr>
    <tr>
    <td>Tile12_Set1.tif</td>
    <td><a href="http://www.website.com/path/Tile12_Set1.tif">Button</a></td>
    </tr>
    <tr>
    <td>Tile12_Set2.tif</td>
    <td><a href="http://www.website.com/path/Tile12_Set2.tif">Button</a></td>
    </tr>
    </tbody></table></div>
</div>

Our end goal is to grab the download links by entering only the website URL. The issue seems to be the way the content is displayed (i.e., dynamic content that is only visible after manually expanding the library).

Do not try to parse HTML with regular expressions. It is fragile and breaks easily on real-world markup. Use BeautifulSoup4 instead:

from urllib.request import urlopen  # Python 3; on Python 2 this was urllib2.urlopen
from bs4 import BeautifulSoup

url = "http://www.your-server.com/page.html"
document = urlopen(url)
# Name the parser explicitly so bs4 does not have to guess one.
soup = BeautifulSoup(document, "html.parser")

# look for all URLs:
found_urls = [link["href"] for link in soup.find_all("a", href=True)]

# look only for URLs to *.tif files:
found_tif_urls = [link["href"] for link in soup.find_all("a", href=True) if link["href"].endswith(".tif")]
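To check the filtering logic without touching the network, the same idea can be run against the expanded markup from the question. The following is only a minimal sketch: it uses the standard library's html.parser in place of bs4 so it has no third-party dependency, and the names SAMPLE_HTML, LinkCollector and find_links are made up for illustration.

```python
from html.parser import HTMLParser

# Expanded markup copied from the question (closing tags added so it is complete).
SAMPLE_HTML = """
<html><body>
<h3>Library</h3>
<div id="libraryModalBody">
    <div><table><tbody>
    <tr>
    <td>Tile12</td>
    <td><a href="http://www.website.com/path/Tile12.zip">Button</a></td>
    </tr>
    <tr>
    <td>Tile12_Set1.tif</td>
    <td><a href="http://www.website.com/path/Tile12_Set1.tif">Button</a></td>
    </tr>
    <tr>
    <td>Tile12_Set2.tif</td>
    <td><a href="http://www.website.com/path/Tile12_Set2.tif">Button</a></td>
    </tr>
    </tbody></table></div>
</div>
</body></html>
"""


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag whose target ends with a suffix."""

    def __init__(self, suffix):
        super().__init__()
        self.suffix = suffix
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        # attrs is a list of (name, value) pairs for the tag.
        href = dict(attrs).get("href")
        if href and href.endswith(self.suffix):
            self.links.append(href)


def find_links(html, suffix=".tif"):
    """Return all link targets in html that end with suffix (hypothetical helper)."""
    collector = LinkCollector(suffix)
    collector.feed(html)
    return collector.links


tif_links = find_links(SAMPLE_HTML)
for link in tif_links:
    print(link)
```

Note that a parser can only see links that are present in the markup it is given. For the collapsed source shown in the question, which contains no anchors, find_links returns an empty list.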

You may also want to take a look at the PyQuery library, which supports a subset of jQuery's CSS selectors:

from pyquery import PyQuery

pq = PyQuery(body)  # body: the page's HTML as a string
pq('div.content div#filter-container div.filter-section')

