Python: How to access a webpage, click specific links and copy the data within them to a text file?
I am very new to Python and programming; all I know is how to write simple scripts for my day-to-day office work. However, I have run into a scenario where I have to use Python to access a particular webpage, namely the search output of a specific bioinformatics web server.
On that webpage there is a table in which the second column of each row is a hyperlink that opens a small pop-up box containing a FASTA file of a protein sequence.
I would like to write a script that systematically clicks these links one after another, copies the FASTA sequence from each, and pastes them all into a text file.
Is this kind of automation possible with Python? If so, where do I start in terms of modules for accessing Internet Explorer / webpages, etc.? If you could point me in the right direction or give me an example script, I could try to do the rest myself!
Thanks a lot!
I would post what I have tried, but I honestly have no idea where to start!
This takes about a minute and a half for me to run, and at the end it opens a text file containing the sequences. You will of course need to fill in your own credentials, etc., at the bottom.
import os
import mechanize
import cookielib
from bs4 import BeautifulSoup
from urlparse import urljoin


class SequenceDownloader(object):

    def __init__(self, base_url, analyzes_page, email, password, result_path):
        self.base_url = base_url
        self.login_page = urljoin(self.base_url, 'login')
        self.analyzes_page = urljoin(self.base_url, analyzes_page)
        self.email = email
        self.password = password
        self.result_path = result_path
        self.browser = mechanize.Browser()
        self.browser.set_handle_robots(False)
        # set cookie
        cj = cookielib.CookieJar()
        self.browser.set_cookiejar(cj)

    def login(self):
        self.browser.open(self.login_page)
        # select the first (and only) form and log in
        self.browser.select_form(nr=0)
        self.browser.form['email'] = self.email
        self.browser.form['password'] = self.password
        self.browser.submit()

    def get_html(self, url):
        self.browser.open(url)
        return self.browser.response().read()

    def scrape_overview_page(self, html):
        sequences = []
        soup = BeautifulSoup(html)
        table = soup.find('table', {'class': 'styled data-table'})
        table_body = table.find('tbody')
        rows = table_body.find_all('tr', {'class': 'search_result'})
        for row in rows:
            cols = row.find_all('td')
            # the link in the second column opens the FASTA pop-up
            sequence_url = cols[1].a.get('href')
            sequence_html = self.get_html(sequence_url)
            sequence_soup = BeautifulSoup(sequence_html)
            sequence = sequence_soup.find('pre').text
            sequences.append(sequence)
        return sequences

    def save(self, sequences):
        with open(self.result_path, 'w') as f:
            for sequence in sequences:
                f.write(sequence + '\n')

    def get_sequences(self):
        self.login()
        overview_html = self.get_html(self.analyzes_page)
        sequences = self.scrape_overview_page(overview_html)
        self.save(sequences)


if __name__ == '__main__':
    base_url = r'https://usgene.sequencebase.com'
    analyzes_page = 'user/reports/123/analyzes/9876'
    email = 'user1998510@gmail.com'
    password = 'YourPassword'
    result_path = r'C:\path\to\result.fasta'

    sd = SequenceDownloader(base_url, analyzes_page, email, password, result_path)
    sd.get_sequences()
    os.startfile(result_path)
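If you only want the link-extraction step and cannot install mechanize or BeautifulSoup, the same "grab the href from the second column of each result row" idea can be sketched with just the standard library's html.parser (Python 3). The table markup below is an invented stand-in for whatever your server actually returns, so adjust the column index and class names to match your page:

```python
from html.parser import HTMLParser

class SecondColumnLinkParser(HTMLParser):
    """Collect the href of the first <a> found in the second <td> of each <tr>."""
    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []
        self.td_index = -1  # column counter within the current row

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self.td_index = -1          # new row: reset the column counter
        elif tag == 'td':
            self.td_index += 1          # entering the next column
        elif tag == 'a' and self.td_index == 1:
            href = dict(attrs).get('href')  # second column (index 1)
            if href:
                self.links.append(href)

# hypothetical HTML mimicking the search-result table described above
html = """
<table><tbody>
<tr class="search_result"><td>1</td><td><a href="/seq/1">P001</a></td></tr>
<tr class="search_result"><td>2</td><td><a href="/seq/2">P002</a></td></tr>
</tbody></table>
"""

parser = SecondColumnLinkParser()
parser.feed(html)
print(parser.links)  # → ['/seq/1', '/seq/2']
```

You would still need something like mechanize (or requests plus a session) to log in and fetch each pop-up page, but this shows that the table-walking part itself needs nothing beyond the standard library.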
Disclaimer: the technical posts on this site are licensed under CC BY-SA 4.0. If you wish to republish, please credit this site or the original source. For any questions contact: yoyou2525@163.com.