How to download the Excel file which has no link in BeautifulSoup?
How to access #document frames in BeautifulSoup on a page that uses the Microsoft Excel schema?
As the title says, I am scraping a website that contains a list of schools. Clicking an entry redirects you to another .htm page that uses xmlns:urn:schemas-microsoft-com:office:excel.
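All I want is the name, email and website of each school. I believe I can extract those myself, and I will export them to a csv file later. The problem is that I cannot access the table in any way; every attempt gives me None as output.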
Main website: https://myschoolchildren.com/list-of-all-secondary-schools-in-malaysia/#.YzWrtXZBy3A
First link on that website: https://myschoolchildren.com/data/SEK_MEN_Johor.htm
Here is my work so far (the whole code is shared):
import requests
from bs4 import BeautifulSoup


def write(file_name, data_type):
    with open(file_name, "a") as requirement:
        requirement.write("%s\n" % data_type)


def url_parser(url):
    html_doc = requests.get(url).text
    soup = BeautifulSoup(html_doc, 'html.parser')
    return soup


def lxml_url_parser(url):
    html_doc = requests.get(url)
    soup = BeautifulSoup(html_doc.text, 'lxml')
    return soup


def data_fetch(url):
    soup = url_parser(url)
    links = soup.find(class_='entry-content').find_all('a')
    for link in links:
        web = link.get('href')
        soup2 = lxml_url_parser(web)
        #school_name = soup2.find('tbody').find_all('tr')
        print(soup2)
        #print(school_name)
        break


def main():
    url = "https://myschoolchildren.com/list-of-all-secondary-schools-in-malaysia/#.YzWrtXZBy3A"
    data_fetch(url)


if __name__ == "__main__":
    main()
I don't know where I went wrong. I only want the school name, email and school website. Any suggestions?
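Try changing this line in lxml_url_parser
html_doc = requests.get(url)
to
html_doc = requests.get(url.replace('.htm', '_files/sheet001.htm'))
The .htm page you are requesting does not contain the table itself. When the page loads, the table is loaded dynamically from that sheet001.htm file.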
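The suggestion above can be sketched end to end as follows. This is only a minimal illustration, not a definitive implementation: it assumes every linked page follows the same NAME.htm / NAME_files/sheet001.htm layout as the Johor example, and that the sheet is a plain table of tr/td rows. The helper names sheet_url, TableRows, parse_rows and write_csv, and the output file schools.csv, are made up for this example. To keep the sketch runnable without extra packages, the row extraction uses the standard library's html.parser rather than BeautifulSoup; with BeautifulSoup the equivalent step would be looping over soup.find_all('tr').

```python
import csv
from html.parser import HTMLParser


def sheet_url(page_url, sheet="sheet001.htm"):
    """Map .../NAME.htm to .../NAME_files/sheet001.htm (assumed export layout)."""
    base, dot, ext = page_url.rpartition(".")
    if not dot or ext.lower() not in ("htm", "html"):
        raise ValueError(f"not an .htm/.html URL: {page_url!r}")
    return f"{base}_files/{sheet}"


class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def _close_cell(self):
        if self._cell is not None and self._row is not None:
            # Join the fragments and collapse runs of whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
        self._cell = None

    def _close_row(self):
        self._close_cell()
        if self._row is not None and any(self._row):  # skip all-empty rows
            self.rows.append(self._row)
        self._row = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._close_row()  # tolerate a missing </tr>
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._close_cell()  # tolerate a missing </td>
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._close_cell()
        elif tag == "tr":
            self._close_row()


def parse_rows(html):
    """Return the table rows found in html as lists of cell strings."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    parser._close_row()  # flush a row left open at end of input
    return parser.rows


def write_csv(rows, fileobj):
    """Write the rows to an already opened text file object as CSV."""
    csv.writer(fileobj).writerows(rows)


# Usage sketch (needs network access and the third-party requests package):
#   import requests
#   url = sheet_url("https://myschoolchildren.com/data/SEK_MEN_Johor.htm")
#   html = requests.get(url).text
#   with open("schools.csv", "w", newline="", encoding="utf-8") as f:
#       write_csv(parse_rows(html), f)
```

Which columns hold the name, email and website is not stated in the question, so the sketch keeps every column; inspect the first few rows before selecting the ones you need.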