Creating multiple text files with unique file names from scraped data
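I'm taking an introductory Python course this semester and am now trying to do a project. However, I don't really know what code I should write to create multiple .txt files, each with a different title.
I scraped all the terms and definitions from the website http://www.hogwartsishere.com/library/book/99/. A .txt file should be titled, for example, "Aconite.txt", and its content should be the title and the definition. Each term and its definition can be found in a separate p-tag, and the term itself is a b-tag within the p-tag. Can I use this to write my code?
I suppose I need a for loop for this, but I don't really know where to start. I searched StackOverflow and found several solutions, but all of them contain code I am not familiar with and/or relate to a different problem.
This is what I have so far: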
#!/usr/bin/env python
import requests
import bs4

def download(url):
    r = requests.get(url)
    html = r.text
    soup = bs4.BeautifulSoup(html, 'html.parser')
    terms_definition = []
    #for item in soup.find_all('p'): #define better
    items = soup.find_all("div", {"class" : "font-size-16 roboto"})
    for item in items:
        terms = item.find_all("p")
        for term in terms:
            #print(term)
            if term.text is not 'None':
                #print(term.text)
                #print("\n")
                term_split = term.text.split()
                print(term_split)
            if term.text != None and len(term.text) > 1:
                if '-' in term.text.split():
                    print(term.text)
                    print('\n')
        if item.find('p'):
            terms_definition.append(item['p'])
    print(terms_definition)
    return terms_definition

def create_url(start, end):
    list_url = []
    base_url = 'http://www.hogwartsishere.com/library/book/99/chapter/'
    for x in range(start, end):
        list_url.append(base_url + str(x))
    return list_url

def search_all_url(list_url):
    for url in list_url:
        download(url)

#write data into separate text files. Word in front of the dash should be title of the document, term and definition should be content of the text file
#all terms and definitions are in separate p-tags, title is a b-tag within the p-tag
def name_term
def text_files
    path_write = os.path.join('data', name_term +'.txt') #'term' should be replaced by the scraped terms
    with open(path_write, 'w') as f:
        f.write()
#for loop? in front of dash = title / everything before and after dash = text (file content) / empty line = new file

if __name__ == '__main__':
    download('http://www.hogwartsishere.com/library/book/99/chapter/1')
    #list_url = create_url(1, 27)
    #search_all_url(list_url)
Thanks in advance!
You can loop over all the pages (1-27) to fetch their content, parse each page with bs4, and then save the results to files:
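import re

import bs4
import requests

# range() excludes its upper bound, so 28 covers chapters 1-27
for i in range(1, 28):
    url = 'http://www.hogwartsishere.com/library/book/99/chapter/{}/'.format(i)
    html = requests.get(url).text
    soup = bs4.BeautifulSoup(html, 'html.parser')
    items = soup.find_all("div", {"class": "font-size-16 roboto"})
    for item in items:
        for term in item.find_all("p"):
            # Capture the text in front of " -" to use as the file title
            match = re.match('^(.*) -', term.text)
            if match is None:
                continue  # skip paragraphs that are not "term - definition"
            # '/' cannot appear in a file name, so replace it
            title = match.group(1).replace('/', '-')
            with open(title + '.txt', 'w', encoding='utf-8') as f:
                f.write(term.text)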
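The output is one .txt file per term (for example Aconite.txt), written to the current working directory.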
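If you would rather keep the file-writing step separate from the scraping, as the question's `name_term` / `text_files` stubs suggest, something like the following might work. It is only a sketch: `split_term`, `safe_filename`, and `write_term_files` are hypothetical helper names, and it assumes each paragraph's text looks like "Term - definition" with a plain hyphen. The `data` folder comes from the `os.path.join('data', ...)` line in the question.

```python
import os
import re


def split_term(paragraph_text):
    """Return (term, full_text) for a 'Term - definition' paragraph, else None."""
    text = paragraph_text.strip()
    # Non-greedy match: the term is everything before the first " - "
    match = re.match(r'(.+?)\s+-\s+', text)
    if match is None:
        return None
    return match.group(1).strip(), text


def safe_filename(term):
    """Replace characters that are unsafe in file names with '-' and add .txt."""
    return re.sub(r'[\\/:*?"<>|]', '-', term).strip() + '.txt'


def write_term_files(paragraphs, out_dir='data'):
    """Write one file per term into out_dir; return the paths written."""
    os.makedirs(out_dir, exist_ok=True)  # create the folder if it is missing
    written = []
    for paragraph_text in paragraphs:
        parsed = split_term(paragraph_text)
        if parsed is None:
            continue  # not a "term - definition" paragraph
        term, text = parsed
        path = os.path.join(out_dir, safe_filename(term))
        with open(path, 'w', encoding='utf-8') as f:
            f.write(text)
        written.append(path)
    return written


if __name__ == '__main__':
    # Made-up sample paragraphs, not text taken from the site
    sample = ['Aconite - placeholder definition for this example', 'no dash here']
    print(write_term_files(sample))
```

Inside the scraping loop above you could then call `write_term_files(term.text for term in item.find_all("p"))` instead of opening the files inline. Paragraphs without a " - " separator are skipped rather than raising an error.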