Creating multiple text files with unique file names from scraped data

Question

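I'm taking an introductory Python course this semester and am now working on a project. However, I really don't know what code to write to create multiple .txt files, each with a different title.

I scraped all the terms and definitions from http://www.hogwartsishere.com/library/book/99/. For example, a .txt file should be titled "Aconite.txt", and its content should be the term and its definition. Each term and its definition sits in its own p tag, and the term itself is in a b tag inside that p tag. Can I use this to write my code?

I think I need a for loop for this, but I don't really know where to start. I searched StackOverflow and found several solutions, but they all contain code I'm not familiar with and/or relate to a different problem.

This is what I have so far: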

#!/usr/bin/env python

import requests
import bs4

def download(url):
    r = requests.get(url) 
    html = r.text
    soup = bs4.BeautifulSoup(html, 'html.parser')
    terms_definition = []

    #for item in soup.find_all('p'): #define this more precisely
    items = soup.find_all("div", {"class" : "font-size-16 roboto"})
    for item in items:
        terms = item.find_all("p")
        for term in terms:
            #print(term)
            if term.text is not 'None':
                #print(term.text)
                #print("\n")
                term_split = term.text.split()
                print(term_split)

            if term.text != None and len(term.text) > 1:
                if '-' in term.text.split():
                    print(term.text)
                    print('\n')
                if item.find('p'):
                    terms_definition.append(item['p'])
                print(terms_definition)

    return terms_definition

def create_url(start, end):
    list_url = []
    base_url = 'http://www.hogwartsishere.com/library/book/99/chapter/'
    for x in range(start, end):
        list_url.append(base_url + str(x))

    return list_url

def search_all_url(list_url):
    for url in list_url:
        download(url)

#write data into separate text files. Word in front of the dash should be title of the document, term and definition should be content of the text file
#all terms and definitions are in separate p-tags, title is a b-tag within the p-tag
def name_term



def text_files
    path_write = os.path.join('data', name_term +'.txt') #'term' should be replaced by the scraped terms

    with open(path_write, 'w') as f:
    f.write()
#for loop? in front of dash = title / everything before and after dash = text (file content) / empty line = new file



if __name__ == '__main__':
    download('http://www.hogwartsishere.com/library/book/99/chapter/1')
    #list_url = create_url(1, 27)
    #search_all_url(list_url)

Thanks in advance!

Answer 1

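You can loop over the chapter pages to fetch their content, parse each page with bs4, and then save the results to files. Note that range(1, 27) covers chapters 1 through 26; use range(1, 28) if the book also has a chapter 27: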

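import re

import bs4
import requests

# The text before the first " -" is the term. The match is non-greedy so that
# a later dash inside the definition does not end up in the file name.
TERM_PATTERN = re.compile(r'^(.*?) -')

for i in range(1, 27):  # chapters 1 through 26
    url = 'http://www.hogwartsishere.com/library/book/99/chapter/{}/'.format(i)
    html = requests.get(url).text
    soup = bs4.BeautifulSoup(html, 'html.parser')
    items = soup.find_all("div", {"class": "font-size-16 roboto"})
    for item in items:
        for term in item.find_all("p"):
            match = TERM_PATTERN.match(term.text)
            if match is None:
                continue  # skip paragraphs that are not "term - definition"
            # '/' cannot appear in a file name, so replace it
            title = match.group(1).strip().replace('/', '-')
            if title:
                with open(title + '.txt', 'w', encoding='utf-8') as f:
                    f.write(term.text)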
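
Since the question says each term sits in a b tag inside its p tag, the term could also be read from that tag directly instead of being cut out with a regular expression. The following is only a sketch: SAMPLE_HTML and its definition text are made-up markup modelled on the question's description, and extract_terms is a hypothetical helper name, so the real page may need adjustments.

```python
import bs4

# Hypothetical markup, modelled on how the question describes the page.
SAMPLE_HTML = """
<div class="font-size-16 roboto">
  <p><b>Aconite</b> - A poisonous plant, also known as monkshood.</p>
  <p></p>
  <p>An introductory paragraph without a term.</p>
</div>
"""


def extract_terms(html):
    """Return (term, full_text) pairs for every <p> that contains a <b> tag."""
    soup = bs4.BeautifulSoup(html, 'html.parser')
    pairs = []
    for p in soup.find_all('p'):
        bold = p.find('b')
        if bold is None:
            continue  # no <b> tag, so this paragraph is not a term entry
        term = bold.get_text(strip=True)
        if term:
            # Join the text pieces of the paragraph with single spaces.
            pairs.append((term, p.get_text(' ', strip=True)))
    return pairs


for term, text in extract_terms(SAMPLE_HTML):
    print(term, '->', text)
```

Paragraphs without a b tag are skipped, which avoids the error that re.match(...).group(1) raises when a paragraph does not contain " -".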

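Output files: one .txt file per term, named after the term (for example Aconite.txt) and written to the current working directory.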
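
The question builds its path with os.path.join('data', ...), so here is a minimal sketch of writing the files into such a folder instead of the working directory. The helper names safe_filename and write_term_files are made up for illustration, and the set of replaced characters is an assumption about what common file systems reject, not something taken from the original answer.

```python
import os
import re


def safe_filename(term):
    """Build a file name from a term, replacing characters that are
    commonly invalid in file names with '-'."""
    cleaned = re.sub(r'[\\/:*?"<>|]', '-', term).strip()
    return cleaned + '.txt'


def write_term_files(entries, out_dir='data'):
    """Write each (term, text) pair to out_dir/<term>.txt.

    Returns the list of paths that were written.
    """
    os.makedirs(out_dir, exist_ok=True)  # create the folder if it is missing
    paths = []
    for term, text in entries:
        path = os.path.join(out_dir, safe_filename(term))
        with open(path, 'w', encoding='utf-8') as f:
            f.write(text)
        paths.append(path)
    return paths
```

It could be called with the (term, text) pairs collected while scraping, for example write_term_files([('Aconite', 'Aconite - ...')]). Two entries with the same term would overwrite each other, so duplicates need separate handling if they occur.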
