
Creating multiple text files with unique file names from scraped data

I took an introductory Python course this semester and am now trying to do a project. However, I don't really know what code I should write to create multiple .txt files, each with a different file name.

I scraped all the terms and definitions from the website http://www.hogwartsishere.com/library/book/99/ . The title of a .txt file should for example be 'Aconite.txt', and the content of the file should be the term and its definition. Every term with its definition can be found in a separate p-tag, and the term itself is a b-tag within the p-tag. Can I use this structure to write my code?
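To show what I mean by that structure, here is a minimal sketch that only prints the pairs (assuming the 'font-size-16 roboto' div from my code below is the right container):

import requests
import bs4

url = 'http://www.hogwartsishere.com/library/book/99/chapter/1'
soup = bs4.BeautifulSoup(requests.get(url).text, 'html.parser')
div = soup.find('div', {'class': 'font-size-16 roboto'})  # container that holds the terms
for p in div.find_all('p'):
    b = p.find('b')  # the term name sits in a <b> inside the <p>
    if b:
        print(b.text, '=>', p.text[:60])  # term followed by the start of its definition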

I suppose I will need a for-loop for this, but I don't really know where to start. I searched Stack Overflow and found several solutions, but all of them contain code I am not familiar with and/or relate to a different issue.

This is what I have so far:

#!/usr/bin/env python

import os
import requests
import bs4

def download(url):
    r = requests.get(url) 
    html = r.text
    soup = bs4.BeautifulSoup(html, 'html.parser')
    terms_definition = []

    #for item in soup.find_all('p'): #define this more precisely
    items = soup.find_all("div", {"class" : "font-size-16 roboto"})
    for item in items:
        terms = item.find_all("p")
        for term in terms:
            #print(term)
            if term.text:  # skip empty paragraphs
                #print(term.text)
                #print("\n")
                term_split = term.text.split()
                print(term_split)

            if term.text is not None and len(term.text) > 1:
                if '-' in term.text.split():
                    print(term.text)
                    print('\n')
                if item.find('p'):
                    terms_definition.append(term.text)  # collect the full 'term - definition' paragraph
                print(terms_definition)

    return terms_definition

def create_url(start, end):
    list_url = []
    base_url = 'http://www.hogwartsishere.com/library/book/99/chapter/'
    for x in range(start, end):
        list_url.append(base_url + str(x))

    return list_url

def search_all_url(list_url):
    for url in list_url:
        download(url)

#write data into separate text files. The word in front of the dash should be the title of the document; the term and definition should be the content of the text file.
#all terms and definitions are in separate p-tags, the title is a b-tag within the p-tag
def name_term(term):
    return term.split('-')[0].strip()  # the word in front of the dash becomes the file name


def text_files(term):
    path_write = os.path.join('data', name_term(term) + '.txt')  # built from the scraped term

    with open(path_write, 'w') as f:
        f.write(term)  # the term and its definition are the file content
#for loop? in front of dash = title / everything before and after dash = text (file content) / empty line = new file



if __name__ == '__main__':
    download('http://www.hogwartsishere.com/library/book/99/chapter/1')
    #list_url = create_url(1, 27)
    #search_all_url(list_url)

Thanks in advance!

You can iterate over all pages (1-27) to get their content, then parse each page with bs4 and save the results to files:

import requests
import bs4
import re

for i in range(1, 28):  # chapters 1 through 27
    r = requests.get('http://www.hogwartsishere.com/library/book/99/chapter/{}/'.format(i)).text
    soup = bs4.BeautifulSoup(r, 'html.parser')
    items = soup.find_all("div", {"class": "font-size-16 roboto"})
    for item in items:
        terms = item.find_all("p")
        for term in terms:
            match = re.match('^(.*) -', term.text)
            if not match:
                continue  # skip paragraphs that are not 'term - definition' entries
            title = match.group(1).replace('/', '-')  # '/' is not allowed in file names
            with open(title + '.txt', 'w', encoding='utf-8') as f:
                f.write(term.text)

Output files:

(screenshot of the generated .txt files)
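If you prefer not to rely on the dash, a variant of the same loop (my own sketch, not guaranteed against every chapter's markup) can take the file name from the <b> tag inside each <p> and collect everything in a 'data' folder, matching the path used in the question:

import os
import requests
import bs4

os.makedirs('data', exist_ok=True)  # keep the output files in one folder

for i in range(1, 28):
    url = 'http://www.hogwartsishere.com/library/book/99/chapter/{}/'.format(i)
    soup = bs4.BeautifulSoup(requests.get(url).text, 'html.parser')
    for item in soup.find_all('div', {'class': 'font-size-16 roboto'}):
        for term in item.find_all('p'):
            b = term.find('b')  # the term name sits in a <b> inside the <p>
            if b is None or not b.text.strip():
                continue  # skip paragraphs without a bold term name
            title = b.text.strip().rstrip('-').strip().replace('/', '-')
            with open(os.path.join('data', title + '.txt'), 'w', encoding='utf-8') as f:
                f.write(term.text)

Either way, the file content stays the full paragraph text, so each file contains the term together with its definition.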
