How to extract certain paragraphs from a text file

    def extract_book_info(self):
        books_info = []
        for file in os.listdir(self.book_folder_path):
            title = "None"
            author = "None"
            release_date = "None"
            last_update_date = "None"
            language = "None"
            producer = "None"

            with open(self.book_folder_path + file, 'r', encoding = 'utf-8') as content:
                book_info = content.readlines()
                for lines in book_info:
                    if lines.startswith('Title'):
                        title = lines.strip().split(': ')
                       
                    elif lines.startswith('Author'):
                        try:
                            author = lines.strip().split(': ')
                           
                        except IndexError:
                            author = 'Empty'
                    elif lines.startswith('Release date'):
                        release_date = lines.strip().split(': ')
                         
                    elif lines.startswith('Last updated'):
                        last_update_date = lines.strip().split(': ')
                       
                    elif lines.startswith('Produce by'):
                        producer = lines.strip().split(': ')
                       
                    elif lines.startswith('Language'):
                        language = lines.strip().split(': ')
                        
                    elif lines.startswith('***'):
                        pass
                        

                books_info.append(Book(title, author, release_date, last_update_date, producer, language, self.book_folder_path))

        with open(self.book_info_path, 'w', encoding="utf-8") as book_file:
            for book_info in books_info:
                book_file.write(book_info.__str__() + "\n")

I am using this code to try to extract the book title, author, release_date, last_update_date, language, producer, and book_path.

This is the output I get:

['Title', 'The Adventures of Sherlock Holmes'];;;['Author', 'Arthur Conan Doyle'];;;None;;;None;;;None;;;['Language', 'English'];;;data/books_data/;;;

This is the output I should get. What method should I use to achieve the following output?

The Adventures of Sherlock Holmes;;;Arthur Conan Doyle;;;November29,2002;;;May20,2019;;;English;;;

This is an example of the input:

Title: The Adventures of Sherlock Holmes

Author: Arthur Conan Doyle

Release Date: November 29, 2002 [eBook #1661]
[Most recently updated: May 20, 2019]

Language: English

Character set encoding: UTF-8

Produced by: an anonymous Project Gutenberg volunteer and Jose Menendez

*** START OF THE PROJECT GUTENBERG EBOOK THE ADVENTURES OF SHERLOCK HOLMES ***

cover

str.split returns a list. You are assigning that whole list to a variable that is meant to hold a single value.

'Title: Sherlock Holmes'.split(': ')  # => ['Title', 'Sherlock Holmes']

From what I can gather from your requirement, you want the second element of the split result every time. You can get it like this:

...
for lines in book_info:
    if lines.startswith('Author'):
        _, author = lines.strip().split(': ', 1)
    
    elif...

Be careful: unpacking like this raises a ValueError if the split result does not have exactly two elements, and indexing the result with [1] raises an IndexError if there is no second element. (That is presumably why there is a try around the author field in your code.)
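To avoid the exception handling altogether, str.partition can be used instead of str.split, since it always returns three parts. The following is only a minimal sketch, not your actual class: parse_header and FIELDS are hypothetical names, and the labels are taken from your sample input. Note that the sample uses 'Release Date', '[Most recently updated: ...]' and 'Produced by', so the prefixes 'Release date', 'Last updated' and 'Produce by' in your code never match, which would explain the None values in your output.

```python
import re

# Hypothetical mapping (not from the original post): header label -> output key.
FIELDS = {
    "Title": "title",
    "Author": "author",
    "Release Date": "release_date",
    "Language": "language",
    "Produced by": "producer",
}


def parse_header(lines):
    """Collect header fields until the '***' start marker is reached."""
    info = {key: "None" for key in FIELDS.values()}
    info["last_update_date"] = "None"
    for line in lines:
        line = line.strip()
        if line.startswith("***"):
            break  # end of the metadata header
        if line.startswith("[Most recently updated"):
            # '[Most recently updated: May 20, 2019]' -> 'May 20, 2019'
            info["last_update_date"] = line.strip("[]").partition(": ")[2]
            continue
        # partition never raises: sep is '' when ': ' is absent.
        label, sep, value = line.partition(": ")
        if sep and label in FIELDS:
            if label == "Release Date":
                # Drop the trailing '[eBook #1661]' tag.
                value = re.sub(r"\s*\[.*\]$", "", value)
            info[FIELDS[label]] = value
    return info
```

You could call it as parse_header(content.readlines()) inside your with block and pass the resulting values to Book. If you really want the dates without spaces, as in your expected output, value.replace(" ", "") would do that.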

Also, avoid calling __str__ directly. The built-in str() function calls it for you anyway, so use that instead.
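As a small illustration of that last point: the Book class below is a hypothetical stand-in, since the question does not show the real one, and the ';;;' separator is only inferred from your output.

```python
class Book:
    """Hypothetical stand-in for the Book class, which the question does not show."""

    def __init__(self, title, author):
        self.title = title
        self.author = author

    def __str__(self):
        # ';;;' separator mirrors the output format in the question.
        return ";;;".join([self.title, self.author]) + ";;;"


book = Book("The Adventures of Sherlock Holmes", "Arthur Conan Doyle")
line = str(book)         # preferred over book.__str__()
formatted = f"{book}\n"  # f-strings also call __str__ implicitly
```

In your loop that would become book_file.write(str(book_info) + "\n").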
