
How to read a specific paragraph from multiple folders and files

I have a list of directories and filenames. I want to open each file, read one paragraph from it, and save that paragraph to a list.

The problem is that I don't know how to filter the paragraph out of each file and insert it into my list.

My code so far:

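import os
from glob import iglob

rr = []
# Collect every README.md below the current directory.
file_list = [f for f in iglob('**/README.md', recursive=True) if os.path.isfile(f)]
for f in file_list:
    with open(f, 'rt') as fl:
        lines = fl.read()  # the whole file as a single string
        rr.append(lines)
print(rr)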

This is the format of the files I'm trying to read. The text between the start of the paragraph and the next paragraph is what I'm looking for:

There is text above this paragraph
## Required reading
    * line
    * line
    * line
     /n
### Supplementary reading
There is text below this paragraph

When I run the code, I get all the lines from the files, as expected.

You need to learn how your imported text is structured. How are the paragraphs separated? If they are separated by '\n\n', could you split the text on '\n\n' and take the paragraph you want by its index?

text = 'paragraph one text\n\nparagraph two text\n\nparagraph three text'.split('\n\n')[1]
print(text)
# prints: paragraph two text
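
Applied to the README layout in the question, the same idea might look like the sketch below. It assumes that paragraphs are separated by a blank (or whitespace-only) line and that the wanted paragraph begins with the heading `## Required reading`. The helper name `paragraph_starting_with` and the sample text are made up for illustration.

```python
import re

def paragraph_starting_with(text, heading):
    """Return the first blank-line-separated paragraph that begins with heading, or None."""
    # Split on a newline, optional whitespace, and another newline,
    # so whitespace-only lines also count as paragraph breaks.
    for paragraph in re.split(r'\n\s*\n', text):
        if paragraph.lstrip().startswith(heading):
            return paragraph.strip()
    return None

# Hypothetical sample that assumes a blank line before the heading.
sample = (
    "There is text above this paragraph\n"
    "\n"
    "## Required reading\n"
    "    * line\n"
    "    * line\n"
    "\n"
    "### Supplementary reading\n"
    "There is text below this paragraph\n"
)
print(paragraph_starting_with(sample, '## Required reading'))
```

If the heading is not preceded by a blank line in the real files, this split will not isolate it, and a marker-based search is likely a better fit.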

The other option, as someone else mentioned, is regular expressions (RegEx), which are available in Python through the re module:

import re

RegEx is used to find patterns in text.

Go to https://pythex.org/, paste in a sample from one of your documents, and experiment until you find a pattern that matches the paragraph you want.

Learn more about RegEx here: https://regexone.com/references/python
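
As a starting point for that experimentation, here is one pattern that could work for the file format shown in the question. It is a sketch, not a tested solution for the real files: it assumes the section begins with a line reading exactly `## Required reading` and ends at the next line that starts with `#`. The names `SECTION` and `required_reading` are illustrative.

```python
import re

# Capture everything after the '## Required reading' line, up to the next
# line that starts with '#' (the next heading) or the end of the text.
SECTION = re.compile(
    r'^## Required reading[ \t]*\n(.*?)(?=^#|\Z)',
    re.MULTILINE | re.DOTALL,
)

def required_reading(text):
    """Return the body of the section without its heading, or None if absent."""
    match = SECTION.search(text)
    return match.group(1).rstrip() if match else None

# Sample modelled on the format given in the question.
sample = (
    "There is text above this paragraph\n"
    "## Required reading\n"
    "    * line\n"
    "    * line\n"
    "    * line\n"
    "\n"
    "### Supplementary reading\n"
    "There is text below this paragraph\n"
)
print(required_reading(sample))
```

Because the capture group starts after the heading line, the heading itself never ends up in the result.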

I solved my problem with string slicing.

Basically, I look for a start string and an end string in each file and take the text between them. That text is then appended to a list and written to a file.

start_marker = '## Required reading'
end_marker = '### Supplementary reading'
for f in file_list:
    with open(f, 'rt') as fl:
        text = fl.read()
    start = text.find(start_marker)
    if start == -1:
        continue  # this file has no such section
    end = text.find(end_marker, start)
    if end == -1:
        end = len(text)  # no end marker: keep everything up to the end
    rr.append(text[start:end])

But I still have "## Required reading" in my list and in my file, so I run a second read/write pass to remove it.

def removeHashTag():
    # Read every line, then rewrite the file without the heading line.
    with open("required_reading.md", "r") as f:
        lines = f.readlines()
    with open("required_reading.md", "w") as f:
        for line in lines:
            if line != "## Required reading\n":
                f.write(line)

removeHashTag()
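
The two steps above could probably be folded into one, so that the heading never reaches the list or the output file. A minimal sketch, assuming the same start and end markers as in the answer; `extract_body` and its parameter names are hypothetical.

```python
def extract_body(text, start_marker='## Required reading',
                 end_marker='### Supplementary reading'):
    """Return the text between the two markers, excluding both ('' if start is absent)."""
    start = text.find(start_marker)
    if start == -1:
        return ''
    start += len(start_marker)       # skip past the heading itself
    end = text.find(end_marker, start)
    if end == -1:
        end = len(text)              # no end marker: take the rest of the text
    return text[start:end].strip('\n')

# Sample modelled on the format given in the question.
sample = (
    "There is text above this paragraph\n"
    "## Required reading\n"
    "    * line\n"
    "    * line\n"
    "\n"
    "### Supplementary reading\n"
    "There is text below this paragraph\n"
)
print(extract_body(sample))
```

Calling this on the contents of each file in `file_list` and appending the result to `rr` would then make the second read/write pass unnecessary.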
