Extracting text from text file with recurring nested pattern

I am struggling to extract text from a file. The text is in the following format, with [] signifying a delimiter.

File Text:

[Dataset 1] "text" [Filename 1] "text" [Filename 2] "text" [Key Data Delimiter] !key data! [Key Data Delimiter] "text" [Filename 3] "text" [Dataset 2] "text" [Filename 1] [Key Data Delimiter] key data [Key Data Delimeter] "text" [Filename 2] [Dataset 3]...

Desired Output:

[Dataset 1], [Filename 2], !key data!.
[Dataset 2], [Filename 1], !key data!.

The filename to report is the one immediately followed by the key data delimiter, before the next Dataset begins. There is only one file containing key data per Dataset.

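import re
with open('file.txt', 'r') as f:
    # capture the text between each pair of key data delimiters (either spelling of "Delimiter")
    TextBetween_KeyDataDelimeter = re.findall(r'\[Key Data Delim[ie]ter\](.+?)\[Key Data Delim[ie]ter\]', f.read(), re.DOTALL)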

I'm thinking of nested for loops with if/else statements, but that seems quite messy. Can someone please point me to docs I should read to help me out?

Here's an option without regex, just some string and list manipulations. Somewhat convoluted, but it works:

kds = """[Dataset 1] "text1" [Filename 1] "text2" [Filename 2] "text3" [Key Data Delimiter] !key data1![Key Data Delimiter] "text4" [Filename 3] "text5" [Dataset 2] "text6" [Filename 1] [Key Data Delimiter] key data2 [Key Data Delimeter] "text7" [Filename 2]"""

# split the text file into datasets
nkds = kds.replace('[Dataset','xxx[Dataset').split('xxx')

for k in nkds[1:]:
    entry = ''
    #split each dataset into components
    nk = k.replace('[','xxx[').split('xxx')[1:]
    #get the name of the dataset
    entry+= nk[0].replace(']',']xxx').split('xxx')[0]
    for i, part in enumerate(nk):
        #find the index position of the delimiter in the dataset list
        if '[Key Data Delimiter]' in part:
            #the previous component holds the file name
            entry+= nk[i-1].replace(']',']xxx').split('xxx')[0]
            entry+= part.split(']')[1].strip()
            break
    print(entry)

Output:

[Dataset 1][Filename 2]!key data1!
[Dataset 2][Filename 1]key data2

With the re.findall function, would you please try:

import re

with open('file.txt') as f:
    for line in f:
        m = re.findall(r'(\[Dataset\b[^][]*]).*?(\[Filename\b[^]]*])[^[]*\[Key Data Delimiter\](.*?)\[Key Data Delimiter]', line)
        print([x for i in m for x in i])        # flatten list of tuples

Output:

['[Dataset 1]', '[Filename 2]', ' !key data! ', '[Dataset 2]', '[Filename 1]', ' key data ']

The regex matches the dataset, the filename that is immediately followed by the key data delimiter, and the key data surrounded by the key data delimiters.
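To make the pieces of that pattern easier to follow, here is the same regex spelled out with re.VERBOSE and one comment per part. This is only a sketch: the names PATTERN and sample are made up for illustration, and the sample string assumes the "Delimeter" typo has been corrected.

```python
import re

# Same pattern as above, written with re.VERBOSE so each part can be commented.
# Literal spaces are escaped because re.VERBOSE ignores unescaped whitespace.
PATTERN = re.compile(r"""
    (\[Dataset\b[^][]*])        # group 1: the dataset tag
    .*?                         # lazily skip ahead
    (\[Filename\b[^]]*])        # group 2: a filename tag
    [^[]*                       # text after the filename, with no other tag in between
    \[Key\ Data\ Delimiter\]    # opening key data delimiter
    (.*?)                       # group 3: the key data itself
    \[Key\ Data\ Delimiter]     # closing key data delimiter
""", re.VERBOSE)

# Hypothetical sample line, with the typo in the second dataset fixed.
sample = ('[Dataset 1] "text" [Filename 1] "text" [Filename 2] "text" '
          '[Key Data Delimiter] !key data! [Key Data Delimiter] "text" '
          '[Filename 3] "text" [Dataset 2] "text" [Filename 1] '
          '[Key Data Delimiter] key data [Key Data Delimiter] "text" '
          '[Filename 2] [Dataset 3]')

matches = PATTERN.findall(sample)
print(matches)
```

The lazy `.*?` together with `[^[]*` is what makes the regex skip [Filename 1] in the first dataset: after that filename the next tag is another filename rather than the delimiter, so the engine backtracks and tries [Filename 2].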

The result is purposely flattened to meet the desired output, but depending on the usage it might be better to keep the original 2-D structure (a list of tuples).
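If you would rather keep the 2-D structure, a minimal sketch could look like the following. The helper name extract_records and the sample string are my own (not from the question), and the sample again assumes the typo is fixed. Each tuple is one dataset, which makes it easy to print one line per dataset in the format you asked for.

```python
import re

PATTERN = re.compile(
    r'(\[Dataset\b[^][]*]).*?(\[Filename\b[^]]*])[^[]*'
    r'\[Key Data Delimiter\](.*?)\[Key Data Delimiter]'
)

def extract_records(text):
    """Return one (dataset, filename, key_data) tuple per matched dataset."""
    # strip() removes the spaces that surround the key data in the file
    return [(ds, fn, key.strip()) for ds, fn, key in PATTERN.findall(text)]

# Hypothetical sample line, with the typo in the second dataset fixed.
sample = ('[Dataset 1] "text" [Filename 1] "text" [Filename 2] "text" '
          '[Key Data Delimiter] !key data! [Key Data Delimiter] "text" '
          '[Filename 3] "text" [Dataset 2] "text" [Filename 1] '
          '[Key Data Delimiter] key data [Key Data Delimiter] "text" '
          '[Filename 2] [Dataset 3]')

records = extract_records(sample)
for dataset, filename, key_data in records:
    print(f'{dataset}, {filename}, {key_data}.')
```

Note that `.` does not match newlines by default, so if a dataset can span several lines you would need to read the whole file and pass re.DOTALL instead of matching line by line.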

BTW, your file.txt has a typo in the 2nd dataset: the closing delimiter is spelled Key Data Delimeter, so the regex above will not match that dataset until the spelling is corrected.
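If you cannot fix the file itself, one possible workaround is to let the pattern accept both spellings with a character class. This is just a sketch under that assumption; TOLERANT and sample_with_typo are illustrative names.

```python
import re

# Delim[ie]ter accepts both "Delimiter" and the misspelled "Delimeter".
TOLERANT = re.compile(
    r'(\[Dataset\b[^][]*]).*?(\[Filename\b[^]]*])[^[]*'
    r'\[Key Data Delim[ie]ter\](.*?)\[Key Data Delim[ie]ter]'
)

# The second dataset as written in the question, typo included.
sample_with_typo = ('[Dataset 2] "text" [Filename 1] [Key Data Delimiter] '
                    'key data [Key Data Delimeter] "text" [Filename 2]')

result = TOLERANT.findall(sample_with_typo)
print(result)
```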
