Extracting certain data from multiple .txt files using Python and RegEx

Question

I have several .txt files and I need to extract certain data from them. The files look similar, but each of them stores different data. Here is an example of such a file:

Start Date:        21/05/2016
Format:            TIFF
Resolution:        300dpi
Source:            X Company
...

There is more information in the text files, but I only need to extract the start date, the format and the resolution. The files are under the same parent directory ("E:\Images"), but each file is in its own folder, so I need a script that reads these files recursively. Here is my script so far:

# importing a library
import os

# defining location of parent folder (raw string keeps the backslash literal)
BASE_DIRECTORY = r'E:\Images'

# scanning through subfolders
for dirpath, dirnames, filenames in os.walk(BASE_DIRECTORY):
    for filename in filenames:
        # only read the .txt files
        if not filename.endswith('.txt'):
            continue
        # open the file by its full path, not just its name
        txtfile_full_path = os.path.join(dirpath, filename)
        start_date = data_format = resolution = None
        with open(txtfile_full_path, "r") as txtfile:
            for line in txtfile:
                if line.startswith('Start Date:'):
                    start_date = line.split()[-1]
                elif line.startswith('Format:'):
                    data_format = line.split()[-1]
                elif line.startswith('Resolution:'):
                    resolution = line.split()[-1]

        print(
            txtfile_full_path,
            start_date,
            data_format,
            resolution)

Ideally, Python would extract these values together with the name of each file and save everything in a text file. Because I don't have much experience with Python, I don't know how to proceed any further.

Answer 1

Here is the code I've used:

# importing libraries
import os

# defining location of parent folder (raw string keeps the backslash literal)
BASE_DIRECTORY = r'E:\Images'
output = {}
file_list = []

# scanning through sub folders and collecting the .txt files
for (dirpath, dirnames, filenames) in os.walk(BASE_DIRECTORY):
    for f in filenames:
        if f.endswith('.txt'):
            e = os.path.join(dirpath, f)
            file_list.append(e)

# keeping only the wanted lines of each file
for f in file_list:
    print(f)
    output[f] = []
    with open(f, 'r') as txtfile:
        for line in txtfile:
            if 'Start Date:' in line:
                output[f].append(line)
            elif 'Format' in line:
                output[f].append(line)
            elif 'Resolution' in line:
                output[f].append(line)

# writing the results, sorted by file path
with open('output.txt', 'w') as output_file:
    for tab in sorted(output):
        output_file.write(tab + '\n')
        output_file.write('\n')
        for row in output[tab]:
            output_file.write(row)
        output_file.write('\n')
        output_file.write('----------------------------------------------------------\n')

# keep the console window open until Enter is pressed
input()

Answer 2

You do not need regular expressions. You can use basic string functions:

txtfile = open(filename, "r")
for line in txtfile:
    if line.startswith("Start Date:"):
        start_date = line.split()[-1]
    ...

Break out of the loop once you have collected all the information.
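The idea above can be sketched as a small helper. This is only an illustration under assumptions: the function name `extract_fields` and the `WANTED` tuple are hypothetical, and the field labels are taken from the sample file in the question. It splits on the first colon rather than on whitespace, so a value containing spaces (such as "X Company") would survive intact.

```python
# Hypothetical field labels, copied from the sample file in the question.
WANTED = ("Start Date", "Format", "Resolution")


def extract_fields(lines, wanted=WANTED):
    """Return a dict of the wanted fields found in an iterable of lines."""
    found = {}
    for line in lines:
        for label in wanted:
            if label not in found and line.startswith(label + ":"):
                # Split on the first colon only, then trim the padding.
                found[label] = line.split(":", 1)[1].strip()
                break
        if len(found) == len(wanted):
            break  # everything collected, stop reading
    return found


sample = [
    "Start Date:        21/05/2016\n",
    "Format:            TIFF\n",
    "Resolution:        300dpi\n",
    "Source:            X Company\n",
]
fields = extract_fields(sample)
print(fields)
```

Because an open file object is itself an iterable of lines, it could be passed to the helper directly, for example `extract_fields(txtfile)`.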

Answer 3

To grab the start date, you can use the following regex:

^(?:Start Date:)\D*(\d+/\d+/\d+)$
# ^ anchors the regex to the start of the line
# (?:Start Date:) matches the string "Start Date:" in a non-capturing group
# \D* matches non-digits zero or more times
# (\d+/\d+/\d+) captures the start date in group 1
# $ anchors the regex to the end of the line

In Python this would be:

import re

regex = r"^(?:Start Date:)\D*(\d+/\d+/\d+)$"

# the variable line points to your line in the file
match = re.search(regex, line)
if match:
    # group 1 holds the date, e.g. "21/05/2016"
    start_date = match.group(1)

See a demo on regex101.
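The same approach could be extended to all three fields and combined with the recursive walk the question asks for. The following is a rough sketch, not a definitive implementation: the names `FIELD_RE`, `parse_text`, `collect` and `write_report` are hypothetical, the tab-separated output layout is an assumption, and the pattern is a variation on the one above that matches any of the three labels with `re.MULTILINE` so that `^` and `$` apply to each line of the whole file text.

```python
import os
import re

# One pattern for all three labels; re.MULTILINE makes ^ and $ match per line.
FIELD_RE = re.compile(
    r"^(Start Date|Format|Resolution):[ \t]*(.*?)[ \t]*$", re.MULTILINE
)
COLUMNS = ("Start Date", "Format", "Resolution")


def parse_text(text):
    """Return {label: value} for the fields found in the text of one file."""
    return dict(FIELD_RE.findall(text))


def collect(base_directory):
    """Walk base_directory and yield (path, fields) for every .txt file."""
    for dirpath, _dirnames, filenames in os.walk(base_directory):
        for filename in sorted(filenames):
            if filename.lower().endswith(".txt"):
                path = os.path.join(dirpath, filename)
                with open(path, "r") as handle:
                    yield path, parse_text(handle.read())


def write_report(base_directory, output_path):
    """Write one tab-separated line per file: path, date, format, resolution."""
    with open(output_path, "w") as out:
        for path, fields in collect(base_directory):
            row = [path] + [fields.get(column, "") for column in COLUMNS]
            out.write("\t".join(row) + "\n")
```

A call such as `write_report(r"E:\Images", r"E:\report.tsv")` would then produce the report; the output path here is only an example and is best kept outside the scanned folder (or given a non-.txt extension) so the report is not picked up by the walk.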

3 answers

Answer 1
1 accepted 2016-01-21 10:51:23

Answer 2
0 2016-01-19 13:31:35

Answer 3
0 2016-01-19 14:05:37