Extract certain data from multiple .txt files using Python and RegEx
I have several .txt files and I need to extract certain data from them. The files look similar, but each of them stores different data. Here is an example of such a file:
Start Date: 21/05/2016
Format: TIFF
Resolution: 300dpi
Source: X Company
...
There is more information in the text files, but I only need to extract the start date, the format and the resolution. The files are in the same parent directory ("E:\\Images"), but each file has its own folder. Therefore I need a script that reads these files recursively. Here is my script so far:
# importing a library
import os

# defining location of parent folder (raw string keeps the backslash literal)
BASE_DIRECTORY = r'E:\Images'

# scanning through subfolders
for dirpath, dirnames, filenames in os.walk(BASE_DIRECTORY):
    for filename in filenames:
        # defining file type: only .txt files are of interest
        if not filename.endswith('.txt'):
            continue
        # build the full path first, then open that path
        txtfile_full_path = os.path.join(dirpath, filename)
        start_date = data_format = resolution = None
        with open(txtfile_full_path, "r") as txtfile:
            for line in txtfile:
                if line.startswith('Start Date:'):
                    start_date = line.split()[-1]
                elif line.startswith('Format:'):
                    data_format = line.split()[-1]
                elif line.startswith('Resolution:'):
                    resolution = line.split()[-1]
        print(txtfile_full_path, start_date, data_format, resolution)
Ideally, Python would extract the data together with the name of each file and save it in a text file. Because I don't have much experience in Python, I don't know how to progress any further.
Here is the code I've used:
# importing libraries
import os

# defining location of parent folder (raw string keeps the backslash literal)
BASE_DIRECTORY = r'E:\Images'
output_file = open('output.txt', 'w')
output = {}
file_list = []

# scanning through sub folders and collecting the .txt files
for (dirpath, dirnames, filenames) in os.walk(BASE_DIRECTORY):
    for f in filenames:
        if f.endswith('.txt'):
            e = os.path.join(dirpath, f)
            file_list.append(e)

# keeping only the lines of interest from each file
for f in file_list:
    print(f)
    output[f] = []
    with open(f, 'r') as txtfile:
        for line in txtfile:
            if 'Start Date:' in line:
                output[f].append(line)
            elif 'Format' in line:
                output[f].append(line)
            elif 'Resolution' in line:
                output[f].append(line)

# writing one block per file, sorted by path
tabs = sorted(output)
for tab in tabs:
    output_file.write(tab + '\n')
    output_file.write('\n')
    for row in output[tab]:
        output_file.write(row)
    output_file.write('\n')
    output_file.write('----------------------------------------------------------\n')
output_file.close()

# keep the console window open until Enter is pressed
input()
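For comparison, the same walk-and-collect idea can be sketched as a single function, assuming Python 3. The name `collect_fields`, the `WANTED` tuple and the tab-separated output layout are illustrative choices, not part of the script above.

```python
import os

# Line prefixes to look for; taken from the sample file in the question.
WANTED = ("Start Date:", "Format:", "Resolution:")

def collect_fields(base_directory, output_path):
    """Walk base_directory and write one tab-separated line per .txt file."""
    rows = []
    for dirpath, _dirnames, filenames in os.walk(base_directory):
        for name in filenames:
            if not name.lower().endswith(".txt"):
                continue
            full_path = os.path.join(dirpath, name)
            found = {}
            with open(full_path, "r") as txtfile:
                for line in txtfile:
                    for prefix in WANTED:
                        if line.startswith(prefix):
                            # keep everything after the prefix, whitespace trimmed
                            found[prefix] = line[len(prefix):].strip()
            # missing fields are written as empty strings
            rows.append([full_path] + [found.get(p, "") for p in WANTED])
    rows.sort()
    with open(output_path, "w") as out:
        for row in rows:
            out.write("\t".join(row) + "\n")
    return rows
```

It could be called as `collect_fields(r"E:\Images", r"E:\output.txt")`. Keeping the output file outside the scanned directory avoids picking it up as an input on the next run.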
You do not need regular expressions. You can use basic string functions:
txtfile = open(filename, "r")
for line in txtfile:
    if line.startswith("Start Date:"):
        start_date = line.split()[-1]
    ...
Use break once you have collected all the information.
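A minimal sketch of this string-function approach, including the early exit, might look as follows. The function name `parse_fields` and the `sample` list are hypothetical; the prefixes come from the sample file in the question.

```python
def parse_fields(lines):
    """Return (start_date, data_format, resolution) from an iterable of lines."""
    start_date = data_format = resolution = None
    for line in lines:
        if line.startswith("Start Date:"):
            start_date = line.split()[-1]
        elif line.startswith("Format:"):
            data_format = line.split()[-1]
        elif line.startswith("Resolution:"):
            resolution = line.split()[-1]
        # stop reading as soon as all three values are known
        if None not in (start_date, data_format, resolution):
            break
    return start_date, data_format, resolution

sample = [
    "Start Date: 21/05/2016\n",
    "Format: TIFF\n",
    "Resolution: 300dpi\n",
    "Source: X Company\n",
]
print(parse_fields(sample))  # ('21/05/2016', 'TIFF', '300dpi')
```

Because an open file object is itself an iterable of lines, it can be passed to the function directly, for example `parse_fields(txtfile)`.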
To grab the Start Date, you can use the following regex:
^(?:Start Date:)\D*(\d+/\d+/\d+)$
# ^                 anchor the regex to the start of the line
# (?:Start Date:)   match the literal string "Start Date:" in a non-capturing group
# \D*               followed by zero or more non-digit characters
# (\d+/\d+/\d+)     followed by a capturing group holding the start date
# $                 anchor the regex to the end of the line
In Python this would be:
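import re

regex = r"^(?:Start Date:)\D*(\d+/\d+/\d+)$"

# the variable line holds one line read from your file, for example:
line = "Start Date: 21/05/2016\n"
match = re.search(regex, line)
if match:
    # group(1) is the captured date; do something useful with it here
    start_date = match.group(1)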
See a demo on regex101.
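The same regex idea can be extended to the other two fields. The sketch below is only an illustration: the `Format` and `Resolution` patterns are assumptions modelled on the sample file, and the names `PATTERNS` and `extract_fields` are hypothetical. It searches the whole file content at once, so `re.MULTILINE` is used to let `^` and `$` match at every line boundary.

```python
import re

# One compiled pattern per field. [ \t]* is used instead of \s* so that a
# match can never run across a line break.
PATTERNS = {
    "start_date": re.compile(r"^Start Date:[ \t]*(\d+/\d+/\d+)[ \t]*$", re.MULTILINE),
    "format": re.compile(r"^Format:[ \t]*(\S+)[ \t]*$", re.MULTILINE),
    "resolution": re.compile(r"^Resolution:[ \t]*(\d+[ \t]*dpi)[ \t]*$", re.MULTILINE),
}

def extract_fields(text):
    """Return a dict with the first match for each field, or None if absent."""
    result = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(text)
        result[name] = match.group(1) if match else None
    return result
```

With a file opened as `txtfile`, this could be used as `extract_fields(txtfile.read())`.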