Python regular expression to get filename in a long path
I need to analyze some log files that look like the sample below, and I would like to retrieve 3 parts of data from each line.
I use this regular expression, but it fails to get the third part:
(\d\d:\d\d:\d\d).*(ABC|DEF).*\\(\d\w\.?\w\..*)\soutput.*
Any suggestions will be appreciated.
08:38:36 TestModule - [INFO]result success !! ftp_site=ftp.test.com file_dir=CPY input file=\root\level1\level2-ABC\2C.013000000B.dat output file=c:\local\project1\data\2C.013000000B.dat.ext
06:40:37 TestModule - [INFO]result success !! ftp_site=ftp.test.com file_dir=CPY input file=\root\level1\level2-ABC\20100722B.TXT output file=c:\local\project1\data\20100722B.TXT.ext
06:40:39 TestModule - [INFO]result success !! ftp_site=ftp.test.com file_dir=CPY input file=\root\level1\level2-DEF\20100722D1-XYZ.TXT output file=c:\local\project1\data\20100722D1-YFP.TXT.ext
06:40:42 TestModule - [INFO]result success !! ftp_site=ftp.test.com file_dir=CPY input file=\root\level1\level2-DEF\2C.250B output file=c:\local\project1\data\2C.250B.ext
BR
Edward
Regular expressions are very good at solving problems like this one, i.e. parsing log file records. MarcoS's answer solves your immediate problem nicely. However, another approach is to write a (reusable) generalized function which decomposes a log file record into its various components and returns a match object containing all these parsed components. Once decomposed, tests can easily be applied to the component parts to check for various requirements (such as that the input file path must end in ABC or DEF). Here is a Python script which has just such a function, decomposeLogEntry(), and demonstrates how to use it to solve your problem at hand:
import re

def decomposeLogEntry(text):
    r""" Decompose log file entry into its various components.

    If text is a valid log entry, return regex match object of
    log entry components strings. Otherwise return None."""
    return re.match(r"""
        # Decompose log file entry into its various components.
        ^                             # Anchor to start of string
        (?P<time>\d\d:\d\d:\d\d)      # Capture: time
        \s+
        (?P<modname>\w+?)             # Capture module name
        \s-\s\[
        (?P<msgtype>[^]]+)            # Capture message type
        \]
        (?P<message>[^!]+)            # Capture message text
        !!\sftp_site=
        (?P<ftpsite>\S+?)             # Capture ftp URL
        \sfile_dir=
        (?P<filedir>\S+?)             # Capture file directory?
        \sinput\sfile=
        (?P<infile>                   # Capture input path and filename
          (?P<infilepath>\S+)\\       # Capture input file path
          (?P<infilename>[^\s\\]+)    # Capture input file filename
        )
        \soutput\sfile=
        (?P<outfile>                  # Capture output path and filename
          (?P<outfilepath>\S+)\\      # Capture output file path
          (?P<outfilename>[^\s\\]+)   # Capture output file filename
        )
        \s*                           # Optional whitespace at end.
        $                             # Anchor to end of string
        """, text, re.IGNORECASE | re.VERBOSE)

# Demonstrate decomposeLogEntry function. Print components of all log entries.
mcnt = 0
with open("testdata.log") as f:
    for line in f:
        # Decompose this line into its components.
        m = decomposeLogEntry(line)
        if m:
            mcnt += 1
            print("Match number %d" % mcnt)
            print(" Time: %s" % m.group("time"))
            print(" Module name: %s" % m.group("modname"))
            print(" Message type: %s" % m.group("msgtype"))
            print(" Message: %s" % m.group("message"))
            print(" FTP site URL: %s" % m.group("ftpsite"))
            print(" Input file: %s" % m.group("infile"))
            print(" Input file path: %s" % m.group("infilepath"))
            print(" Input file name: %s" % m.group("infilename"))
            print(" Output file: %s" % m.group("outfile"))
            print(" Output file path: %s" % m.group("outfilepath"))
            print(" Output file name: %s" % m.group("outfilename"))
            print()

# Next pick out only the desired data.
matches = []
with open("testdata.log") as f:
    for line in f:
        # Decompose this line into its components.
        m = decomposeLogEntry(line)
        if m:
            # See if this record meets desired requirements
            if re.search(r"ABC$|DEF$", m.group("infilepath")):
                matches.append(line)
print("There were %d matching records" % len(matches))

This function not only picks out the various parts you are interested in, it also validates the input and rejects badly formatted records. Once written and debugged, this function can be reused by other programs which need to analyze the log files for other requirements.
Here is the output from the script when applied to your test data:

r"""
Match number 1
 Time: 08:38:36
 Module name: TestModule
 Message type: INFO
 Message: result success
 FTP site URL: ftp.test.com
 Input file: \root\level1\level2-ABC\2C.013000000B.dat
 Input file path: \root\level1\level2-ABC
 Input file name: 2C.013000000B.dat
 Output file: c:\local\project1\data\2C.013000000B.dat.ext
 Output file path: c:\local\project1\data
 Output file name: 2C.013000000B.dat.ext

Match number 2
 Time: 06:40:37
 Module name: TestModule
 Message type: INFO
 Message: result success
 FTP site URL: ftp.test.com
 Input file: \root\level1\level2-ABC\20100722B.TXT
 Input file path: \root\level1\level2-ABC
 Input file name: 20100722B.TXT
 Output file: c:\local\project1\data\20100722B.TXT.ext
 Output file path: c:\local\project1\data
 Output file name: 20100722B.TXT.ext

Match number 3
 Time: 06:40:39
 Module name: TestModule
 Message type: INFO
 Message: result success
 FTP site URL: ftp.test.com
 Input file: \root\level1\level2-DEF\20100722D1-XYZ.TXT
 Input file path: \root\level1\level2-DEF
 Input file name: 20100722D1-XYZ.TXT
 Output file: c:\local\project1\data\20100722D1-YFP.TXT.ext
 Output file path: c:\local\project1\data
 Output file name: 20100722D1-YFP.TXT.ext

Match number 4
 Time: 06:40:42
 Module name: TestModule
 Message type: INFO
 Message: result success
 FTP site URL: ftp.test.com
 Input file: \root\level1\level2-DEF\2C.250B
 Input file path: \root\level1\level2-DEF
 Input file name: 2C.250B
 Output file: c:\local\project1\data\2C.250B.ext
 Output file path: c:\local\project1\data
 Output file name: 2C.250B.ext

There were 4 matching records
"""
If you use a regex tool, it will make troubleshooting your regex a lot easier. Try this free one; there are probably better ones, but this works great. You can paste your log file there and try your regex a little bit at a time, and it will highlight matches in real time.
Why regex?
Consider using split to get all words. This will give you the timestamp directly. Then go through all the other words, check whether there is a = in them, split those again, and there you have your paths and other parameters. Standard Python path handling (os.path) will help you get folder and file names.
Of course this approach fails if your path names may contain spaces, but otherwise it is definitely worth consideration.
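A minimal sketch of this split-based idea, assuming the log format shown in the question. The function name parse_line and the input_file / output_file keys are made up for illustration; because the key file appears twice on each line, the sketch prefixes it with the word that precedes it.

```python
def parse_line(line):
    """Split a log line on whitespace and collect its key=value words."""
    words = line.split()
    record = {"time": words[0]}  # the timestamp is always the first word
    previous = ""
    for word in words[1:]:
        if "=" in word:
            key, value = word.split("=", 1)
            if key == "file":
                # "input file=..." / "output file=...": use the previous
                # word to tell the two apart (hypothetical key names).
                key = previous + "_file"
            record[key] = value
        previous = word
    return record


sample = (r"08:38:36 TestModule - [INFO]result success !! "
          r"ftp_site=ftp.test.com file_dir=CPY "
          r"input file=\root\level1\level2-ABC\2C.013000000B.dat "
          r"output file=c:\local\project1\data\2C.013000000B.dat.ext")

record = parse_line(sample)
print(record["time"])
print(record["input_file"])
print(record["output_file"])
```

The resulting paths can then be handed to the path-handling functions to separate folder and file name.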
You can do it simply with normal string processing:
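with open("file") as f:
    for line in f:
        # Everything before "input" holds the timestamp.
        date, b = line.split("input")
        print("time:", date.split()[0])
        # Everything between "input" and "output" is the input path.
        input_path = b.split("output")[0].strip()
        tokens = input_path.split("\\")
        filename = tokens[-1]
        # Last directory looks like "level2-ABC"; keep the part after "-".
        directory = tokens[-2].split("-")[-1]
        print(filename, directory)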
This worked for your examples:
r'(\d\d:\d\d:\d\d).*(ABC|DEF).*?([^\\]*)\soutput.*'
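As a quick check, the pattern above can be run against the four sample lines from the question. This is only a sketch: the names PATTERN, LINES and results are chosen here for illustration, and the lines are embedded as raw strings rather than read from a file.

```python
import re

PATTERN = re.compile(r'(\d\d:\d\d:\d\d).*(ABC|DEF).*?([^\\]*)\soutput.*')

PREFIX = (r"TestModule - [INFO]result success !! "
          r"ftp_site=ftp.test.com file_dir=CPY ")

# The four sample log lines from the question.
LINES = [
    r"08:38:36 " + PREFIX
    + r"input file=\root\level1\level2-ABC\2C.013000000B.dat "
    + r"output file=c:\local\project1\data\2C.013000000B.dat.ext",
    r"06:40:37 " + PREFIX
    + r"input file=\root\level1\level2-ABC\20100722B.TXT "
    + r"output file=c:\local\project1\data\20100722B.TXT.ext",
    r"06:40:39 " + PREFIX
    + r"input file=\root\level1\level2-DEF\20100722D1-XYZ.TXT "
    + r"output file=c:\local\project1\data\20100722D1-YFP.TXT.ext",
    r"06:40:42 " + PREFIX
    + r"input file=\root\level1\level2-DEF\2C.250B "
    + r"output file=c:\local\project1\data\2C.250B.ext",
]

results = []
for line in LINES:
    m = PATTERN.match(line)
    if m:
        # groups() -> (time, ABC or DEF, input file name)
        results.append(m.groups())
        print(m.groups())
```

The third group works because [^\\]* can only match the run of non-backslash characters directly before " output", which is the input file name.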
Although a well-written regular expression is appropriate here, I would have approached this differently. Most specifically, os.path.split is designed to separate filenames from base paths, and deals with all the corner cases that this regular expression ignores.
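A sketch of that approach, assuming the paths contain no spaces. It uses ntpath, the Windows flavour of os.path, because on non-Windows systems os.path.split does not treat the backslash as a separator. The helper name split_input_path is hypothetical.

```python
import ntpath
import re


def split_input_path(line):
    """Return (directory, filename) of the input file, or None if absent."""
    m = re.search(r"input file=(\S+)", line)
    if m is None:
        return None
    # ntpath.split separates the last path component from the rest.
    return ntpath.split(m.group(1))


line = (r"06:40:39 TestModule - [INFO]result success !! "
        r"ftp_site=ftp.test.com file_dir=CPY "
        r"input file=\root\level1\level2-DEF\20100722D1-XYZ.TXT "
        r"output file=c:\local\project1\data\20100722D1-YFP.TXT.ext")

directory, filename = split_input_path(line)
print(directory)
print(filename)
# The ABC/DEF requirement becomes a plain string test on the directory.
print(directory.endswith(("ABC", "DEF")))
```

Only the location of the path within the line is left to a (much simpler) regular expression; the splitting itself is delegated to the standard library.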