
Python regular expression to get filename in a long path

I need to analyze some log files that look like the sample below. I would like to retrieve three pieces of data:

  1. Time
  2. Part of the directory name; in this case, ABC and DEF in the input file path.
  3. The file name in the input file path, i.e. 2C.013000000B.dat, 20100722B.TXT, 20100722D1-XYZ.TXT and 2C.250B in this case.

I use this regular expression, but it fails to capture the third part.

(\d\d:\d\d:\d\d).*(ABC|DEF).*\\(\d\w\.?\w\..*)\soutput.*

Any suggestions will be appreciated.

08:38:36   TestModule - [INFO]result success !! ftp_site=ftp.test.com file_dir=CPY input file=\root\level1\level2-ABC\2C.013000000B.dat output file=c:\local\project1\data\2C.013000000B.dat.ext
06:40:37   TestModule - [INFO]result success !! ftp_site=ftp.test.com file_dir=CPY input file=\root\level1\level2-ABC\20100722B.TXT output file=c:\local\project1\data\20100722B.TXT.ext
06:40:39   TestModule - [INFO]result success !! ftp_site=ftp.test.com file_dir=CPY input file=\root\level1\level2-DEF\20100722D1-XYZ.TXT output file=c:\local\project1\data\20100722D1-YFP.TXT.ext
06:40:42   TestModule - [INFO]result success !! ftp_site=ftp.test.com file_dir=CPY input file=\root\level1\level2-DEF\2C.250B output file=c:\local\project1\data\2C.250B.ext

BR

Edward

Using split is a good idea. If you really want a regex, I would do it like this:

(\d\d:\d\d:\d\d).*?input file=.*?(ABC|DEF)\\(.*?)\soutput

Test it here
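As a minimal sketch of how such a pattern could be applied in Python, assuming the log records are available as strings (the names `PATTERN`, `LOG_LINES` and `extract` are illustrative and not part of the original answer). The pattern is written as a raw string, where `\\` matches the single literal backslash that separates the directory from the file name:

```python
import re

# Pattern from the answer, as a raw string: `\\` matches one literal backslash.
PATTERN = re.compile(
    r"(\d\d:\d\d:\d\d).*?input file=.*?(ABC|DEF)\\(.*?)\soutput"
)

# Two sample records copied from the question.
LOG_LINES = [
    r"08:38:36   TestModule - [INFO]result success !! ftp_site=ftp.test.com file_dir=CPY input file=\root\level1\level2-ABC\2C.013000000B.dat output file=c:\local\project1\data\2C.013000000B.dat.ext",
    r"06:40:42   TestModule - [INFO]result success !! ftp_site=ftp.test.com file_dir=CPY input file=\root\level1\level2-DEF\2C.250B output file=c:\local\project1\data\2C.250B.ext",
]

def extract(line):
    """Return (time, directory tag, file name), or None if the line does not match."""
    m = PATTERN.search(line)
    return m.groups() if m else None

for line in LOG_LINES:
    print(extract(line))
```

Anchoring on the literal text `input file=` keeps the lazy quantifiers from wandering into the output path, which contains the same file name.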

Regular expressions are very good at solving problems like this one, i.e. parsing log file records. MarcoS's answer solves your immediate problem nicely. However, another approach is to write a (reusable) generalized function which decomposes a log file record into its various components and returns a match object containing all these parsed components. Once decomposed, tests can easily be applied to the component parts to check for various requirements (for example, that the input file path must end in ABC or DEF). Here is a Python script which has just such a function, decomposeLogEntry(), and demonstrates how to use it to solve your problem at hand:

import re
def decomposeLogEntry(text):
    r""" Decompose log file entry into its various components.

    If text is a valid log entry, return regex match object of
    log entry components strings. Otherwise return None."""
    return re.match(r"""
        # Decompose log file entry into its various components.
        ^                            # Anchor to start of string
        (?P<time>\d\d:\d\d:\d\d)     # Capture: time
        \s+
        (?P<modname>\w+?)            # Capture module name
        \s-\s\[
        (?P<msgtype>[^]]+)           # Capture message type
        \]
        (?P<message>[^!]+)           # Capture message text
        !!\sftp_site=
        (?P<ftpsite>\S+?)            # Capture ftp URL
        \sfile_dir=
        (?P<filedir>\S+?)            # Capture file directory?
        \sinput\sfile=
        (?P<infile>                  # Capture input path and filename
          (?P<infilepath>\S+)\\      # Capture input file path
          (?P<infilename>[^\s\\]+)   # Capture input file filename
        )
        \soutput\sfile=
        (?P<outfile>                 # Capture output path and filename
          (?P<outfilepath>\S+)\\     # Capture output file path
          (?P<outfilename>[^\s\\]+)  # Capture output file filename
        )
        \s*                          # Optional whitespace at end.
        $                            # Anchor to end of string
        """, text, re.IGNORECASE | re.VERBOSE)

# Demonstrate decomposeLogEntry function. Print components of all log entries.
mcnt = 0
with open("testdata.log") as f:
    for line in f:
        # Decompose this line into its components.
        m = decomposeLogEntry(line)
        if m:
            mcnt += 1
            print("Match number %d" % mcnt)
            print("  Time:             %s" % m.group("time"))
            print("  Module name:      %s" % m.group("modname"))
            print("  Message type:     %s" % m.group("msgtype"))
            print("  Message:          %s" % m.group("message"))
            print("  FTP site URL:     %s" % m.group("ftpsite"))
            print("  Input file:       %s" % m.group("infile"))
            print("  Input file path:  %s" % m.group("infilepath"))
            print("  Input file name:  %s" % m.group("infilename"))
            print("  Output file:      %s" % m.group("outfile"))
            print("  Output file path: %s" % m.group("outfilepath"))
            print("  Output file name: %s" % m.group("outfilename"))
            print()

# Next pick out only the desired data.
matches = []
with open("testdata.log") as f:
    for line in f:
        # Decompose this line into its components.
        m = decomposeLogEntry(line)
        # Keep only records whose input file path ends in ABC or DEF.
        if m and re.search(r"ABC$|DEF$", m.group("infilepath")):
            matches.append(line)
print("There were %d matching records" % len(matches))

This function not only picks out the various parts you are interested in, it also validates the input and rejects badly formatted records. Once written and debugged, it can be reused by other programs that need to analyze the log files for other requirements.

Here is the output from the script when applied to your test data:

r"""
Match number 1
  Time:             08:38:36
  Module name:      TestModule
  Message type:     INFO
  Message:          result success
  FTP site URL:     ftp.test.com
  Input file:       \root\level1\level2-ABC\2C.013000000B.dat
  Input file path:  \root\level1\level2-ABC
  Input file name:  2C.013000000B.dat
  Output file:      c:\local\project1\data\2C.013000000B.dat.ext
  Output file path: c:\local\project1\data
  Output file name: 2C.013000000B.dat.ext

Match number 2
  Time:             06:40:37
  Module name:      TestModule
  Message type:     INFO
  Message:          result success
  FTP site URL:     ftp.test.com
  Input file:       \root\level1\level2-ABC\20100722B.TXT
  Input file path:  \root\level1\level2-ABC
  Input file name:  20100722B.TXT
  Output file:      c:\local\project1\data\20100722B.TXT.ext
  Output file path: c:\local\project1\data
  Output file name: 20100722B.TXT.ext

Match number 3
  Time:             06:40:39
  Module name:      TestModule
  Message type:     INFO
  Message:          result success
  FTP site URL:     ftp.test.com
  Input file:       \root\level1\level2-DEF\20100722D1-XYZ.TXT
  Input file path:  \root\level1\level2-DEF
  Input file name:  20100722D1-XYZ.TXT
  Output file:      c:\local\project1\data\20100722D1-YFP.TXT.ext
  Output file path: c:\local\project1\data
  Output file name: 20100722D1-YFP.TXT.ext

Match number 4
  Time:             06:40:42
  Module name:      TestModule
  Message type:     INFO
  Message:          result success
  FTP site URL:     ftp.test.com
  Input file:       \root\level1\level2-DEF\2C.250B
  Input file path:  \root\level1\level2-DEF
  Input file name:  2C.250B
  Output file:      c:\local\project1\data\2C.250B.ext
  Output file path: c:\local\project1\data
  Output file name: 2C.250B.ext

There were 4 matching records
"""

If you use a regex tool, troubleshooting a regex becomes a lot easier. Try this free one; there are probably better ones, but this works great. You can paste your log file there and try your regex a little bit at a time, and it will highlight matches in real time.

Why regex?

Consider using split to get all words. This will give you the timestamp directly. Then go through all the other words and check whether they contain a =; if so, split them again and you have your paths and other parameters nicely separated. Standard Python path handling (os.path) will help you get folder and file names.

Of course this approach fails if your path names may contain spaces, but otherwise it is definitely worth consideration.
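The split approach described above might look roughly like this. It is only a sketch: `parse_record` and `SAMPLE` are hypothetical names, and it assumes every record is non-empty, starts with the timestamp, and has no spaces inside its paths.

```python
def parse_record(line):
    """Split a log line into its timestamp and its key=value parameters."""
    words = line.split()
    timestamp = words[0]  # the first word is the time
    # Keep the pairs as a list: the key "file" occurs twice
    # ("input file=..." and "output file=..."), so a dict would lose one.
    params = [tuple(w.split("=", 1)) for w in words[1:] if "=" in w]
    return timestamp, params

# One sample record copied from the question.
SAMPLE = r"06:40:39   TestModule - [INFO]result success !! ftp_site=ftp.test.com file_dir=CPY input file=\root\level1\level2-DEF\20100722D1-XYZ.TXT output file=c:\local\project1\data\20100722D1-YFP.TXT.ext"

timestamp, params = parse_record(SAMPLE)
# The first "file" entry is the input path, the second the output path.
input_path, output_path = [value for key, value in params if key == "file"]
print(timestamp)
print(input_path)
```

Because "input file=" and "output file=" both contain a space, splitting on whitespace leaves two words with the key `file`; the sketch relies on their order to tell them apart.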

You can do it simply with ordinary string processing:

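with open("file") as f:
    for line in f:
        # Everything before "input" starts with the timestamp.
        date, b = line.split("input")
        print("time:", date.split()[0])
        # The text between "input" and "output" holds the input path.
        input_path = b.split("output")[0].strip()
        tokens = input_path.split("\\")
        filename = tokens[-1]
        directory = tokens[-2].split("-")[-1]
        print(filename, directory)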

This worked for your examples:

r'(\d\d:\d\d:\d\d).*(ABC|DEF).*?([^\\]*)\soutput.*'

Although a well-written regular expression is appropriate here, I would have approached this differently. Specifically, os.path.split is designed to separate filenames from base paths, and it deals with all the corner cases that this regular expression ignores.
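One caveat worth illustrating: os.path follows the path rules of the operating system the script runs on, so on Linux or macOS it does not treat backslashes as separators. The sketch below therefore uses ntpath, the Windows flavour of os.path, which is importable on any platform. The helper name `split_input_path` is hypothetical, and it assumes the last directory component has the form `level2-ABC`, as in the sample log.

```python
import ntpath  # Windows path rules; usable on any platform

def split_input_path(path):
    """Return (directory tag, file name) for a backslash-separated path."""
    directory, filename = ntpath.split(path)
    # The last directory component looks like "level2-ABC";
    # the part after the final "-" is the tag the question asks for.
    tag = ntpath.basename(directory).rsplit("-", 1)[-1]
    return tag, filename

print(split_input_path(r"\root\level1\level2-ABC\2C.013000000B.dat"))
```

The input path itself still has to be isolated from the log line first, for instance with one of the split or regex approaches in the other answers.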
