
I have a text file with multiple lines. How can I extract a portion of each line using regex in Python?

The line input is like this:

-rw-r--r-- 1 jttoivon hyad-all   25399 Nov  2 21:25 exception_hierarchy.pdf

Required output is:

25399 Nov  2 21:25 exception_hierarchy.pdf

These are the size, month, day, hour, minute and filename, respectively.

The exercise asks for a list of tuples (size, month, day, hour, minute, filename), built with regular expressions (using one of the match, search, findall, or finditer methods).

This is the code I tried:

for line in range(1):
    line = f.readline()
    x = re.findall(r'[^-]\d+\w+:\w+.*\w+_*', line)
    print(x)

My output: [' 21:25 add_colab_link.py']

Please read the following example of how to ask great questions: How to make a great R reproducible example

I am answering your question because not long ago I made the same mistakes myself, and I was glad when someone still answered.

import re  # import of regular expression library

# I just assume you had three of those pieces in one list:
my_list = ["-rw-r--r-- 1 jttoivon hyad-all 12345 Nov 2 21:25 exception_hierarchy.pdf", "-rw-r--r-- 1 jttoivon hyad-all 25399 Nov 2 21:25 exception_hierarchy.pdf", "-rw-r--r-- 1 jttoivon hyad-all 98765 Nov 2 21:25 exception_hierarchy.pdf"]

# I create a new list to store the results in
new_list = []

# Loop over every line in the list:
for x in my_list:
    # Capture from the size through ".pdf" (assumes a five-digit size and a .pdf file)
    y = re.findall(r"([0-9]{5}.+\.pdf)", x)
    for thing in y:
        # split() with no argument also copes with runs of spaces (e.g. "Nov  2")
        a, b, c, d, e = thing.split()
        g, h = d.split(":")  # "21:25" -> hour, minute
        z = (a, b, c, g, h, e)
        new_list.append(z)

print(new_list)
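
One detail worth checking when splitting these lines, shown as a small sketch: the listing line in the question contains runs of spaces (such as `Nov  2`), and `split(" ")` and `split()` treat them differently. The sample string is the required output copied from the question.

```python
line = "25399 Nov  2 21:25 exception_hierarchy.pdf"  # two spaces after "Nov"

# Splitting on a single space yields an empty string for each extra space.
by_space = line.split(" ")
print(by_space)

# Splitting with no argument treats any run of whitespace as one separator.
by_whitespace = line.split()
print(by_whitespace)
```

With the extra empty string, unpacking `by_space` into five names would raise a ValueError, while `by_whitespace` has exactly five items.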

Here's a working example using regular expressions, via the re module:

>>> import re
>>> line = "-rw-r--r-- 1 jttoivon hyad-all   25399 Nov  2 21:25 exception_hierarchy.pdf"
>>> pattern = r"(\d+)\s+([A-Za-z]+)\s+(\d{1,2})\s+(\d{1,2}):(\d{1,2})\s+(.+)$"
>>> output_tuple = re.findall(pattern, line)[0]
>>> print(output_tuple)
('25399', 'Nov', '2', '21', '25', 'exception_hierarchy.pdf')
>>> size, month, day, hour, minute, filename = output_tuple

Most of the logic is encoded in the raw-string pattern. It is easy to follow if you read it piece by piece, shown here one piece per line:

(\d+)        # one or more digits (size)
\s+          # one or more whitespace characters
([A-Za-z]+)  # one or more letters (month)
\s+          # one or more whitespace characters
(\d{1,2})    # one or two digits (day)
\s+          # one or more whitespace characters
(\d{1,2})    # one or two digits (hour)
:            # a literal ':'
(\d{1,2})    # one or two digits (minute)
\s+          # one or more whitespace characters
(.+)         # one or more of any character (filename)
$            # end of the string
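
To get the list of tuples the question asks for, the piece-by-piece breakdown above can be kept inside the pattern itself with `re.VERBOSE` and applied to each line in turn. This is a minimal sketch: the names `LS_PATTERN` and `parse_listing` and the sample lines are made up for illustration, and the month is matched with `[A-Za-z]` (letters only).

```python
import re

# The pattern is written with re.VERBOSE, so whitespace and "#" comments
# inside it are ignored by the regex engine.
LS_PATTERN = re.compile(
    r"""
    (\d+)\s+                 # size
    ([A-Za-z]+)\s+           # month
    (\d{1,2})\s+             # day
    (\d{1,2}):(\d{1,2})\s+   # hour:minute
    (.+)$                    # filename, up to the end of the line
    """,
    re.VERBOSE,
)

def parse_listing(lines):
    """Return a list of (size, month, day, hour, minute, filename) tuples."""
    result = []
    for line in lines:
        match = LS_PATTERN.search(line.rstrip("\n"))
        if match:  # skip lines that do not look like a listing entry
            result.append(match.groups())
    return result

sample = [
    "-rw-r--r-- 1 jttoivon hyad-all   25399 Nov  2 21:25 exception_hierarchy.pdf\n",
    "total 12\n",  # hypothetical non-matching line
]
parsed = parse_listing(sample)
print(parsed)
```

Because `parse_listing` accepts any iterable of strings, an open file object can be passed to it directly.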

By the way, here's a working example that doesn't use re (it isn't strictly needed for this line format). Note that it relies on every line containing the group name hyad-all:

>>> line = "-rw-r--r-- 1 jttoivon hyad-all   25399 Nov  2 21:25 exception_hierarchy.pdf"
>>> size, month, day, hour_minute, filename = line.split("hyad-all")[1].strip().split()
>>> hour, minute = hour_minute.split(":")
>>> print(size, month, day, hour, minute, filename)
25399 Nov 2 21 25 exception_hierarchy.pdf
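
The split-based approach can also be written without depending on the literal group name. The sketch below assumes every line has the nine whitespace-separated columns of the sample line (permissions, links, owner, group, size, month, day, time, filename) and that the time column is always in `HH:MM` form. The function name `parse_line` is hypothetical.

```python
def parse_line(line):
    """Split a listing line into (size, month, day, hour, minute, filename)."""
    # maxsplit=8 keeps everything after the eighth field together, so a
    # file name containing spaces is not broken apart.
    fields = line.split(maxsplit=8)
    if len(fields) != 9:
        raise ValueError(f"unexpected line format: {line!r}")
    _perms, _links, _owner, _group, size, month, day, hour_minute, filename = fields
    hour, minute = hour_minute.split(":")
    return size, month, day, hour, minute, filename.rstrip("\n")

line = "-rw-r--r-- 1 jttoivon hyad-all   25399 Nov  2 21:25 exception_hierarchy.pdf"
result = parse_line(line)
print(result)
```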
