
Remove substring from each line of text file with regex

Text file ( file.txt ) looks like this:

First line.
2. Second line 
03 Third line
04. Fourth line
5. Line. 
6 Line

The desired output 1) strips the number at the beginning of each line and 2) removes the punctuation that follows that number:

First line.
Second line
Third line
Fourth line
Line.
Line

I tried:

import re
file=open("file.txt").read().split()
print([i for i in file if re.sub("[0-9]\.*", "", i)])

But I get results only on word level instead of line level:

['First', 'line.', 'Second', 'line', 'Third', 'line', 'Fourth', 'line', 'Line.', 'Line']

You do not need to call the re module inside a for loop. Regular expressions are flexible, and re can also process the whole text at once in multiline mode. For example:

>>> with open('/tmp/file.txt', 'r') as f:
...     s = f.read()
>>> # or use direct value to test in the Python console:
>>> s = """First line.
... 2. Second line
... 03 Third line
... 04. Fourth line
... 5. Line.
... 6 Line"""

>>> s
'First line.\n2. Second line \n03 Third line\n04. Fourth line\n5. Line. \n6 Line'

>>> import re

>>> re.sub(r'[0-9\.\s]*(.*)', r'\1\n', s, flags=re.M)
'First line.\nSecond line \nThird line\nFourth line\nLine. \nLine\n'

>>> re.sub(r'^[0-9\.\s]*(.*)', r'\1', s, flags=re.M)
'First line.\nSecond line \nThird line\nFourth line\nLine. \nLine'
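To make the role of the multiline flag concrete, here is a minimal sketch comparing the same substitution with and without re.M. The name LEADING_NUMBER and the use of [ \t]* instead of \s* are assumptions made for this illustration, not part of the answer above.

```python
import re

# Sample text from the question, used here instead of reading file.txt.
s = "First line.\n2. Second line\n03 Third line\n04. Fourth line\n5. Line.\n6 Line"

# Hypothetical name: digits, an optional dot, then spaces or tabs.
# [ \t]* is used instead of \s* so the pattern never consumes a newline.
LEADING_NUMBER = r"^[0-9]+\.?[ \t]*"

# Without re.M, ^ matches only at the very start of the string,
# and the first line has no leading number, so nothing changes.
without_flag = re.sub(LEADING_NUMBER, "", s)

# With re.M, ^ also matches right after every newline.
with_flag = re.sub(LEADING_NUMBER, "", s, flags=re.M)

print(without_flag == s)
print(with_flag)
```

Without the flag the text comes back unchanged; with it, every numbered line loses its prefix in a single call.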

You may fix your current code using

import re
with open("file.txt") as f:
    for line in f:
        # Drop the trailing newline, then a leading number, optional dot, and whitespace
        print(re.sub(r"^[0-9]+\.?\s*", "", line.rstrip("\n")))


You need to open the file and read it line by line. The pattern ^[0-9]+\.?\s* then matches 1 or more digits ( [0-9]+ ) followed by an optional . ( \.? ) and then 0 or more whitespace characters ( \s* ) at the start of each line, and re.sub removes the match if one is found.
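The following sketch shows which part of each sample line the pattern actually matches. It works on the question's lines held in a list rather than a file, and the names leading_number, samples, removed, and cleaned are chosen here for illustration only.

```python
import re

# Compiled once so the same pattern is reused for every line.
leading_number = re.compile(r"^[0-9]+\.?\s*")

samples = [
    "First line.",
    "2. Second line",
    "03 Third line",
    "04. Fourth line",
    "5. Line.",
    "6 Line",
]

removed = []
cleaned = []
for line in samples:
    m = leading_number.match(line)
    # group(0) is the matched prefix; lines without a number give no match.
    removed.append(m.group(0) if m else "")
    cleaned.append(leading_number.sub("", line))

for prefix, rest in zip(removed, cleaned):
    print(repr(prefix), "->", repr(rest))
```

The first line has no match, so it passes through untouched and keeps its trailing period.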

The split in this line

file=open("file.txt").read().split()

splits the file on any whitespace, so newlines are treated like spaces and you get a list of words. Use

file=open("file.txt").read().split("\n")

instead to split the file by lines.
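As a rough sketch of how the original list comprehension could be repaired along these lines: in the attempt, re.sub sits in the if clause, so it only filters out tokens that become empty (such as "2." and "03") and never changes the items that are kept. Below, re.sub is the collected expression instead. The use of str.splitlines() in place of split("\n"), the final rstrip(), and the in-memory text standing in for file.txt are assumptions of this example.

```python
import re

# Stand-in for open("file.txt").read(), so the sketch runs without the file.
text = "First line.\n2. Second line \n03 Third line\n04. Fourth line\n5. Line. \n6 Line\n"

words = text.split()        # splits on any whitespace: a list of words
lines = text.splitlines()   # splits on line breaks only: a list of lines

# re.sub is the value being collected, not the filter condition.
# rstrip() also drops the trailing spaces present on some lines.
cleaned = [re.sub(r"^[0-9]+\.?\s*", "", line).rstrip() for line in lines]

print(words[:3])
print(cleaned)
```

Unlike split("\n"), splitlines() does not produce a trailing empty string when the text ends with a newline.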

Another option is:

import re
f = """First line.
2. Second line
03 Third line
04. Fourth line
5. Line.
6 Line"""
print(re.sub(r"(\d{1,2}\.{,1}\s)", "", f))

it returns:

First line.
Second line
Third line
Fourth line
Line.
Line

This approach does not have to loop through each line.
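One caveat worth checking before relying on this pattern: it is not anchored to the start of a line, so a one- or two-digit number followed by whitespace is removed wherever it occurs. The sketch below uses a made-up two-line text (not from the question) to compare it with an anchored variant under re.M; the anchored pattern is an assumption of this example, not the answerer's code.

```python
import re

# Hypothetical text with numbers in the middle of lines, not from the question.
text = "1. Chapter 12 begins\n2. See page 7 now"

# Pattern from the answer: matches anywhere in the text.
unanchored = re.sub(r"\d{1,2}\.{,1}\s", "", text)

# Anchored variant: with re.M, ^ restricts the match to the start of each line.
anchored = re.sub(r"^\d{1,2}\.?\s", "", text, flags=re.M)

print(unanchored)
print(anchored)
```

For the sample file in the question both patterns give the same result, because no line there contains a number after its first word.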
