Delete x-line paragraphs from text file with Python

I have a long text file made up of paragraphs of 6 or 7 lines each. I need to write all the seven-line paragraphs to one file and all the six-line paragraphs to another. Alternatively, deleting the 6-line (or 7-line) paragraphs would also do. Paragraphs are separated by one blank line (or two). Example of the text file:

Firs Name Last Name
address1
Address2
Note 1
Note 2
Note3
Note 4

First Name LastName
add 1
add 2
Note2
Note3
Note4

etc...

I want to use Python 3 on Windows. Any help is welcome. Thanks!

As a welcome to Stack Overflow, and because I assume you have by now searched for a solution yourself, I propose the following code.

It checks that each paragraph has no more than 7 lines and no fewer than 6, and it warns when the source contains paragraphs outside that range.

You can remove all the prints to get clean code, but with them you can follow the algorithm.

I think there is no bug in it, but don't take that as 100% certain.

It isn't the only way to do it, but I chose an approach that works for files of any size: iterating one line at a time. The whole file could instead be read in one pass and then split into a list of lines, or processed with regexes; however, when a file is enormous, reading it all at once consumes a lot of memory.

with open('source.txt') as fsource,\
     open('SIX.txt','w') as six,  open('SEVEN.txt','w') as seven:

    buf = []
    cnt = 0
    exceeding7paragraphs = 0
    tinyparagraphs = 0

    line = 'go'
    while line:
        line = fsource.readline()
        cnt += 1
        buf.append(line)

        # a void line (or EOF) among the first 6 lines read means that
        # the paragraph holds 5 lines at most
        if len(buf)<=6 and line.rstrip('\n\r')=='':
            if len(buf)>1: # a lone void line isn't a paragraph
                tinyparagraphs += 1
            print(cnt,repr(line),"this line of paragraph < 6 is void,"+
                  "\nthe treatment of all this paragraph is skipped\n"+
                  '\n# '+str(cnt)+' '+ repr(line)+" skipped line ")
            buf = []
            while line and line.rstrip('\n\r')=='':
                line = fsource.readline()
                cnt += 1
                if line=='':
                    print("line",cnt,"is '' , EOF -> the program will be stopped")
                elif line.rstrip('\n\r')=='':
                    print('#',cnt,repr(line))
                else:
                    buf.append(line)
                    print('!',cnt,repr(line),' put in void buf')
        else:
            print(cnt,repr(line),' put in buf')

        if len(buf)==6:
            line = fsource.readline() # reading a potential seventh line of a paragraph
            cnt += 1

            if line.rstrip('\n\r'): # means the content of the seventh line isn't void
                buf.append(line)
                print(cnt,repr(line),'seventh line put in buf')
                line = fsource.readline()
                cnt += 1

                if line.rstrip('\n\r'): # means the content of the eighth line isn't void
                    exceeding7paragraphs += 1
                    print(cnt,repr(line),"the eighth line isn't void,"+
                          "\nthe treatment of all this paragraph is skipped"+
                          "\neighth line skipped")
                    buf = []
                    while line and line.rstrip('\n\r'):
                        line = fsource.readline()
                        cnt += 1
                        if line=='':
                            print("line",cnt,"is '' , EOF -> the program will be stopped")
                        elif line.rstrip('\n\r')=='':
                            print('\n#',cnt,repr(line))
                        else:
                            print(str(cnt) + ' ' + repr(line)+' skipped line')

                else:
                    if line=='':
                        print(cnt,"line is '' , EOF -> the program will be stopped\n")
                    else: # line.rstrip('\n\r') is ''
                        print(cnt,'eighth line is void',repr(line))
                    seven.write(''.join(buf) + '\n')
                    print(buf,'\n',len(buf),'lines recorded in file SEVEN\n')
                    buf = []

            else:
                print(cnt,repr(line),'seventh line: void')
                six.write(''.join(buf) + '\n')
                print(buf,'\n',len(buf),'lines recorded in file SIX')
                buf = []
                if line=='':
                    print("line",cnt,"is '' , EOF -> the program will be stopped")
                else:
                    print('\nthe line is',cnt, repr(line))

            # skipping the void lines up to the first line of the next paragraph
            while line and line.rstrip('\n\r')=='':
                line = fsource.readline()
                cnt += 1
                if line=='':
                    print("line",cnt,"is '' , EOF -> the program will be stopped")
                elif line.rstrip('\n\r')=='':
                    print('#',cnt,repr(line))
                else: # line.rstrip('\n\r') != ''
                    buf.append(line)
                    print('!',cnt,repr(line),' put in void buf')

if exceeding7paragraphs>0:
    print('\nWARNING :'+
          '\nThere are '+str(exceeding7paragraphs)+' paragraphs whose number of lines exceeds 7.')

if tinyparagraphs>0:
    print('\nWARNING :'+
          '\nThere are '+str(tinyparagraphs)+' paragraphs whose number of lines is less than 6.')


print('\n====================================================================')
print('File SIX\n')
with open('SIX.txt') as six:
    print(six.read())


print('====================================================================')
print('File SEVEN\n')
with open('SEVEN.txt') as seven:
    print(seven.read())
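
For comparison, here is a rough sketch of the other approach mentioned earlier: reading the whole file in one pass and splitting it with a regex. It is only reasonable when the file fits comfortably in memory. The function names (split_paragraphs, group_by_length, write_paragraphs, split_file) are made up for this illustration; only the file names SIX.txt and SEVEN.txt come from the code above.

```python
import re

def split_paragraphs(text):
    """Return the paragraphs of text as lists of lines.

    A paragraph is a run of non-blank lines; paragraphs are separated
    by one or more blank (or whitespace-only) lines.
    """
    # Normalise Windows and old Mac line endings to '\n'
    text = text.replace('\r\n', '\n').replace('\r', '\n')
    blocks = re.split(r'\n(?:[ \t]*\n)+', text.strip())
    return [block.split('\n') for block in blocks if block]

def group_by_length(paragraphs):
    """Map each paragraph length to the list of paragraphs of that length."""
    groups = {}
    for para in paragraphs:
        groups.setdefault(len(para), []).append(para)
    return groups

def write_paragraphs(paragraphs, path):
    """Write paragraphs to path, one blank line after each."""
    with open(path, 'w') as out:
        for para in paragraphs:
            out.write('\n'.join(para) + '\n\n')

def split_file(source_path, six_path='SIX.txt', seven_path='SEVEN.txt'):
    """Hypothetical helper: copy 6- and 7-line paragraphs to two files."""
    with open(source_path) as src:
        groups = group_by_length(split_paragraphs(src.read()))
    write_paragraphs(groups.get(6, []), six_path)
    write_paragraphs(groups.get(7, []), seven_path)
    return groups
```

Calling split_file('source.txt') would fill the two output files; the returned dict also holds the paragraphs of any other length, so they can be reported instead of silently dropped.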

I also upvoted your question because the problem is not as easy to solve as it seems, and so as not to leave you with one post and one downvote, which is demoralizing for a beginning. Try to present your question better next time, as others said.

EDIT:

Here is simplified code for a text containing only paragraphs of exactly 6 or 7 lines, separated by exactly 1 or 2 blank lines, as stated in the question.

with open('source2.txt') as fsource,\
     open('SIX.txt','w') as six,  open('SEVEN.txt','w') as seven:

    buf = []

    line = fsource.readline()
    while line and not line.rstrip('\r\n'): # to go to the first non empty line
        line = fsource.readline()


    while line: # nothing is done if the file holds no paragraph
        buf.append(line) # this line is the first of a paragraph
        print('\n- first line of a paragraph',repr(line))

        for i in range(5):
            buf.append(fsource.readline())
        # at this point , 6 lines of a paragraph have been read
        print('-- buf 6 : ',buf)

        line = fsource.readline()
        print('--- line seventh',repr(line),id(line))

        if line.rstrip('\r\n'):
            buf.append(line)
            seven.write(''.join(buf) + '\n')
            buf = []
            line = fsource.readline()
        else:
            six.write(''.join(buf) + '\n')
            buf = []
        # at this point, line is the empty line after a paragraph or EOF
        print('---- line after',repr(line),id(line))

        line = fsource.readline()
        print('----- second line after',repr(line))
        # at this point, line is an empty line after a paragraph or EOF
        # or the first line of a new paragraph

        if not line: # it is EOF
            break
        if not line.rstrip('\r\n'): # it is a second empty line
            line = fsource.readline()
        # now line is the first of a new paragraph, or EOF


print('\n====================================================================')
print('File SIX\n')
with open('SIX.txt') as six:
    print(six.read())


print('====================================================================')
print('File SEVEN\n')
with open('SEVEN.txt') as seven:
    print(seven.read())
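
If the debugging prints aren't needed, the same line-by-line idea can probably be written more compactly with itertools.groupby, which groups consecutive blank and non-blank lines without loading the whole file. This is a sketch under my own assumptions, not a drop-in replacement: iter_paragraphs and split_file are hypothetical names, whitespace-only lines are treated as blank, and any number of blank lines is accepted as a separator.

```python
from itertools import groupby

def iter_paragraphs(lines):
    """Yield each paragraph as a list of lines, line endings removed.

    lines can be any iterable of strings, e.g. an open text file, so
    only one paragraph is held in memory at a time.
    """
    stripped = (line.rstrip('\r\n') for line in lines)
    # groupby yields alternating runs of text lines and blank lines
    for has_text, group in groupby(stripped, key=lambda s: bool(s.strip())):
        if has_text:
            yield list(group)

def split_file(source_path, six_path='SIX.txt', seven_path='SEVEN.txt'):
    """Copy 6-line paragraphs to six_path and 7-line ones to seven_path.

    Returns the number of paragraphs of any other length (not copied).
    """
    skipped = 0
    with open(source_path) as src, \
         open(six_path, 'w') as six, open(seven_path, 'w') as seven:
        targets = {6: six, 7: seven}
        for para in iter_paragraphs(src):
            out = targets.get(len(para))
            if out is None:
                skipped += 1
            else:
                out.write('\n'.join(para) + '\n\n')
    return skipped
```

With this, split_file('source2.txt') would write the two files, and a nonzero return value plays the role of the warnings in the longer version.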
