Python: When counting the words in each line of a file, I want text between " " or ' ' to count as a single word

I have a file in which I have to count the number of words in each line, but there is a catch: whatever comes between ' ' or " " should be counted as a single word.

Example file:

TopLevel  
    DISPLAY "In TopLevel. Starting to run program"  
    PERFORM OneLevelDown  
    DISPLAY "Back in TopLevel."  
    STOP RUN.

For the file above, the word count for each line should be:

Line: 1 has: 1 words  
Line: 2 has: 2 words  
Line: 3 has: 2 words  
Line: 4 has: 2 words  
Line: 5 has: 2 words

But I am getting this instead:

Line: 1 has: 1 words  
Line: 2 has: 7 words  
Line: 3 has: 2 words  
Line: 4 has: 4 words  
Line: 5 has: 2 words

My code:

from os import listdir
from os.path import isfile, join

srch_dir = r'C:\Users\sagrawal\Desktop\File'

onlyfiles = [srch_dir+'\\'+f for f in listdir(srch_dir) if isfile(join(srch_dir, f))]

for i in onlyfiles:
    index = 0
    with open(i,mode='r') as file:
        lst = file.readlines()
        for line in lst:
            cnt = 0
            index += 1
            linewrds=line.split()
            for lwrd in linewrds:
                if lwrd:
                    cnt = cnt +1
            print('Line:',index,'has:',cnt,' words')

If you only have this simple format (no nested quotes or escaped quotes), you could use a simple regex:

lines = '''TopLevel  
    DISPLAY "In TopLevel. Starting to run program"  
    PERFORM OneLevelDown  
    DISPLAY "Back in TopLevel."  
    STOP RUN.'''.split('\n')

import re
# Raw string, so that \w reaches the regex engine unchanged
counts = [len(re.findall(r'\'.*?\'|".*?"|\w+', l))
          for l in lines]
# [1, 2, 2, 2, 2]

If not, you will have to write a parser.
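As a rough idea of what such a parser could look like, here is a minimal sketch of a character-by-character scanner. It assumes that a backslash escapes the next character and that an unterminated quote swallows the rest of the line; both are assumptions for illustration, not rules stated in the question, and the name count_words is only illustrative.

```python
def count_words(line):
    """Count words in a line; a quoted section counts as a single word."""
    count = 0
    in_token = False   # True while inside a word or a quoted section
    quote = None       # quote character of the open quoted section, if any
    i = 0
    while i < len(line):
        ch = line[i]
        if quote is None and ch.isspace():
            in_token = False          # whitespace outside quotes ends a token
        else:
            if not in_token:          # first character of a new token
                count += 1
                in_token = True
            if ch == "\\" and i + 1 < len(line):
                i += 1                # assumption: backslash escapes the next char
            elif quote is None and ch in "'\"":
                quote = ch            # opening quote
            elif ch == quote:
                quote = None          # matching closing quote
        i += 1
    return count


lines = '''TopLevel
    DISPLAY "In TopLevel. Starting to run program"
    PERFORM OneLevelDown
    DISPLAY "Back in TopLevel."
    STOP RUN.'''.split('\n')

counts = [count_words(l) for l in lines]
print(counts)  # [1, 2, 2, 2, 2]
```

Because the scanner only closes a section on the same quote character that opened it, a ' inside a "..." section (or the reverse) is treated as ordinary text.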

If you are looking for a non-regex solution, here is my approach:

# A simple function that counts the words in a line
def count_words(line):
    # Collapse every quoted section into one placeholder word first
    line = manage_quotes(line)
    # split() without arguments splits on runs of whitespace and never
    # returns empty strings, so no extra filtering is needed
    words = line.split()
    return len(words)

# This function replaces every quoted section with a single word
def manage_quotes(line):
    # Escaped quotes are treated like ordinary characters.
    # The change is local to this function, so we can modify the line freely.
    line = line.replace("\\\"", "Q").replace("\\\'", "q")

    # Everything between two quotes acts as one word, so we replace each
    # quoted section with one simple word. We start with `"`.
    # This loop finds all the double quotes in the line
    while True:
        i1 = line.find("\"")
        if i1 == -1:  # No `"` left
            break
        i2 = line.find("\"", i1 + 1)  # Search after the opening quote
        if i2 == -1:  # Unpaired quote: leave the rest of the line as it is
            # raise Exception()
            break
        line = line[:i1] + "QUOTE" + line[i2 + 1:]
    # Now do the same for `'`
    while True:
        i1 = line.find("\'")
        if i1 == -1:  # No `'` left
            break
        i2 = line.find("\'", i1 + 1)  # Search after the opening quote
        if i2 == -1:  # Unpaired quote: leave the rest of the line as it is
            # raise Exception()
            break
        line = line[:i1] + "quote" + line[i2 + 1:]

    return line

Here is how this method works. Suppose you have a line like this: DISPLAY "Part One \'Test1\'" AND 'Part Two \"Test2\"'

First, we replace the escaped quotes: DISPLAY "Part One qTest1q" AND 'Part Two QTest2Q'

Then we replace the double-quoted sections: DISPLAY QUOTE AND 'Part Two QTest2Q'

Then the single-quoted ones: DISPLAY QUOTE AND quote

Finally we count the words, which gives 4.
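The trace above can be reproduced with a compact sketch that prints each intermediate stage. The helper name collapse_quoted is hypothetical and only used for this illustration; it assumes the placeholder itself contains no quote character, otherwise the loop would not terminate.

```python
def collapse_quoted(line, quote, placeholder):
    """Replace each section delimited by `quote` with `placeholder`."""
    while True:
        start = line.find(quote)
        if start == -1:               # no quote of this kind left
            return line
        end = line.find(quote, start + 1)
        if end == -1:                 # unpaired quote: leave the rest unchanged
            return line
        line = line[:start] + placeholder + line[end + 1:]


line = r"""DISPLAY "Part One \'Test1\'" AND 'Part Two \"Test2\"'"""

# Step 1: escaped quotes become ordinary letters
step1 = line.replace('\\"', "Q").replace("\\'", "q")
# Step 2: double-quoted sections become one word
step2 = collapse_quoted(step1, '"', "QUOTE")
# Step 3: single-quoted sections become one word
step3 = collapse_quoted(step2, "'", "quote")

print(step1)  # DISPLAY "Part One qTest1q" AND 'Part Two QTest2Q'
print(step2)  # DISPLAY QUOTE AND 'Part Two QTest2Q'
print(step3)  # DISPLAY QUOTE AND quote
print(len(step3.split()))  # 4
```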

You can solve this without regex if you keep a marker that records whether you are currently inside a quoted section or not.

  • str.split() - splits at whitespace, returns a list
  • str.startswith() - takes a string (or a tuple of strings) and returns True if the string starts with it (or with any of them)
  • str.endswith() - the same, but for the end of the string
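A quick illustration of these three methods on one of the example lines (the variable names are only for this sketch):

```python
line = '    DISPLAY "Back in TopLevel."  '

# split() with no arguments splits on runs of whitespace
words = line.split()
print(words)  # ['DISPLAY', '"Back', 'in', 'TopLevel."']

# startswith()/endswith() accept a tuple and test against any of its items
quotes = ("'", '"')
starts = [w.startswith(quotes) for w in words]
ends = [w.endswith(quotes) for w in words]
print(starts)  # [False, True, False, False]
print(ends)    # [False, False, False, True]
```

The opening quote shows up at the start of one token and the closing quote at the end of a later one, which is what the marker has to track.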

Code:

# create input file
name = "file.txt"
with open(name, "w") as f:
    f.write("""TopLevel  
    DISPLAY "In TopLevel. Starting to run program"  
    PERFORM OneLevelDown  
    DISPLAY "Back in TopLevel."  
    STOP RUN.""")

# for testing later
expected = [(1,1),(2,2),(3,2),(4,2),(5,2)]  # 1 base line/word count

# program that counts words
counted = []
with open(name) as f:
    for line_nr, content in enumerate(f,1): # 1 based line count
        splt = content.split()
        in_quotation = []
        line_count = 0
        for word in splt:
            if not in_quotation:
                line_count += 1  # only increments if list empty
            if word.startswith(("'",'"')):
                in_quotation.append(word[0])
            
            # close the quotation only if one is open and the word ends with a quote
            if in_quotation and word.endswith(("'",'"')):
                in_quotation.pop()
        counted.append((line_nr, line_count))
            
print(expected)
print(counted)
print("Identical: ", all(a == expected[i] for i,a in enumerate(counted)))

Output:

[(1, 1), (2, 2), (3, 2), (4, 2), (5, 2)]
[(1, 1), (2, 2), (3, 2), (4, 2), (5, 2)]
Identical: True

You can tinker with the code. Currently it does not behave well if you put spaces around your quotes (for example a " standing on its own): it cannot tell whether something starts or ends there, and both tests are True.

It seems that the code in the question doesn't treat ' or " specially. Here is how the Python documentation describes str.split:

If sep is not specified or is None, a different splitting algorithm is applied: runs of consecutive whitespace are regarded as a single separator, and the result will contain no empty strings at the start or end if the string has leading or trailing whitespace. Consequently, splitting an empty string or a string consisting of just whitespace with a None separator returns [].
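A small demonstration of that behaviour on the second line of the example file, which explains the count of 7. For comparison, the sketch also calls shlex.split from the standard library, which honours shell-style quoting; shlex is not mentioned anywhere in this thread, so treat it as one possible alternative rather than the intended fix.

```python
import shlex

line = '    DISPLAY "In TopLevel. Starting to run program"  '

# str.split() only looks at whitespace, so the quoted text is broken up
plain_tokens = line.split()
print(plain_tokens)
# ['DISPLAY', '"In', 'TopLevel.', 'Starting', 'to', 'run', 'program"']
print(len(plain_tokens))  # 7

# A whitespace-only string yields an empty list, as the documentation says
print("   ".split())  # []

# shlex.split() keeps a quoted section together and strips the quotes
shell_tokens = shlex.split(line)
print(shell_tokens)  # ['DISPLAY', 'In TopLevel. Starting to run program']
print(len(shell_tokens))  # 2
```

Note that shlex.split raises ValueError on an unbalanced quote, so a line containing a lone apostrophe would need separate handling.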
