Python: While reading a file and counting the words in each line, I want to count the words between " " or ' ' as a single word

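I have a file in which I have to count the number of words in each line, but there is a catch: whatever appears between ' ' or " " should be counted as a single word.

Sample file: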

TopLevel  
    DISPLAY "In TopLevel. Starting to run program"  
    PERFORM OneLevelDown  
    DISPLAY "Back in TopLevel."  
    STOP RUN.

For the file above, the word count for each line should be:

Line: 1 has: 1 words  
Line: 2 has: 2 words  
Line: 3 has: 2 words  
Line: 4 has: 2 words  
Line: 5 has: 2 words

But this is what I get:

Line: 1 has: 1 words  
Line: 2 has: 7 words  
Line: 3 has: 2 words  
Line: 4 has: 4 words  
Line: 5 has: 2 words  

Code:

from os import listdir
from os.path import isfile, join

srch_dir = r'C:\Users\sagrawal\Desktop\File'

# Full paths of all regular files in srch_dir
onlyfiles = [join(srch_dir, f) for f in listdir(srch_dir) if isfile(join(srch_dir, f))]

for i in onlyfiles:
    index = 0
    with open(i, mode='r') as file:
        lst = file.readlines()
        for line in lst:
            cnt = 0
            index += 1
            # split() breaks on every run of whitespace, including inside quotes
            linewrds = line.split()
            for lwrd in linewrds:
                if lwrd:
                    cnt = cnt + 1
            print('Line:', index, 'has:', cnt, 'words')

If you only have this simple format (no nested or escaped quotes), you can use a simple regular expression:

lines = '''TopLevel  
    DISPLAY "In TopLevel. Starting to run program"  
    PERFORM OneLevelDown  
    DISPLAY "Back in TopLevel."  
    STOP RUN.'''.split('\n')

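import re
# Match a '...' region, a "..." region, or a run of word characters.
# A raw string keeps \w from being read as a string escape.
counts = [len(re.findall(r'\'.*?\'|".*?"|\w+', l))
          for l in lines]
# [1, 2, 2, 2, 2]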

If not, you will have to write a parser.
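As a rough idea of what such a parser could look like, here is a minimal sketch of a character-by-character state machine. The name `count_words`, the backslash-escape rule, and the decision to let an unterminated quote swallow the rest of the line are all assumptions made for illustration; they are not part of the answer above.

```python
QUOTES = ("'", '"')

def count_words(line):
    """Count words in a line; a quoted region belongs to a single word."""
    count = 0
    in_word = False   # are we currently inside a word?
    quote = None      # quote character that opened the current region, if any
    escaped = False   # was the previous character a backslash?
    for ch in line:
        if quote is None and not escaped and ch.isspace():
            in_word = False       # unquoted whitespace ends the current word
            continue
        if not in_word:
            in_word = True        # any other character starts a new word
            count += 1
        if escaped:
            escaped = False       # assumption: an escaped character is literal
        elif ch == "\\":
            escaped = True
        elif quote is None and ch in QUOTES:
            quote = ch            # a quoted region opens
        elif ch == quote:
            quote = None          # the matching quote closes it
    return count

for text in ['TopLevel', '    DISPLAY "Back in TopLevel."  ']:
    print(count_words(text))  # prints 1, then 2
```

Because the escape flag is checked before the quote tests, `\"` inside a quoted region does not close it. An unterminated quote simply extends to the end of the line; raising an error there would be an equally reasonable choice.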

If you are looking for a non-regex solution, here is an approach you can use:

# Count the words in one line, treating each quoted region as a single word
def count_words(line):
    # Collapse every quoted region into one placeholder word (see manage_quotes below)
    line = manage_quotes(line)
    words = line.strip().split(" ")
    # Several spaces in a row produce empty strings, so filter them out
    words = [word for word in words if len(word) > 0]
    return len(words)

# Replace every quoted region in the line with a single placeholder word
def manage_quotes(line):
    # Escaped quotes do not delimit anything, so treat them as ordinary characters.
    # The changes are local to this function, so rewriting the line is safe.
    line = line.replace("\\\"", "Q").replace("\\\'", "q")

    # Everything between two quotes acts as one word, so replace it with one simple word.
    # Start with `"`; the loop finds every pair in the line.
    while True:
        i1 = line.find("\"")
        if i1 == -1:  # No `"` left
            break
        i2 = line.find("\"", i1 + 1)  # Search after the opening quote
        if i2 == -1:  # Unpaired quote: what should happen here?
            # raise Exception()
            break
        line = line[:i1] + "QUOTE" + line[i2 + 1:]
    # Now do the same for `'`
    while True:
        i1 = line.find("\'")
        if i1 == -1:  # No `'` left
            break
        i2 = line.find("\'", i1 + 1)  # Search after the opening quote
        if i2 == -1:  # Unpaired quote: what should happen here?
            # raise Exception()
            break
        line = line[:i1] + "quote" + line[i2 + 1:]

    return line

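Here is how the method works. Say you have a line like this: DISPLAY "Part One \'Test1\'" AND 'Part Two \"Test2\"'

First, we replace the escaped quotes: DISPLAY "Part One qTest1q" AND 'Part Two QTest2Q'

Then we replace the double-quoted regions: DISPLAY QUOTE AND 'Part Two QTest2Q'

Then the single-quoted ones: DISPLAY QUOTE AND quote

Now we count the words, which gives 4.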

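You can solve this without regex if you keep a marker for whether you are inside a quoted region.

  • str.split() - splits on whitespace and returns a list
  • str.startswith()
  • str.endswith() - accept a string or a tuple of strings and return True if the string starts/ends with any of them

Code: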

# create input file
name = "file.txt"
with open(name, "w") as f:
    f.write("""TopLevel  
    DISPLAY "In TopLevel. Starting to run program"  
    PERFORM OneLevelDown  
    DISPLAY "Back in TopLevel."  
    STOP RUN.""")

# for testing later
expected = [(1,1),(2,2),(3,2),(4,2),(5,2)]  # 1 base line/word count

# program that counts words
counted = []
with open(name) as f:
    for line_nr, content in enumerate(f,1): # 1 based line count
        splt = content.split()
        in_quotation = []
        line_count = 0
        for word in splt:
            if not in_quotation:
                line_count += 1  # only increments if list empty
            if word.startswith(("'",'"')):
                in_quotation.append(word[0])
            
            # close the region on either quote character; skip if none is open
            if in_quotation and word.endswith(("'", '"')):
                in_quotation.pop()
        counted.append((line_nr, line_count))
            
print(expected)
print(counted)
print("Identical:", counted == expected)

Output:

[(1, 1), (2, 2), (3, 2), (4, 2), (5, 2)]
[(1, 1), (2, 2), (3, 2), (4, 2), (5, 2)]
Identical: True

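You can refine the code. At the moment it does not cope well if you put spaces around the quote characters (for example " text "): it cannot tell whether a lone quote opens or closes a region, because both tests are True for it.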
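One possible refinement, sketched under assumptions: instead of testing only the first and last character of each token, scan every character and remember which quote character opened the current region. The function name `count_line` is hypothetical, and escaped quotes and apostrophes inside ordinary words (such as don't) are deliberately not handled.

```python
QUOTES = ("'", '"')

def count_line(content):
    """Count words in one line; tokens inside a quoted region are not counted again."""
    open_quote = None   # quote character of the region we are inside, or None
    line_count = 0
    for word in content.split():
        if open_quote is None:
            line_count += 1          # a token that starts outside a region is a new word
        for ch in word:
            if open_quote is None and ch in QUOTES:
                open_quote = ch      # a region opens
            elif ch == open_quote:
                open_quote = None    # the matching quote closes it
    return line_count

print(count_line('DISPLAY " spaced out " END'))  # 3
```

Because the state is carried from token to token, a lone `"` is read as opening a region when none is open and as closing it otherwise, which resolves the ambiguity described above.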

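It seems the code posted in the question does not take ' and " into account. Here is the definition of str.split from the documentation:

If sep is not specified or is None, a different splitting algorithm is applied: runs of consecutive whitespace are regarded as a single separator, and the result will contain no empty strings at the start or end if the string has leading or trailing whitespace. Consequently, splitting an empty string or a string consisting of just whitespace with a None separator returns [].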
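The behaviour quoted above can be checked directly. The sketch below also shows the standard library's `shlex.split`, which is a different technique from anything proposed in this thread: it applies shell-like quoting rules, so it is only an assumption that those rules suit the file being read.

```python
import shlex

line = '    DISPLAY "In TopLevel. Starting to run program"  \n'

# Plain split(): runs of whitespace are one separator, quotes are ordinary characters
plain = line.split()
# ['DISPLAY', '"In', 'TopLevel.', 'Starting', 'to', 'run', 'program"']

# shlex.split(): shell-like rules keep a quoted region together in one token
quoted = shlex.split(line)
# ['DISPLAY', 'In TopLevel. Starting to run program']

print(len(plain), len(quoted))  # 7 2
print("   ".split())            # []
```

Note that `shlex.split` raises ValueError on an unbalanced quote and treats an apostrophe inside a word as an opening quote, so a real file may need error handling around it.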
