Python: While reading a file and counting the words in each line, I want to count the words between " " or ' ' as a single word

Question

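I have a file in which I have to count the number of words in each line, but there is a catch: whatever appears between ' ' or " " should be counted as a single word.

Sample file: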

TopLevel  
    DISPLAY "In TopLevel. Starting to run program"  
    PERFORM OneLevelDown  
    DISPLAY "Back in TopLevel."  
    STOP RUN.

For the file above, the word count for each line should be:

Line: 1 has: 1 words  
Line: 2 has: 2 words  
Line: 3 has: 2 words  
Line: 4 has: 2 words  
Line: 5 has: 2 words

But this is what I get:

Line: 1 has: 1 words  
Line: 2 has: 7 words  
Line: 3 has: 2 words  
Line: 4 has: 4 words  
Line: 5 has: 2 words

from os import listdir
from os.path import isfile, join

srch_dir = r'C:\Users\sagrawal\Desktop\File'

onlyfiles = [srch_dir+'\\'+f for f in listdir(srch_dir) if isfile(join(srch_dir, f))]

for i in onlyfiles:
    index = 0
    with open(i,mode='r') as file:
        lst = file.readlines()
        for line in lst:
            cnt = 0
            index += 1
            linewrds=line.split()
            for lwrd in linewrds:
                if lwrd:
                    cnt = cnt +1
            print('Line:',index,'has:',cnt,' words')

Answer 1

If you only have this simple format (no nested or escaped quotes), you can use a simple regular expression:

lines = '''TopLevel  
    DISPLAY "In TopLevel. Starting to run program"  
    PERFORM OneLevelDown  
    DISPLAY "Back in TopLevel."  
    STOP RUN.'''.split('\n')

import re
counts = [len(re.findall(r'\'.*?\'|".*?"|\w+', l))
          for l in lines]
# [1, 2, 2, 2, 2]

If not, you will have to write a parser.
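As a rough idea of what such a parser could look like, here is a minimal sketch of a character-by-character scanner. The function name `count_words` and the rule that a backslash escapes the next character are assumptions made for illustration, not something the answer specifies. An unterminated quote is treated as running to the end of the line.

```python
QUOTES = ("'", '"')

def count_words(line):
    """Count words in a line; text between matching quotes belongs to one word."""
    count = 0
    in_word = False   # True while we are inside a word (quoted or not)
    quote = None      # quote character that opened the current region, if any
    escaped = False   # True if the previous character was a backslash
    for ch in line:
        # Whitespace ends a word only outside quotes and when not escaped
        if quote is None and not escaped and ch.isspace():
            in_word = False
            continue
        if not in_word:
            count += 1
            in_word = True
        if escaped:
            escaped = False       # this character was escaped, take it literally
        elif ch == "\\":
            escaped = True        # next character is taken literally
        elif quote is None:
            if ch in QUOTES:
                quote = ch        # open a quoted region
        elif ch == quote:
            quote = None          # close the quoted region
    return count

sample = [
    'TopLevel  ',
    '    DISPLAY "In TopLevel. Starting to run program"  ',
    '    PERFORM OneLevelDown  ',
    '    DISPLAY "Back in TopLevel."  ',
    '    STOP RUN.',
]
print([count_words(l) for l in sample])  # [1, 2, 2, 2, 2]
```

Unlike the regular expression, this sketch also copes with escaped quotes and with one kind of quote nested inside the other, at the cost of more code.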

Answer 2

If you are looking for a solution without regular expressions, here is my approach:

# A simple function that will simply count words in each line
def count_words(line):
    # Check the next function
    line = manage_quotes(line)
    words = line.split()
    # split() with no arguments already collapses runs of whitespace,
    # so this filter is only a safeguard against empty words
    words = [word for word in words if len(word) > 0]
    return len(words)

# This method will manage the quotes
def manage_quotes(line):
    # We do not mind the escaped quotes, They are like a simple char
    # Also since the changes will be local we can replace words in line
    line = line.replace("\\\"", "Q").replace("\\\'", "q")

    # As all words between 2 quotes act as one word we can replace them with 1 simple word and we start with `"`
    # This loop will help to find all quotes in one line
    while True:
        i1 = line.find("\"")
        if (i1 == -1): # No `"` anymore
            break
        i2 = line.find("\"", i1 + 1) # Search after the previous one
        if (i2 == -1): # What shall we do with not paired quotes???
            # raise Exception()
            break
        # Replace everything from the opening quote through the closing quote
        line = line[:i1] + "QUOTE" + line[i2+1:]
    # Now search for `'`
    while True:
        i1 = line.find("\'")
        if (i1 == -1): # No `'` anymore
            break
        i2 = line.find("\'", i1 + 1) # Search after the previous one
        if (i2 == -1): # What shall we do with not paired quotes???
            # raise Exception()
            break
        # Replace everything from the opening quote through the closing quote
        line = line[:i1] + "quote" + line[i2+1:]
    
    return line

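This is how the method works. Suppose you have a line like this: DISPLAY "Part One \'Test1\'" AND 'Part Two \"Test2\"'

First, we replace the escaped quotes: DISPLAY "Part One qTest1q" AND 'Part Two QTest2Q'

Then we replace the double-quoted parts: DISPLAY QUOTE AND 'Part Two QTest2Q'

Then the single-quoted ones: DISPLAY QUOTE AND quote

Now we count the words in this line: 4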
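The walkthrough above can be reproduced step by step with a small self-contained sketch. The helper `collapse_quoted` is a hypothetical name, and it uses `str.partition` rather than the `find` loops from the answer, so treat it as an illustration of the same idea rather than the answer's exact code.

```python
def collapse_quoted(line, quote, placeholder):
    """Replace each quote...quote span with placeholder; keep an unpaired quote as is."""
    out = []
    rest = line
    while True:
        before, opening, after = rest.partition(quote)
        if not opening:            # no more opening quotes
            out.append(before)
            break
        inner, closing, tail = after.partition(quote)
        if not closing:            # unpaired quote: leave the remainder untouched
            out.append(before + opening + after)
            break
        out.append(before + placeholder)
        rest = tail
    return "".join(out)

line = r"""DISPLAY "Part One \'Test1\'" AND 'Part Two \"Test2\"'"""

# Step 1: escaped quotes become ordinary characters
step1 = line.replace('\\"', "Q").replace("\\'", "q")
# Step 2: double-quoted spans become one word
step2 = collapse_quoted(step1, '"', "QUOTE")
# Step 3: single-quoted spans become one word
step3 = collapse_quoted(step2, "'", "quote")

print(step1)  # DISPLAY "Part One qTest1q" AND 'Part Two QTest2Q'
print(step2)  # DISPLAY QUOTE AND 'Part Two QTest2Q'
print(step3)  # DISPLAY QUOTE AND quote
print(len(step3.split()))  # 4
```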

Answer 3

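You can solve this without regular expressions if you keep a marker that records whether you are currently inside a quoted region.

str.split() - splits on whitespace and returns a list
str.startswith()
str.endswith() - both accept a string or a tuple of strings and return True if the string starts/ends with any one of them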
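As a quick illustration of these three methods on one of the sample lines (the variable names here are only for the example):

```python
words = 'DISPLAY "Back in TopLevel."'.split()
print(words)  # ['DISPLAY', '"Back', 'in', 'TopLevel."']

# Passing a tuple checks for any of the given prefixes or suffixes
quotes = ("'", '"')
print([w.startswith(quotes) for w in words])  # [False, True, False, False]
print([w.endswith(quotes) for w in words])    # [False, False, False, True]
```

The token that starts with a quote opens a quoted region, and the token that ends with a quote closes it.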

Code:

# create input file
name = "file.txt"
with open(name, "w") as f:
    f.write("""TopLevel  
    DISPLAY "In TopLevel. Starting to run program"  
    PERFORM OneLevelDown  
    DISPLAY "Back in TopLevel."  
    STOP RUN.""")

# for testing later
expected = [(1,1),(2,2),(3,2),(4,2),(5,2)]  # 1 base line/word count

# program that counts words
counted = []
with open(name) as f:
    for line_nr, content in enumerate(f,1): # 1 based line count
        splt = content.split()
        in_quotation = []
        line_count = 0
        for word in splt:
            if not in_quotation:
                line_count += 1  # only increments if list empty
            if word.startswith(("'",'"')):
                in_quotation.append(word[0])
            
            # check for both quote characters; skip pop() if nothing is open
            if word.endswith(("'",'"')) and in_quotation:
                in_quotation.pop()
        counted.append((line_nr, line_count))
            
print(expected)
print(counted)
print("Identical:", all(a == expected[i] for i,a in enumerate(counted)))

Output:

[(1, 1), (2, 2), (3, 2), (4, 2), (5, 2)]
[(1, 1), (2, 2), (3, 2), (4, 2), (5, 2)]
Identical: True

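You can adapt the code. As it stands, it behaves poorly if a " is separated from the quoted text by spaces: it cannot tell whether that quote opens or closes a region, because both tests are True for it.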
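One possible way around that limitation is to remember which quote character opened the current region, so that a lone quote token is read as opening when nothing is open and as closing otherwise. This is only a sketch under that assumption; `count_line_words` is a hypothetical name, and an unterminated quote simply swallows the rest of the line.

```python
QUOTES = ("'", '"')

def count_line_words(line):
    """Count words in one line; a quoted region counts once, even when the
    quote characters stand alone as separate tokens."""
    count = 0
    open_quote = None  # quote character of the region we are inside, if any
    for token in line.split():
        if open_quote is None:
            count += 1  # token starts a new word or a new quoted region
            if token.startswith(QUOTES):
                quote = token[0]
                # A token like "word" opens and closes at once; a lone quote
                # or a token that only starts with a quote opens the region.
                if len(token) == 1 or not token.endswith(quote):
                    open_quote = quote
        elif token.endswith(open_quote):
            open_quote = None  # closing token of the current region
    return count

print(count_line_words('DISPLAY " spaced text "'))      # 2
print(count_line_words('DISPLAY "Back in TopLevel."'))  # 2
print(count_line_words('STOP RUN.'))                    # 2
```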

Answer 4

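It seems the code attached above does not take ' or " into account. Here is the definition of str.split from the Python documentation:

If sep is not specified or is None, a different splitting algorithm is applied: runs of consecutive whitespace are regarded as a single separator, and the result will contain no empty strings at the start or end if the string has leading or trailing whitespace. Consequently, splitting an empty string or a string consisting of just whitespace with a None separator returns [].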
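The behaviour described in that passage explains the count of 7 reported in the question for line 2. The short sketch below shows it, and also shows `shlex.split` from the standard library as one quote-aware alternative. Note that `shlex` is not mentioned in the thread; it follows shell-style quoting rules and raises `ValueError` on an unbalanced quote (including a stray apostrophe), so it may not suit every input.

```python
import shlex

line = '    DISPLAY "In TopLevel. Starting to run program"  '

# Plain split() ignores quotes and splits on every run of whitespace
print(line.split())
# ['DISPLAY', '"In', 'TopLevel.', 'Starting', 'to', 'run', 'program"']
print(len(line.split()))  # 7

# Empty or whitespace-only strings give an empty list
print("".split(), "   ".split())  # [] []

# shlex.split() keeps a quoted span together as one token
tokens = shlex.split(line)
print(tokens)       # ['DISPLAY', 'In TopLevel. Starting to run program']
print(len(tokens))  # 2
```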

4 answers

Answer 1
1 2021-12-22 06:24:46

Answer 2
1 2021-12-22 07:11:42

Answer 3
1 2021-12-22 07:18:46

Answer 4
0 2021-12-22 06:25:17