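Python: While reading the file and counting the words in a line, I want to count words coming between " " or ' ' as a single word
I have a file in which I have to count the number of words on each line, but there is a catch: whatever appears between ' ' or " " should be counted as a single word.
Sample file: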
TopLevel
DISPLAY "In TopLevel. Starting to run program"
PERFORM OneLevelDown
DISPLAY "Back in TopLevel."
STOP RUN.
For the file above, the word count for each line should be:
Line: 1 has: 1 words
Line: 2 has: 2 words
Line: 3 has: 2 words
Line: 4 has: 2 words
Line: 5 has: 2 words
But I am getting the following:
Line: 1 has: 1 words
Line: 2 has: 7 words
Line: 3 has: 2 words
Line: 4 has: 4 words
Line: 5 has: 2 words
from os import listdir
from os.path import isfile, join

srch_dir = r'C:\Users\sagrawal\Desktop\File'
onlyfiles = [srch_dir+'\\'+f for f in listdir(srch_dir) if isfile(join(srch_dir, f))]

for i in onlyfiles:
    index = 0
    with open(i,mode='r') as file:
        lst = file.readlines()
        for line in lst:
            cnt = 0
            index += 1
            linewrds=line.split()
            for lwrd in linewrds:
                if lwrd:
                    cnt = cnt +1
            print('Line:',index,'has:',cnt,' words')
If you only have this simple format (no nested or escaped quotes), you can use a simple regular expression:
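import re

lines = '''TopLevel
DISPLAY "In TopLevel. Starting to run program"
PERFORM OneLevelDown
DISPLAY "Back in TopLevel."
STOP RUN.'''.split('\n')

# Raw string so that \w reaches the regex engine unchanged.
# Alternatives: a '...' region, a "..." region, or a run of word characters.
counts = [len(re.findall(r'\'.*?\'|".*?"|\w+', l))
          for l in lines]
# [1, 2, 2, 2, 2]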
Otherwise, you will have to write a parser.
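As a rough illustration of what such a parser could look like, here is a minimal sketch that scans each line character by character. The function name `count_words` and the rules it applies are assumptions made for this example, not part of the original answer: a backslash makes the next character literal, a quote is closed only by the same quote character, and an unclosed quote swallows the rest of the line.

```python
def count_words(line):
    """Count words in a line; a quoted region never ends the current word."""
    count = 0
    in_word = False   # are we currently inside a word?
    quote = None      # the quote character that is open, or None
    escaped = False   # was the previous character a backslash?
    for ch in line:
        # Whitespace ends a word only outside quotes and when not escaped
        if quote is None and not escaped and ch.isspace():
            in_word = False
            continue
        if not in_word:
            in_word = True    # first character of a new word
            count += 1
        if escaped:
            escaped = False   # escaped character is taken literally
        elif ch == "\\":
            escaped = True
        elif quote is None and ch in "'\"":
            quote = ch        # opening quote
        elif ch == quote:
            quote = None      # matching closing quote
    return count


sample = [
    'TopLevel',
    'DISPLAY "In TopLevel. Starting to run program"',
    'PERFORM OneLevelDown',
    'DISPLAY "Back in TopLevel."',
    'STOP RUN.',
]
for number, text in enumerate(sample, 1):
    print('Line:', number, 'has:', count_words(text), 'words')
```

Unlike the regular expression, this version counts any run of non-whitespace characters as a word, so punctuation on its own (for example a lone `-`) is counted too. Whether that is desirable depends on the input format.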
If you are looking for a non-regex solution, here is an approach:
# Count the words in a line, treating each quoted region as one word
def count_words(line):
    # Collapse quoted regions first (see manage_quotes below)
    line = manage_quotes(line)
    # split() without arguments splits on runs of whitespace,
    # so several spaces in a row never produce empty words
    words = line.split()
    return len(words)

# Replace every quoted region with a single placeholder word
def manage_quotes(line):
    # Escaped quotes are treated as ordinary characters.
    # The changes are local, so we can safely rewrite the line.
    line = line.replace('\\"', "Q").replace("\\'", "q")
    # Everything between two quotes acts as one word, so we replace it
    # with one placeholder word. We start with `"`.
    # This loop finds all quoted regions in the line.
    while True:
        i1 = line.find('"')
        if i1 == -1:  # No `"` left
            break
        i2 = line.find('"', i1 + 1)  # Search after the opening quote
        if i2 == -1:  # What should we do with unpaired quotes?
            # raise Exception()
            break
        # Replace the region including both quote characters
        line = line[:i1] + "QUOTE" + line[i2 + 1:]
    # Now do the same for `'`
    while True:
        i1 = line.find("'")
        if i1 == -1:  # No `'` left
            break
        i2 = line.find("'", i1 + 1)  # Search after the opening quote
        if i2 == -1:  # What should we do with unpaired quotes?
            # raise Exception()
            break
        line = line[:i1] + "quote" + line[i2 + 1:]
    return line
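Here is how this method works. Suppose you have a line like this: DISPLAY "Part One \'Test1\'" AND 'Part Two \"Test2\"'
First, we replace the escaped quotes: DISPLAY "Part One qTest1q" AND 'Part Two QTest2Q'
Then we replace the double-quoted region: DISPLAY QUOTE AND 'Part Two QTest2Q'
Then the single-quoted one: DISPLAY QUOTE AND quote
Now we count the words, which gives 4.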
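If you keep a marker while you are inside a quoted region, you can solve this without a regex.
str.split() - splits on whitespace and returns a list
str.startswith() / str.endswith() - accept a string (or a tuple of strings) and return True if the string starts/ends with (any one of) them
Code: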
# create input file
name = "file.txt"
with open(name, "w") as f:
    f.write("""TopLevel
DISPLAY "In TopLevel. Starting to run program"
PERFORM OneLevelDown
DISPLAY "Back in TopLevel."
STOP RUN.""")

# for testing later
expected = [(1,1),(2,2),(3,2),(4,2),(5,2)]  # 1-based line number / word count

# program that counts words
counted = []
with open(name) as f:
    for line_nr, content in enumerate(f, 1):  # 1-based line count
        splt = content.split()
        in_quotation = []
        line_count = 0
        for word in splt:
            if not in_quotation:
                line_count += 1  # only increments if the list is empty
            if word.startswith(("'", '"')):
                in_quotation.append(word[0])
            # pop only while a quotation is open, so a stray trailing
            # quote cannot raise IndexError on an empty list
            if in_quotation and word.endswith(("'", '"')):
                in_quotation.pop()
        counted.append((line_nr, line_count))

print(expected)
print(counted)
print("Identical: ", all(a == expected[i] for i, a in enumerate(counted)))
Output:
[(1, 1), (2, 2), (3, 2), (4, 2), (5, 2)]
[(1, 1), (2, 2), (3, 2), (4, 2), (5, 2)]
Identical: True
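You can adapt the code. Currently it does not behave well if a " is surrounded by spaces: it cannot tell whether the quote is opening or closing a region, because both tests are True.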
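It seems that the code attached to the question does not take ' or " into account. Here is the definition of str.split from the Python documentation:
If sep is not specified or is None, a different splitting algorithm is applied: runs of consecutive whitespace are regarded as a single separator, and the result will contain no empty strings at the start or end if the string has leading or trailing whitespace. Consequently, splitting an empty string or a string consisting of just whitespace with a None separator returns [].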
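To make the difference concrete, the sketch below compares plain `str.split()` with the standard library's `shlex.split()`, which keeps quoted text together using shell-like rules. This is one possible alternative rather than something proposed in the thread; the helper name `count_words_shlex` and the fallback for unbalanced quotes are assumptions made for this example.

```python
import shlex


def count_words_shlex(line):
    """Count words with shell-like quoting rules (hypothetical helper)."""
    try:
        return len(shlex.split(line))
    except ValueError:
        # shlex raises ValueError on an unbalanced quote (e.g. an apostrophe);
        # as an assumed fallback, count plain whitespace-separated words
        return len(line.split())


line = 'DISPLAY "In TopLevel. Starting to run program"'

# str.split() only looks at whitespace, so the quoted text is broken up
print(len(line.split()))          # 7

# shlex.split() returns the quoted region as a single item
print(shlex.split(line))          # ['DISPLAY', 'In TopLevel. Starting to run program']
print(count_words_shlex(line))    # 2
```

Note that `shlex.split()` also interprets backslash escapes the way a POSIX shell would, which may or may not match the quoting rules of the file being read.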