
Counting how many words each line has in a text file with Python (using str.split)

I have two files. The input file is "our_input.txt" (in the same directory as the code file), which contains Oscar Wilde's Dorian Gray. There is also an output file. I need to open the input file, have Python count how many words each line has, and write the results to the output file.

I tried, but I got lost...

You can try something like this.

First you read your input file:

with open('our_input.txt') as f:
    lines = f.readlines()

Then you count the number of words per line and write to the output file:

with open('our_output.txt', 'w') as f:
    for index, value in enumerate(lines):
        # split() with no arguments splits on any run of whitespace
        number_of_words = len(value.split())
        f.write('Line number {} has {} words.\n'.format(index + 1, number_of_words))

You will need to iterate over each line of the input text file, which is done with a standard for loop. You can then split each line on whitespace (calling str.split() with no arguments does this) and use len() to count the number of elements in the resulting list. Write that count to the output file and you are done.
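The loop described above could be sketched as follows. The function names and the UTF-8 encoding are assumptions made for illustration; they do not come from the question. The sketch reads the input one line at a time rather than loading every line first.

```python
def count_words_per_line(lines):
    """Return the number of whitespace-separated words on each line."""
    return [len(line.split()) for line in lines]


def write_word_counts(input_path, output_path):
    """Write one 'Line number N has M words.' line per input line."""
    # Assumption: both files are UTF-8 text.
    with open(input_path, encoding="utf-8") as src, \
         open(output_path, "w", encoding="utf-8") as dst:
        # Iterating over the file object yields one line at a time.
        for number, line in enumerate(src, start=1):
            words = len(line.split())
            dst.write("Line number {} has {} words.\n".format(number, words))
```

Note that split() with no arguments and split(' ') behave differently: "a  b".split(' ') returns ['a', '', 'b'], so consecutive spaces would inflate the count. The no-argument form avoids this.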

A simple technique for counting words in a file, usable in any language, is:

  1. Read the file into a variable.
  2. Replace unnecessary characters such as carriage returns or line feeds with a space character. Trim space characters from the beginning and end of the string.
  3. Replace multiple consecutive space characters with a single space.

We now have a string with words separated by single spaces.

Now either
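In Python, steps 2 and 3 might look like the sketch below. The function name normalize_whitespace is hypothetical, and the sketch only handles carriage returns, line feeds and spaces; tabs and other whitespace would need the same treatment.

```python
import re


def normalize_whitespace(text):
    """Return text with line breaks replaced and spaces collapsed."""
    # Step 2: turn carriage returns and line feeds into spaces,
    # then trim spaces from both ends.
    text = text.replace("\r", " ").replace("\n", " ").strip(" ")
    # Step 3: collapse every run of two or more spaces into one.
    return re.sub(r" {2,}", " ", text)
```

For example, normalize_whitespace("  one\r\ntwo   three\n") returns "one two three".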

  • Use the language's split function with a space as the delimiter to produce an array. The number of words is the array length, adjusted if necessary for whether the language's arrays start at index zero or 1.

or

  • If the language has a function that counts occurrences of a given character, use it to count the number of spaces in the string and add 1. This is the number of words.
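Both options can be sketched in Python, assuming the input is already a string of words separated by single spaces. The function names are hypothetical. The empty string needs a guard in both cases: "".split(" ") returns [''], and zero spaces plus one would report one word.

```python
def count_by_split(normalized):
    """Count words by splitting on single spaces."""
    if not normalized:
        return 0
    # len() already gives the element count, so no index adjustment
    # is needed in Python.
    return len(normalized.split(" "))


def count_by_spaces(normalized):
    """Count words as the number of spaces plus one."""
    if not normalized:
        return 0
    return normalized.count(" ") + 1
```

Both functions return 3 for "one two three" and 0 for an empty string.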

The size of the file being processed could make this a heavy job for the processor, and performance will depend on how the language handles strings and arrays.

If you are working in a client-server setup, or the text is stored in a database, consider the high network cost of moving the string. It is better to run the count as close to the data as possible. So if you are using an RDBMS, use a stored procedure: it is faster to count the words in a 2 GB string on the server and send an int with the answer to the client than to send the 2 GB string and count in a web browser.

If you cannot read the entire file in one pass, you can read it line by line and apply the above techniques to each line. However, because of string-handling and loop overhead, performance will generally be better if you can process the entire file as one string.
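The two approaches might be compared with a sketch like the one below. The function names and the UTF-8 encoding are assumptions. Both functions return the same total, because a newline is itself whitespace and no word spans two lines. Which one is faster for a given file would have to be measured; the sketch makes no claim about that.

```python
def count_words_whole(path):
    """Read the whole file into one string and count its words."""
    with open(path, encoding="utf-8") as f:
        return len(f.read().split())


def count_words_by_line(path):
    """Count words one line at a time, keeping memory use low."""
    total = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            total += len(line.split())
    return total
```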

