Counting the number of spaces between words in a file using Python?

I'm really close. I read through "number of space between each word" and it does provide this line:

counts = [(len(list(cpart))) for c,cpart in groupby(s) if c == ' ']

but I really don't understand it... I understand, or am assuming, that c is the delimiter, s is what you're grouping by, and you're putting the resulting list (I'm new to Python; is that an array?) into counts (s refers to a previously instantiated variable).

How would I determine something like this?

                                                  AMOUNT       DATE       
   NAME          ACCOUNT#         DISCOUNT         DUE         DUE

I am creating a program that lets me look at the headers of a randomly created COBOL output file and use them to create the associated PIC X clauses.

Example solution output would be:

  1. PIC X(30) VALUE SPACES.
  2. PIC X(6) VALUE "AMOUNT".
  3. PIC X(8) VALUE SPACES.
  4. PIC X(4) VALUE "DATE".

The important parts are the numbers. I can determine the lengths of the strings, obviously, but for the spaces I'm not sure how...

Here is what I have so far, to show I'm working lol:

from itertools import groupby
from test.test_iterlen import len
from macpath import split
from lib2to3.fixer_util import String

file = open("C:\\Users\\Joshua\\Desktop\\Practice\\cobol.cbl", 'r+')

line1 = file.readline()
split = line1.split()
print (split)
print ()

counts = [(len(list(cpart))) for c,cpart in groupby(split) if c == ' ']

print (counts)


index = 0
while index != split.__len__():
    if split[index].strip() != None:
        print ("PICX(" + ") VALUE " + "\"" + split[index] + "\".")
    elif counts[index] == None:
        print ("PICX(" + ") VALUE " + "\"" + split[index] + "\".") 
    index+=1

I'll begin by explaining the first line:

counts = [(len(list(cpart))) for c,cpart in groupby(s) if c == ' ']

s is actually the input string. So, to run this you'd start with:

s = "   NAME          ACCOUNT#         DISCOUNT         DUE         DUE"

groupby(s) returns an iterator of tuples. The first value in each tuple is a character from the input string, and the second value is another (nested) iterator that runs through the consecutive repeats of that character. Put into list form (for illustration) it would look like this:

groupby("hello!!!")
[('h', ['h']), ('e', ['e']), ('l', ['l', 'l']), ('o', ['o']), ('!', ['!', '!', '!'])]
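
To reproduce that listing yourself, the nested iterators have to be turned into lists first, because groupby only hands back lazy iterators. A minimal sketch (the name groups is only for illustration):

```python
from itertools import groupby

# groupby yields (key, group_iterator) pairs; wrapping each group in list()
# makes its contents visible.
groups = [(c, list(cpart)) for c, cpart in groupby("hello!!!")]
print(groups)
# [('h', ['h']), ('e', ['e']), ('l', ['l', 'l']), ('o', ['o']), ('!', ['!', '!', '!'])]
```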

So, c is not a delimiter; it's the variable that holds each character in the string s, and cpart is the iterator over all the consecutive occurrences of c. Once you call list(cpart) it gives a list of [c,c,c,...] (each item is the same!) and the length of that list is the number of times that the character c is repeated. Normally it will just be one. For example, for the 'A' in 'NAME' you'll get c == 'A' and list(cpart) == ['A']. But for the spaces between NAME and ACCOUNT#, you'll get c == ' ' and list(cpart) == [' ',' ',' ',' ',' ',' ',' ',' ',' ',' '].

The whole thing being inside brackets [] means that it generates a list, as if you were appending to a list within a for loop, and the value of each item is the expression before the for. Here, it's len(list(cpart)), which counts the length of that list of repeated instances of a character. Thus, the result is a list of the number of times each character is repeated. The if c == ' ' means that an item will be added to the list only when that character is a space.
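
Written out as the for loop described above, the comprehension is roughly equivalent to the following sketch. The string s is retyped from the question's second header line, so the exact widths depend on how the spacing survived copying:

```python
from itertools import groupby

s = "   NAME          ACCOUNT#         DISCOUNT         DUE         DUE"

counts = []
for c, cpart in groupby(s):
    # Only runs of spaces are of interest; every other character is skipped.
    if c == ' ':
        counts.append(len(list(cpart)))

print(counts)  # one number per run of spaces, in left-to-right order
```

Note that the leading spaces before NAME form a run too, so the list has one entry more than the number of gaps between words.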


The above will count the spaces. To count the letters in each word (e.g., to get PIC X(6) VALUE "AMOUNT") you can simply do something like:

word_counts = [ len(word) for word in s.split() ]

where split (which you have used) returns a list of the words that were previously in one string, separated by whitespace.
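
The two pieces can also be combined in a single pass. The sketch below is one possible approach, not the only one: it groups on "is this character a space?" rather than on the character itself, so each word comes out as one run. The function name pic_clauses is made up for illustration, and header is built by hand to match the example output in the question rather than read from the file:

```python
from itertools import groupby

def pic_clauses(line):
    """Return one PIC X(n) clause per run of spaces or run of non-spaces."""
    clauses = []
    # Keying on "is a space" keeps a whole word together in one group,
    # instead of producing one group per repeated letter.
    for is_space, run in groupby(line, key=lambda ch: ch == ' '):
        text = ''.join(run)
        if is_space:
            clauses.append('PIC X(%d) VALUE SPACES.' % len(text))
        else:
            clauses.append('PIC X(%d) VALUE "%s".' % (len(text), text))
    return clauses

# Hand-built header: 30 spaces, AMOUNT, 8 spaces, DATE.
header = ' ' * 30 + 'AMOUNT' + ' ' * 8 + 'DATE'
for clause in pic_clauses(header):
    print(clause)
# PIC X(30) VALUE SPACES.
# PIC X(6) VALUE "AMOUNT".
# PIC X(8) VALUE SPACES.
# PIC X(4) VALUE "DATE".
```

When reading from a file, strip the trailing newline first (for example with line.rstrip('\n')), otherwise it ends up inside the last clause.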

There's no particular point in breaking up the output like that. You could:

     05  FILLER (optional) PIC X(width-of-report) VALUE
     "                              AMOUNT        DATE             "(in column 72)
-                         ".

The "-" is in column 7, and shows the continuation of an alphanumeric literal, which needs no opening quote, but needs a closing quote.

Your processing to create that is very simple. You always output those three lines; all you have to do is "chop" your data into 59 bytes (for the second line) and "the rest" (not knowing your report width) for the third line.
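
In Python, the "chop" step could look something like this. It is only a sketch of the splitting itself: the helper name chop is hypothetical, the width of 59 is taken from the description above, and the sample header is hand-built, so its widths are assumptions. Wrapping the two pieces in the COBOL lines is left as described in the answer.

```python
def chop(header, first_width=59):
    """Split a header line into its first `first_width` characters and the rest."""
    # Pad with spaces so the first piece is always exactly first_width long,
    # even when the header itself is shorter than that.
    padded = header.ljust(first_width)
    return padded[:first_width], padded[first_width:]

# Hand-built sample header (widths are illustrative only).
header = ' ' * 50 + 'AMOUNT' + ' ' * 7 + 'DATE' + ' ' * 7
first, rest = chop(header)
print(len(first), repr(rest))
```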
