
Python - count key value pairs from text file

I have the following text file:

abstract 233:1 253:1 329:2 1087:2 1272:1
game 64:1 99:1 206:1 595:1
direct 50:1 69:1 1100:1 1765:1 2147:1 3160:1

Each key-value pair records how many times the string appears in a document, in the form [docID]:[stringFq].

How could you calculate the number of key-value pairs in this text file?

Your regex approach works fine. Here is an iterative approach. If you uncomment the print statements, you will see some intermediate results.

Given

%%file foo.txt
abstract 233:1 253:1 329:2 1087:2 1272:1
game 64:1 99:1 206:1 595:1
direct 50:1 69:1 1100:1 1765:1 2147:1 3160:1

Code

import itertools as it


with open("foo.txt") as f:
    lines = f.readlines()
    #print(lines)
    pred = lambda x: x.isalpha()      # True while a character is a letter

    count = 0
    for line in lines:
        line = line.strip("\n")                    # remove the trailing newline
        line = "".join(it.dropwhile(pred, line))   # drop the leading key
        pairs = line.strip().split(" ")            # remaining docID:freq pairs
        #print(pairs)
        count += len(pairs)

count
# 15

Details

First we use a with statement, which is an idiom for safely opening and closing files. We then split the file into lines via readlines(). We define a conditional function (or predicate) that we will use later. The lambda expression is used for convenience and is equivalent to the following function:

def pred(x):
    return x.isalpha()

We initialize a count variable and start iterating over the lines. Every line may end with a trailing newline character \n, so we first strip() it away before feeding the line to dropwhile.

dropwhile is a special itertools iterator. As it iterates a line, it will discard any leading characters that satisfy the predicate until it reaches the first character that fails the predicate. In other words, all letters at the start will be dropped until the first non-letter is found (which happens to be a space). We clean the new line again, stripping the leading space, and the remaining string is split() into a list of pairs .
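To make the behaviour of dropwhile concrete, here is a minimal sketch on a single line taken from the sample file. It passes str.isalpha directly as the predicate, which behaves the same as the lambda above; the variable names rest and pairs are only illustrative.

```python
import itertools as it

line = "abstract 233:1 253:1 329:2 1087:2 1272:1"

# Discard leading characters while they are letters; iteration resumes
# at the first non-letter (the space after the key) and keeps the rest.
rest = "".join(it.dropwhile(str.isalpha, line))

# Strip the leading space, then split the remainder into pairs.
pairs = rest.strip().split(" ")

print(repr(rest))
print(pairs)
```

Note that the key is dropped only because it consists purely of letters; a key containing a digit or punctuation would stop dropwhile early.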

Finally, the length of each list of pairs is added to count. The final count is the sum of the lengths of all the pairs lists.
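As a possible alternative to dropwhile, the same count can be obtained by splitting each line on whitespace and skipping the first token. This is only a sketch: the helper name count_pairs is hypothetical, it assumes the key is always the first token on a line, and it works on an in-memory list of lines rather than on foo.txt so that it runs on its own.

```python
def count_pairs(lines):
    """Count the docID:freq tokens that follow the key on each line."""
    total = 0
    for line in lines:
        tokens = line.split()   # split on any whitespace; blank lines give []
        # tokens[0] is the key; count only the tokens that look like pairs.
        total += sum(1 for token in tokens[1:] if ":" in token)
    return total


sample = [
    "abstract 233:1 253:1 329:2 1087:2 1272:1\n",
    "game 64:1 99:1 206:1 595:1\n",
    "direct 50:1 69:1 1100:1 1765:1 2147:1 3160:1\n",
]

result = count_pairs(sample)
print(result)
```

Because split() with no argument ignores repeated spaces and trailing newlines, a blank line contributes nothing to the count here.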

Summary

The code above shows how to tackle basic file handling with simple, iterative steps:

  • open the file
  • split the file into lines
  • while iterating each line, clean and process data
  • output a result

import re


# Read the whole file; the with statement closes it afterwards.
with open('input.txt', 'r') as file:
    text = file.read()

# Find every number in the text file.
numbers = re.findall(r"[-+]?\d*\.\d+|\d+", text)
# Each pair contributes two numbers (docID and frequency),
# so the number of pairs is half the number of matches.
numLen = len(numbers) // 2

print(numLen)
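A variation on the regex approach is to match each docID:freq pair as a whole, which avoids dividing by 2 and is not thrown off by stray digits elsewhere in the text. The sketch below assumes both parts of a pair are plain non-negative integers, as in the sample data, and uses an inline string in place of the input file.

```python
import re

text = """abstract 233:1 253:1 329:2 1087:2 1272:1
game 64:1 99:1 206:1 595:1
direct 50:1 69:1 1100:1 1765:1 2147:1 3160:1
"""

# One match per pair: digits, a colon, then digits.
pairs = re.findall(r"\d+:\d+", text)

print(len(pairs))
```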
