Using python to count string occurrences in a flat file

I'm trying to complete an online course, and one exercise is to count the number of occurrences of the word "fantastic" in a large file. When an occurrence is found, the first element of that line (the id) needs to be stored, building up a list of the ids of lines containing the word. So far I have the code below, which reads the lines correctly, but I can't figure out how to check whether "fantastic" appears anywhere in a line, in upper or lower case. I've tried row.count('fantastic'), which didn't work, as I'm not sure how the csv reader stores the lines. If I can get the counting to work, I can add the id to an array whenever a line contains one or more occurrences and print the array at the end.

#!/usr/bin/python
import sys
import csv

def main():
    f = open("test_file.txt", 'rt')
    filereader = csv.reader(f, delimiter='      ', quotechar='"')
    for row in filereader:
        print row[0]
        print row.count('fantastic')

if __name__ == "__main__":
    main()

Below is a very small sample set where I've thrown in a few fantastic's.

"6361"  "When will unit 2 be online? fantastic"   "cs101 unit2"   "100003292"     "<p>When will unit 2 be online?</p>"    "question"      "\N"    "\N"    "2012-02-26 15:47:12.522262+00" "0"     "(closed)"      "51919" "100003292"     "2012-03-03 10:12:27.41521+00"  "21196" "\N"    "\N"    "186"   "t"
"7185"  "Hungarian group"       "cs101 hungarian nationalities" "100003268"     "<p>Hi there! This is FANTASTIC</p>
<p>Any Hungarians doing the course? We could form a group!<br>
;)</p>" "question"      "\N"    "\N"    "2012-02-27 15:09:11.184434+00" "0"     ""      "\N"    "100003268"     "2012-02-27 15:09:11.184434+00" "9322"  "\N"    "\N"    "106"   "f"
"26454" "Course Application."   "cs101 application."    "100003192"     "<p>Please tell about the Course Application.  How to use the Course for higher education and jobs?</p>" "question"      "\N"    "\N"    "2012-03-08 08:34:06.704674+00" "-1"    ""      "\N"    "100003192"     "2012-03-08 08:34:06.704674+00" "34477" "\N"    "\N"    "73"    "f"

I would expect the output to be 6361, 7185

You are close.

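First, check whether the fields are separated by tabs rather than spaces; if they are, use '\t' as the delimiter.

Second, csv.reader returns each row as a list of strings, so row.count('fantastic') only counts list elements that are exactly equal to 'fantastic'. You need to check each string in the list instead, either with any() or by using join to combine the fields into a single string.

Third, call lower() on the text before comparing, since 'FANTASTIC' is not the same as 'fantastic'.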

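import csv

def main():
    # 'with' closes the file even if an error occurs while reading.
    with open("test_file.txt", 'rt') as f:
        filereader = csv.reader(f, delimiter='\t')
        for row in filereader:
            # Case-insensitive substring check on every column after the id.
            if any('fantastic' in e.lower() for e in row[1:]):
                print(row[0])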
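
The join alternative mentioned above could look something like the following minimal sketch. It is written for Python 3, and the helper name `ids_with_word` and the two-row in-memory sample are illustrative assumptions, not part of the original answer:

```python
import csv
import io

def ids_with_word(lines, word="fantastic"):
    """Return the first column of every row whose remaining columns contain word."""
    ids = []
    for row in csv.reader(lines, delimiter="\t"):
        # Join the non-id columns into one string and search it case-insensitively.
        if word in " ".join(row[1:]).lower():
            ids.append(row[0])
    return ids

# Hypothetical two-row sample standing in for test_file.txt.
sample = io.StringIO('"1"\t"This is FANTASTIC"\t"tag"\n"2"\t"nothing here"\t"tag"\n')
print(ids_with_word(sample))
```

Joining with a space means a match can never straddle two columns as long as the search word itself contains no space; any() avoids building the combined string at all.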

To gather all the rows into a list, you might do something like:

def main():
    result = []  # ids of all rows that mention 'fantastic'
    with open("/tmp/so.csv", 'rt') as f:
        filereader = csv.reader(f, delimiter='\t', quotechar='"')
        for row in filereader:
            if any('fantastic' in e.lower() for e in row[1:]):
                result.append(row[0])
    print(result)

The default quote character is already ", so you don't need to specify it. If you've got a tab-delimited file, passing '\t' as the delimiter will split the columns correctly.
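
Both points can be checked quickly with a small sketch like this one (Python 3; the sample line is made up for illustration). Note that the delimiter has to be a single character, i.e. the tab written as '\t' in Python source:

```python
import csv
import io

# The default "excel" dialect already uses the double quote as its quote character.
print(csv.excel.quotechar)

# A made-up, tab-delimited line with quoted fields.
line = '"6361"\t"When will unit 2 be online? fantastic"\t"cs101 unit2"\n'

# A single tab character as the delimiter splits the line into three columns.
row = next(csv.reader(io.StringIO(line), delimiter='\t'))
print(len(row), row[0])

# A two-character string (backslash followed by t) is rejected as a delimiter.
try:
    csv.reader(io.StringIO(line), delimiter='\\t')
    rejected = False
except TypeError:
    rejected = True
print(rejected)
```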

What you can do is build a generator that keeps only the rows in which the substring 'fantastic' appears in any column after the ID, then use a list comprehension to extract the IDs, e.g.:

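import csv

with open('test_file.txt') as fin:
    csvin = csv.reader(fin, delimiter='\t')
    # Lazily keep only rows with 'fantastic' (any case) in a column after the ID.
    has_fantastic = (row for row in csvin if any('fantastic' in col.lower() for col in row[1:]))
    # Consume the generator while the file is still open.
    ids = [row[0] for row in has_fantastic]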
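
For an end-to-end check without a file on disk, the same idea can be run against data shaped like the question's sample. This is only a sketch for Python 3: the rows are abbreviated, the tab delimiter is assumed, and `find_ids` is a hypothetical helper that also totals the occurrences, since the question asks for a count as well.

```python
import csv
import io

# Abbreviated copy of the sample rows from the question (tab-delimited, assumed);
# the second record contains a quoted field that spans two lines.
SAMPLE = (
    '"6361"\t"When will unit 2 be online? fantastic"\t"cs101 unit2"\n'
    '"7185"\t"Hungarian group"\t"<p>Hi there! This is FANTASTIC</p>\n'
    '<p>Any Hungarians doing the course?</p>"\t"question"\n'
    '"26454"\t"Course Application."\t"cs101 application."\n'
)

def find_ids(lines, word="fantastic"):
    """Return (ids, total): ids of matching rows and the total number of hits."""
    ids, total = [], 0
    for row in csv.reader(lines, delimiter="\t"):
        # Count case-insensitive occurrences in every column after the id.
        hits = sum(col.lower().count(word) for col in row[1:])
        if hits:
            ids.append(row[0])
            total += hits
    return ids, total

ids, total = find_ids(io.StringIO(SAMPLE, newline=""))
print(", ".join(ids), total)
```

With this sample the matching ids are 6361 and 7185, with two occurrences in total. Because the csv reader treats the quoted multi-line field as part of one record, the "FANTASTIC" in the second row is still attributed to id 7185.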
