
Using Python to count string occurrences in a flat file

I'm trying to complete an online course, and one question asks for the number of occurrences of the word "fantastic" in a large file. When an occurrence is found, the first element of that line (the id) needs to be stored, to build up a list of the ids of the lines containing the word. So far I have the code below, which reads the lines correctly, but I can't figure out how to check whether "fantastic" appears anywhere in a line, in upper or lower case. I've tried row.count('fantastic'), which didn't work, as I'm not sure how the csv reader stores the lines. If I can get the counting to work, I can just add the id to an array whenever a line has one or more occurrences and print the array at the end.

#!/usr/bin/python
import sys
import csv

def main():
    f = open("test_file.txt", 'rt')
    filereader = csv.reader(f, delimiter='      ', quotechar='"')
    for row in filereader:
        print row[0]
        print row.count('fantastic')

if __name__ == "__main__":
    main()

Below is a very small sample set where I've thrown in a few occurrences of "fantastic".

"6361"  "When will unit 2 be online? fantastic"   "cs101 unit2"   "100003292"     "<p>When will unit 2 be online?</p>"    "question"      "\N"    "\N"    "2012-02-26 15:47:12.522262+00" "0"     "(closed)"      "51919" "100003292"     "2012-03-03 10:12:27.41521+00"  "21196" "\N"    "\N"    "186"   "t"
"7185"  "Hungarian group"       "cs101 hungarian nationalities" "100003268"     "<p>Hi there! This is FANTASTIC</p>
<p>Any Hungarians doing the course? We could form a group!<br>
;)</p>" "question"      "\N"    "\N"    "2012-02-27 15:09:11.184434+00" "0"     ""      "\N"    "100003268"     "2012-02-27 15:09:11.184434+00" "9322"  "\N"    "\N"    "106"   "f"
"26454" "Course Application."   "cs101 application."    "100003192"     "<p>Please tell about the Course Application.  How to use the Course for higher education and jobs?</p>" "question"      "\N"    "\N"    "2012-03-08 08:34:06.704674+00" "-1"    ""      "\N"    "100003192"     "2012-03-08 08:34:06.704674+00" "34477" "\N"    "\N"    "73"    "f"

I would expect the output to be 6361, 7185.

You are close.

First, check whether the delimiters in your file are tabs rather than spaces.

Second, if you use csv, each row comes back as a list of strings. You need to check each string in the list. You can either use any() or join() the fields into a single string.

Third, you need to use lower(), since 'FANTASTIC' is not the same as 'fantastic'.

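import csv

def main():
    # csv.reader yields each row as a list of strings
    with open("test_file.txt", 'rt') as f:
        filereader = csv.reader(f, delimiter='\t')
        for row in filereader:
            # Case-insensitive substring check on every column after the id
            if any('fantastic' in e.lower() for e in row[1:]):
                print(row[0])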
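
As for why row.count('fantastic') in the question did not help: each row is a list of strings, and list.count only counts elements that equal the argument exactly. Below is a minimal sketch of that, and of the join alternative mentioned above. The row is made up for illustration rather than read from the real file, and the code is written for Python 3.

```python
# Hypothetical row, shaped like what csv.reader yields: a list of strings.
row = ["7185", "Hungarian group", "<p>Hi there! This is FANTASTIC</p>"]

# list.count only counts elements equal to the whole argument,
# so a word buried inside a longer field is never counted.
exact_matches = row.count("fantastic")

# Alternative to any(): join the non-id columns into one string,
# lowercase it, and do a single substring test.
joined = " ".join(row[1:]).lower()
found = "fantastic" in joined

# str.count gives the number of non-overlapping occurrences in the row.
occurrences = joined.count("fantastic")

print(exact_matches, found, occurrences)
```

Using str.count on the joined string is one way to get a per-row occurrence count if the number of hits matters, not just whether there is one.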

To gather the ids of all matching rows into a list, you might do something like:

import csv

def main():
    result = []
    with open("/tmp/so.csv", 'rt') as f:
        filereader = csv.reader(f, delimiter='\t', quotechar='"')
        for row in filereader:
            # Keep the id (first column) of every row that mentions the word
            if any('fantastic' in e.lower() for e in row[1:]):
                result.append(row[0])
    print(result)

The default quote character is already ", so you don't need to specify it. If you've got a tab-delimited file, passing '\t' as the delimiter will split the columns correctly.
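
To illustrate, here is a small sketch (Python 3) that parses made-up tab-separated data from an in-memory buffer instead of the real file. The sample string is an assumption modelled loosely on the question's data, including a quoted field that spans two lines. It relies only on the default quote character.

```python
import csv
import io

# Made-up sample mimicking the question's layout: tab-separated, double-quoted
# fields, with one quoted field that contains a line break.
sample = (
    '"6361"\t"When will unit 2 be online? fantastic"\t"question"\n'
    '"7185"\t"Hungarian group"\t"<p>This is FANTASTIC</p>\n<p>Second line</p>"\n'
)

# No quotechar argument: the default is already the double quote.
reader = csv.reader(io.StringIO(sample), delimiter="\t")
rows = list(reader)

for row in rows:
    # Each row is a list of strings; the quotes have been stripped.
    print(len(row), row[0])
```

The line break inside the quoted field stays part of that field, so the second record is still a single three-column row.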

What you can do is build a generator that filters rows based on whether the substring 'fantastic' appears in any column after the ID, then use a list comprehension to extract the IDs, e.g.:

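import csv

with open('test_file.txt') as fin:
    csvin = csv.reader(fin, delimiter='\t')
    # Lazily keep only the rows where some column after the id contains the word
    has_fantastic = (row for row in csvin if any('fantastic' in col.lower() for col in row[1:]))
    # Consuming the generator here, while the file is still open, collects the ids
    ids = [row[0] for row in has_fantastic]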
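
The same idea can be wrapped in a function and tried on in-memory data. This is only a sketch for Python 3: the function name ids_containing and the sample rows are hypothetical, and the search word is assumed to be given in lower case.

```python
import csv
import io

def ids_containing(lines, word="fantastic"):
    """Return the first column of every row whose other columns contain word."""
    reader = csv.reader(lines, delimiter="\t")
    # Generator: rows are filtered lazily as the reader is consumed.
    matching = (row for row in reader
                if any(word in col.lower() for col in row[1:]))
    # The list comprehension is what actually drives the reading, so it has
    # to run while the underlying file (or buffer) is still open.
    return [row[0] for row in matching]

# Made-up tab-separated data standing in for test_file.txt.
sample = io.StringIO(
    '"6361"\t"When will unit 2 be online? fantastic"\n'
    '"7185"\t"<p>Hi there! This is FANTASTIC</p>"\n'
    '"26454"\t"Course Application."\n'
)
ids = ids_containing(sample)
print(ids, len(ids))
```

len(ids) gives the number of lines that contain the word at least once, which may or may not be the count the course question is after.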
