Counting string occurrences in a flat file using Python

Question

I'm working through an online course, and one exercise is to count the occurrences of the word "fantastic" in a large file. Whenever a line contains the word, the first element of that line (the id) needs to be stored, to build up a list of the ids of lines containing it. So far I have the code below, which reads the lines correctly, but I can't figure out how to check whether "fantastic" appears anywhere in the line, in upper or lower case. I tried row.count('fantastic'), which didn't work, and I'm not sure how the csv reader stores the lines. If I can get the count working, I can add the id to an array whenever a line has one or more occurrences and print the array at the end.

#!/usr/bin/python
import sys
import csv

def main():
    f = open("test_file.txt", 'rt')
    filereader = csv.reader(f, delimiter='      ', quotechar='"')
    for row in filereader:
        print row[0]
        print row.count('fantastic')

if __name__ == "__main__":
    main()

Below is a very small sample set in which I've included a few occurrences of "fantastic".

"6361"  "When will unit 2 be online? fantastic"   "cs101 unit2"   "100003292"     "<p>When will unit 2 be online?</p>"    "question"      "\N"    "\N"    "2012-02-26 15:47:12.522262+00" "0"     "(closed)"      "51919" "100003292"     "2012-03-03 10:12:27.41521+00"  "21196" "\N"    "\N"    "186"   "t"
"7185"  "Hungarian group"       "cs101 hungarian nationalities" "100003268"     "<p>Hi there! This is FANTASTIC</p>
<p>Any Hungarians doing the course? We could form a group!<br>
;)</p>" "question"      "\N"    "\N"    "2012-02-27 15:09:11.184434+00" "0"     ""      "\N"    "100003268"     "2012-02-27 15:09:11.184434+00" "9322"  "\N"    "\N"    "106"   "f"
"26454" "Course Application."   "cs101 application."    "100003192"     "<p>Please tell about the Course Application.  How to use the Course for higher education and jobs?</p>" "question"      "\N"    "\N"    "2012-03-08 08:34:06.704674+00" "-1"    ""      "\N"    "100003192"     "2012-03-08 08:34:06.704674+00" "34477" "\N"    "\N"    "73"    "f"

I would expect the output to be 6361, 7185.

Answer 1

You are close.

First, check whether the columns in your file are separated by tabs rather than spaces.

Second, if you use csv, each row comes back as a list of strings. You need to check each string in the list. You can either use any, or use join to make a single string.

Third, you need to use lower(), since 'FANTASTIC' is not the same as 'fantastic'.

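import csv

def main():
    # The sample data is tab-delimited, so split on '\t'.
    f = open("test_file.txt", 'rt')
    filereader = csv.reader(f, delimiter='\t')
    for row in filereader:
        # Case-insensitive substring check on every column after the id.
        if any('fantastic' in e.lower() for e in row[1:]):
            print(row[0])
    f.close()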
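The join alternative mentioned above can be sketched as follows. This is a minimal illustration written for Python 3, not part of the original answer: the helper name ids_with_word and the two-row SAMPLE are hypothetical, and io.StringIO stands in for the real file. It also shows why row.count('fantastic') returned 0 in the question: list.count only counts elements equal to the whole string.

```python
import csv
import io

# Hypothetical two-row, tab-delimited sample standing in for test_file.txt.
SAMPLE = (
    '"6361"\t"When will unit 2 be online? fantastic"\t"question"\n'
    '"26454"\t"Course Application."\t"question"\n'
)

def ids_with_word(lines, word="fantastic"):
    """Return the first column of every row whose other columns contain word."""
    matches = []
    for row in csv.reader(lines, delimiter="\t"):
        # row is a list of strings; row.count(word) would only count
        # columns that are exactly equal to word, not substrings.
        joined = " ".join(row[1:]).lower()
        if word in joined:
            matches.append(row[0])
    return matches

print(ids_with_word(io.StringIO(SAMPLE)))
```

Joining with a space (rather than an empty string) keeps the word from being matched across a column boundary.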

To gather the matching ids into a list, you might do something like:

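def main():
    result = []
    # The with block closes the file automatically.
    with open("/tmp/so.csv", 'rt') as f:
        filereader = csv.reader(f, delimiter='\t', quotechar='"')
        for row in filereader:
            # Keep the id of every row that mentions the word in any later column.
            if any('fantastic' in e.lower() for e in row[1:]):
                result.append(row[0])
    print(result)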
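Since the question also asks for the number of occurrences, here is a hedged sketch (Python 3) that tracks a running total alongside the ids. The function name count_occurrences and the SAMPLE rows are made up for illustration. Note that str.count counts non-overlapping substrings, so a word such as "fantastically" would be counted as well.

```python
import csv
import io

# Hypothetical tab-delimited rows; the first column is the id.
SAMPLE = (
    '"1"\t"fantastic, just Fantastic"\t"x"\n'
    '"2"\t"nothing here"\t"fantastic"\n'
    '"3"\t"plain"\t"y"\n'
)

def count_occurrences(lines, word="fantastic"):
    """Return (ids of matching rows, total occurrences of word)."""
    ids = []
    total = 0
    for row in csv.reader(lines, delimiter="\t"):
        # Sum the case-insensitive substring counts over the non-id columns.
        hits = sum(col.lower().count(word) for col in row[1:])
        if hits:
            ids.append(row[0])
            total += hits
    return ids, total

ids, total = count_occurrences(io.StringIO(SAMPLE))
print(ids, total)
```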

Answer 2

The default quote character is already ", so you don't need to specify it. If you've got a tab-delimited file, passing '\t' as the delimiter will interpret the columns correctly.

What you can do is build a generator that filters rows on whether the substring 'fantastic' appears in any column after the ID, then use a list comprehension to extract the IDs, e.g.:

import csv

with open('test_file.txt') as fin:
    csvin = csv.reader(fin, delimiter='\t')
    # Lazily yield only the rows that mention 'fantastic' after the ID column.
    has_fantastic = (row for row in csvin if any('fantastic' in col.lower() for col in row[1:]))
    ids = [row[0] for row in has_fantastic]
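To try the same approach without the real file, the sketch below (Python 3) feeds a trimmed-down version of the question's sample through io.StringIO. The find_ids wrapper and the shortened SAMPLE are assumptions made for illustration. The second record contains a newline inside a quoted field, which csv.reader treats as part of that field rather than as a new row.

```python
import csv
import io

# Trimmed, hypothetical version of the question's tab-delimited sample.
SAMPLE = (
    '"6361"\t"When will unit 2 be online? fantastic"\t"question"\n'
    '"7185"\t"Hungarian group"\t"<p>Hi there! This is FANTASTIC</p>\n'
    '<p>Any Hungarians doing the course?</p>"\t"question"\n'
    '"26454"\t"Course Application."\t"question"\n'
)

def find_ids(fin):
    """Return the IDs of rows with 'fantastic' in any column after the ID."""
    csvin = csv.reader(fin, delimiter="\t")
    has_fantastic = (
        row for row in csvin
        if any("fantastic" in col.lower() for col in row[1:])
    )
    return [row[0] for row in has_fantastic]

print(find_ids(io.StringIO(SAMPLE)))
```

When reading from a real file in Python 3, the csv module documentation recommends opening it with newline='' so that newlines embedded in quoted fields are passed through unchanged.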

Answer 1: 2015-10-08 10:53:49

Answer 2 (accepted): 2015-10-08 10:58:16
