Parsing unique words from a text file
I'm working on a project to parse out unique words from a large number of text files. I've got the file handling down, but I'm trying to refine the parsing procedure. Each file has a specific text segment that ends with certain phrases that I'm catching with a regex on my live system.
The parser should walk through each line and check each word against 3 criteria:

1. Longer than 2 characters
2. Not in the dictionary set dict_file
3. Not already in the word list
The result should be a 2D array, each row a list of unique words per file, which is written to a CSV file using the .writerow(foo) method after each file is processed.
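For the CSV step, a minimal sketch using the stdlib csv module (the file name and sample rows here are made up for illustration):

```python
import csv

# One inner list of unique words per processed file (hypothetical data).
rows = [["alpha", "beta"], ["gamma", "delta"]]

f = open("unique_words.csv", "w")
writer = csv.writer(f)
for row in rows:
    writer.writerow(row)  # .writerow(foo): one CSV row per file
f.close()
```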
My working code's below, but it's slow and kludgy; what am I missing? My production system is running Python 2.5.1 with just the default modules (so NLTK is a no-go) and can't be upgraded to 2.7+.
def process(line):
    # punct is a 256-character translation table built during
    # initialization, e.g. punct = string.maketrans('', ''), so that
    # translate() only deletes the punctuation characters.
    line_strip = line.strip()
    return line_strip.translate(punct, string.punctuation)
# Directory walking and initialization here
# (on Python 2.5, "with" needs: from __future__ import with_statement)
report_set = set()
with open(fullpath, 'r') as report:
    for line in report:
        # Strip out the CR/LF and punctuation from the input line
        line_check = process(line)
        if line_check == "FOOTNOTES":
            break
        for word in line_check.split():
            word_check = word.lower()
            if ((word_check not in report_set) and (word_check not in dict_file)
                    and (len(word) > 2)):
                report_set.add(word_check)  # sets use .add(), not .append()
report_list = list(report_set)
Edit: Updated my code based on steveha's recommendations.
One problem is that an in test on a list is slow. You should probably keep a set to keep track of what words you have seen, because the in test on a set is very fast.
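A rough way to see the difference (a sketch using timeit; absolute numbers will vary by machine):

```python
import timeit

words = [str(i) for i in range(10000)]
as_list = list(words)
as_set = set(words)

# Looking up an element near the end: the list scans linearly,
# while the set does a single hash lookup.
t_list = timeit.timeit(lambda: "9999" in as_list, number=1000)
t_set = timeit.timeit(lambda: "9999" in as_set, number=1000)
print(t_list > t_set)  # the list membership test takes noticeably longer
```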
Example:

report_set = set()
for line in report:
    for word in line.split():
        if we_want_to_keep_word(word):
            report_set.add(word)
Then when you are done: report_list = list(report_set)
Anytime you need to force a set into a list, you can. But if you just need to loop over it or do in tests, you can leave it as a set; it's legal to do:

for x in report_set:
Another problem, which might or might not matter, is that you are slurping all the lines from the file in one go using the .readlines() method. For really large files it is better to just use the open file-handle object as an iterator, like so:

with open("filename", "r") as f:
    for line in f:
        ...  # process each line here
A big problem is that I don't even see how this code can work:

while 1:
    lines = report.readlines()
    if not lines:
        break

This will not loop forever. The first pass slurps all input lines with .readlines(); then we loop again, and on the next call report is already exhausted, so .readlines() returns an empty list, which breaks out of the loop. But this has now lost all the lines we just read, and the rest of the code must make do with an empty lines variable. How does this even work?
So, get rid of that whole while 1 loop, and change the next loop to: for line in report:
Also, you don't really need to keep a count variable. You can use len(report_set) at any time to find out how many words are in the set.
Also, with a set you don't actually need to check whether a word is in the set; you can just always call report_set.add(word), and if it's already in the set it won't be added again!
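For instance, adding the same word twice leaves the set unchanged:

```python
report_set = set()
report_set.add("word")
report_set.add("word")  # already present: no effect, no error
print(len(report_set))  # 1
```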
Also, you don't have to do it my way, but I like to make a generator that does all the processing: strip the line, translate the line, split on whitespace, and yield up words ready to use. I would also force the words to lower-case, except I don't know whether it's important that FOOTNOTES be detected only in upper-case.
So, put all the above together and you get:

def words(file_object):
    for line in file_object:
        # Note: translate(None, deletechars) needs Python 2.6+; on 2.5,
        # pass an identity table instead, e.g. string.maketrans('', '')
        line = line.strip().translate(None, string.punctuation)
        for word in line.split():
            yield word

report_set = set()
with open(fullpath, 'r') as report:
    for word in words(report):
        if word == "FOOTNOTES":
            break
        word = word.lower()
        if len(word) > 2 and word not in dict_file:
            report_set.add(word)

print("Words in report_set: %d" % len(report_set))
Try replacing report_list with a dictionary or set. The test word_check not in report_list is slow if report_list is a list.