
How to find phrases in a text file

My text file is this:

123 Numbers 4.5
456 Words 6.7
789 Sentences 8.9

And my code is this:

s = open('test.txt', 'r')
file = s.read()
numbers, words, decimals = [], [], []

I've gotten this far, and I'm trying to work out how to create a list for all the numbers, words and decimals in the file. I've heard you can use the split method, so I tried this:

with open('test.txt', 'r') as f:
    for line in f:
        numbers, words, decimals = f.split(","), f.split(","), f.split(",")

I did this assuming it would split every time it encountered a space, but that didn't happen; I just got the error:

AttributeError: '_io.TextIOWrapper' object has no attribute 'split'

Any help would be appreciated. If any elaboration is necessary on what I want to do, please tell me; I'm aware this may have been worded poorly.

First of all, the text file you've posted does not have commas separating the columns, so splitting the string at commas won't work. If you can trust that every line of the file is identical in structure, then you can simply change your code to

numbers, words, decimals = [], [], []
with open('test.txt', 'r') as f:
    for line in f:
        number, word, decimal = line.split() 
        numbers.append(number)
        words.append(word)
        decimals.append(decimal)
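
# Or, more compactly, transpose the rows with zip
# (each result is then a tuple rather than a list):
with open('test.txt', 'r') as f:
    numbers, words, decimals = zip(*(line.split() for line in f))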
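
To compare what the loop and the zip approach produce, here is a minimal sketch that runs both on the sample data from the question. It uses io.StringIO as a stand-in for the real file, which is an assumption made only so the snippet runs without test.txt; the helper names columns_with_loop and columns_with_zip are hypothetical.

```python
import io

# Sample data copied from the question; StringIO stands in for open('test.txt')
SAMPLE = "123 Numbers 4.5\n456 Words 6.7\n789 Sentences 8.9\n"


def columns_with_loop(f):
    """Collect each column into its own list, one line at a time."""
    numbers, words, decimals = [], [], []
    for line in f:
        number, word, decimal = line.split()  # split on any whitespace
        numbers.append(number)
        words.append(word)
        decimals.append(decimal)
    return numbers, words, decimals


def columns_with_zip(f):
    """Transpose the rows into columns; each column comes back as a tuple."""
    return tuple(zip(*(line.split() for line in f)))


loop_result = columns_with_loop(io.StringIO(SAMPLE))
zip_result = columns_with_zip(io.StringIO(SAMPLE))

print(loop_result)
print(zip_result)
```

Both variants keep every field as a string. The zip version is shorter, but it fails with a less obvious error if a line has a different number of fields, so the explicit loop may be easier to debug.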

You want to split each line into fields:

numbers, words, decimals = [], [], []
with open('test.txt', 'r') as f:
    for line in f:
        # split on whitespace; the example file does not use commas
        number, word, decimal = line.split()
        numbers.append(int(number))
        words.append(word)
        decimals.append(float(decimal))

If you really intend to use real decimals, then you should use decimal.Decimal instead of float.
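
As a rough illustration of why that choice can matter, the sketch below parses one line from the question both ways and then compares float and Decimal arithmetic. The 0.1 + 0.2 comparison is a standard example and is not taken from the question's file.

```python
from decimal import Decimal

line = "123 Numbers 4.5"
number, word, decimal_text = line.split()

as_float = float(decimal_text)      # binary floating point
as_decimal = Decimal(decimal_text)  # exact decimal, built from the string

# Binary floats cannot represent most decimal fractions exactly,
# so sums can be slightly off; Decimal keeps them exact.
float_sum_is_exact = (0.1 + 0.2 == 0.3)
decimal_sum_is_exact = (Decimal("0.1") + Decimal("0.2") == Decimal("0.3"))

print(as_float, as_decimal)
print(float_sum_is_exact, decimal_sum_is_exact)
```

Build the Decimal from the string read from the file, not from a float, otherwise the rounding error of the float is carried over.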

Unless you are constrained in some way, I'd recommend using a library designed for working with tabular data, e.g. pandas, where all this would be just

import pandas as pd
# sep=r'\s+' splits on runs of whitespace; header=None because the file has no header row
df = pd.read_table('test.txt', sep=r'\s+', header=None)

It should be line.split and not f.split, since you're splitting the line and not the file. Also, you're splitting on commas, but the example file is separated by spaces. If it is separated by spaces, you need to use line.split(" "). Also, when using with open() as f you don't need to open your file beforehand or close it afterwards, as it handles that for you. Also, you were saving the entire split line to each variable and overwriting them each time. Overall code:

numbers, words, decimals = [], [], []
with open('test.txt', 'r') as f:
    for line in f:
        # strip the trailing newline so it doesn't end up in the last field
        fields = line.rstrip("\n").split(" ")
        numbers.append(fields[0])
        words.append(fields[1])
        decimals.append(fields[2])
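
One detail worth checking when choosing between line.split(" ") and line.split(): lines read from a file normally end with a newline, and the two calls treat it differently. A small sketch using a line from the question (the doubled-space variant is a made-up input for illustration):

```python
line = "789 Sentences 8.9\n"

# Splitting on a single space keeps the trailing newline in the last field
by_space = line.split(" ")

# Splitting with no argument splits on any run of whitespace and drops the newline
by_whitespace = line.split()

# With two consecutive spaces, split(" ") also produces an empty field
double_spaced = "789  Sentences 8.9\n".split(" ")

print(by_space)
print(by_whitespace)
print(double_spaced)
```

For whitespace-separated columns, the no-argument form is usually the safer default.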

If I understand your question correctly, what you should be looking at is actually nltk. That will give you insight into how to tokenize your text based on either words or sentences. The rest should be easy.
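
For a file as regular as the one in the question, a tokenizer library may be more than is needed. As a rough stand-in that uses only the standard-library re module (this is not nltk and does not reproduce its tokenizers), the sketch below sorts whitespace-separated tokens into numbers, words and decimals by their shape. The function name classify and both patterns are assumptions made for illustration.

```python
import re

# Hypothetical patterns: digits only, or digits with one decimal point
INTEGER = re.compile(r"\d+")
DECIMAL = re.compile(r"\d+\.\d+")


def classify(text):
    """Sort whitespace-separated tokens into integers, words and decimals."""
    numbers, words, decimals = [], [], []
    for token in text.split():
        if INTEGER.fullmatch(token):
            numbers.append(int(token))
        elif DECIMAL.fullmatch(token):
            decimals.append(float(token))
        else:
            words.append(token)  # anything that is not numeric
    return numbers, words, decimals


sample = "123 Numbers 4.5\n456 Words 6.7\n789 Sentences 8.9\n"
numbers, words, decimals = classify(sample)
print(numbers, words, decimals)
```

Unlike the column-based approaches, this does not depend on the order of the fields within a line.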

a, b, c = [], [], []
with open('new.txt', 'r') as f:
    for line in f:
        m = line.split()  # split the line on whitespace
        a.append(m[0])
        b.append(m[1])
        c.append(m[2])
print(a, b, c)

Check if this is what you wanted to achieve.


 