
Reading data from large text file with strings and floats in Python

I'm having trouble reading large amounts of data from a text file, then splitting it and removing certain items to get a more refined list. For example, let's say I have a text file, call it 'data.txt', that contains this data:

Some Header Here
Object Number = 1
Object Symbol = A
Mass of Object = 1
Weight of Object = 1.2040
Hight of Object = 0.394
Width of Object = 4.2304

Object Number = 2
Object Symbol = B
Mass Number = 2
Weight of Object = 1.596
Height of Object = 3.293
Width of Object = 4.654
.
.
. ...Same format continuing down

My problem is extracting the data I need from this file. Let's say I'm only interested in the Object Number and Mass of Object, which repeat throughout the file with different numerical values. I need a list of this data. For example:

Object Number    Mass of Object
1                1
2                2
.                .
.                .
.                .
etc.

With the headers excluded, of course, as this data will be applied to an equation. I'm very new to Python and don't have any knowledge of OOP. What would be the easiest way to do this? I know the basics of opening and writing to text files, and even a little about using the split and strip functions. I've researched quite a bit on this site about sorting data, but I can't get it to work for me.

Try this:

object_number = [] # list of Object Number
mass_of_object = [] # list of Mass of Object
with open('data.txt') as f:
    for line in f:
        if line.startswith('Object Number'):
            object_number.append(int(line.split('=')[1]))
        elif line.startswith('Mass of Object'):
            mass_of_object.append(int(line.split('=')[1]))
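
With two independent lists, the values can drift out of step if one record lacks a "Mass of Object" line (as the second record in the sample does). A minimal sketch of one way to keep each number paired with its mass is shown below. The function name `read_pairs` and the inline `SAMPLE` text are assumptions made for illustration, and masses are parsed as floats in case the real file contains decimals.

```python
import io

# Hypothetical sample in the same "key = value" layout as data.txt.
SAMPLE = """Some Header Here
Object Number = 1
Object Symbol = A
Mass of Object = 1
Weight of Object = 1.2040

Object Number = 2
Object Symbol = B
Mass of Object = 2
Weight of Object = 1.596
"""

def read_pairs(lines):
    """Return a list of (object_number, mass) tuples, one per complete record."""
    pairs = []
    number = None  # Object Number of the record currently being read
    for line in lines:
        key, sep, value = line.partition('=')
        if not sep:
            continue  # header or blank line: no "key = value" here
        key = key.strip()
        if key == 'Object Number':
            number = int(value)
        elif key == 'Mass of Object' and number is not None:
            pairs.append((number, float(value)))
            number = None  # wait for the next record
    return pairs

pairs = read_pairs(io.StringIO(SAMPLE))
print(pairs)
```

To read the real file, pass the open file object instead: `with open('data.txt') as f: pairs = read_pairs(f)`. Records without a mass line are simply skipped in this sketch.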

In my opinion, a dictionary (or one of its subclasses) is more efficient than a group of separate lists for huge data input.

Moreover, my code doesn't need any modification if you need to extract a new object field from your file, beyond adding the field's name to checklist.

checklist = ["Object Number", "Mass of Object"]
data = dict()

with open("data.txt") as f:
    # iterating over the file reads it one line at a time,
    # so the whole file is never held in memory
    for line in f:
        for regmatch in checklist:
            if line.startswith(regmatch):
                # strip the trailing newline, then keep the text after " = "
                val = line.rstrip()
                val = val.split(" = ")[1]
                data.setdefault(regmatch, []).append(val)

print(data)

This is the output for the sample data above (the second object's mass line reads "Mass Number", so only one mass is collected):

{'Object Number': ['1', '2'], 'Mass of Object': ['1']}
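
The collected values are still strings, and the question says they will be fed into an equation. A small sketch of converting them to numbers follows. The helper name `to_numbers` is an assumption for illustration, and the `data` literal simply repeats the output shown above.

```python
# The dictionary as printed above: every value is still a string.
data = {'Object Number': ['1', '2'], 'Mass of Object': ['1']}

def to_numbers(values):
    """Convert a list of numeric strings to floats."""
    return [float(v) for v in values]

numeric = {key: to_numbers(vals) for key, vals in data.items()}
print(numeric)

# Lists of different lengths mean some record was missing a field,
# so the values can no longer be matched up by position.
lengths = {key: len(vals) for key, vals in numeric.items()}
if len(set(lengths.values())) > 1:
    print("Warning: fields have different counts:", lengths)
```

The length check is worth keeping when the input is large, since a single misspelled field name would otherwise shift every later value by one position.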

Here is some theory about speed, here are some tips about performance optimization, and here is a discussion of the dependency between data type and implementation efficiency.

Finally, some examples using re (regular expressions):

https://docs.python.org/2/howto/regex.html
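
As a rough illustration of the regular-expression route, the sketch below matches the two wanted fields line by line. The pattern, the name `extract_fields`, and the inline sample text are assumptions made for this example, not part of the original answer.

```python
import re
from collections import defaultdict

# Match "<field name> = <value>" at the start of a line, for the two wanted fields.
FIELD_RE = re.compile(r'(Object Number|Mass of Object)[ \t]*=[ \t]*(\S+)')

def extract_fields(lines):
    """Collect the numeric values of each wanted field, keyed by field name."""
    found = defaultdict(list)
    for line in lines:
        m = FIELD_RE.match(line)  # match() only succeeds at the start of the line
        if m:
            found[m.group(1)].append(float(m.group(2)))
    return dict(found)

# Hypothetical sample in the same layout as data.txt.
sample = """Some Header Here
Object Number = 1
Mass of Object = 1
Weight of Object = 1.2040

Object Number = 2
Mass of Object = 2
"""

result = extract_fields(sample.splitlines())
print(result)
```

Because the function only needs an iterable of lines, an open file object can be passed directly, so a large file is still processed one line at a time.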

