Reading a text file with mixed data types in Python

Question

I have been given some 'reports' from another piece of software that contain data I need to use. The file is quite simple. It has a description line starting with a # that gives the variable name/description, followed by comma-separated data on the next line.

For example:

    #wavelength,'<a comment describing the data>'
    400.0,410.0,420.0, <and so on>
    #reflectance,'<a comment describing the data>'
    0.001,0.002,0.002, <and so on>
    #date,'time file was written'
    2012-03-06 13:12:36.694597  < this is the bit that stuffs me up!! >

When I first typed up some code I expected all the data to be read as floats, but I have discovered some dates and strings. For my purposes, all I care about is the data that should be arrays of floats. Everything else I read in (such as dates) can be treated as strings (even if they are technically dates, for example).

My first attempt, which worked until I found non-floats, basically ignores the # and then grabs the characters following it, making a dictionary whose key is the characters it just read. Then I made the entry for the key an array by splitting on the commas and stacking rows for 2-D data, similar to the next section of code.

    data = f.readlines()
    dataLines = data.split('\n')

    for i in range(0,len(dataLines)-1):
        if dataLines[i][0] == '#':
            key,comment = dataLines[i].split(',')
            keyList.append(key[1:])
            k+=1
        else: # it must be data
            d+=1
            dataList.append(dataLines[i])

        for j in range(0,len(dataList)):
            tmp = dataList[j]

            x = map(float,tmp.split(','))
            tempData = vstack((tempData,asarray(x)))

    self.__report[keyList[k]] = tempData

When I find a non-float in my file, the line "x = map(float,tmp.split(','))" fails (there are no commas in that line of data). I thought I would test whether it is a string using isinstance, but the file reader treats all of the data coming in from the file as a string (of course). I then tried converting the line from the file to a float array, thinking that if it fails I could just treat it as an array of strings, like this.

     try:
         scipy.array(tmp,dtype=float64)  #try to convert
         x = map(float,tmp.split(','))

     except:# ValueError: # must be a string
         x = zeros((1,1))
         x = asarray([tmp])
         #tempData = vstack((tempData,asarray(x)),dtype=str)
         if 'tempData' in locals():
             pass
         else:
             tempData = zeros((len(x)))

         tempData = vstack((tempData,asarray(x)))

This, however, results in EVERYTHING being read in as a character array, and as such I cannot index the data as a numpy array. All of the data is there in the dictionary, but the dtype is |S8, for example. It seems the try block is going straight to the exception.

I would appreciate any advice on getting this to work so I can discriminate between floats and strings. I don't know the order of the data before I get the report.

Also, the big files can take quite a long time to load into memory, so any advice on how to make this more efficient would also be appreciated.

Thanks

Answer 1

I'm assuming that what you are ultimately interested in is x, which should be in the format [400.0, 410.0, 420.0].

One way to handle this is to separate the splitting on commas and the conversion to float into two different statements, so that you can catch the ValueError raised when an element is a string rather than a float or int.

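    keyList = []
    dataList = []
    with open('sample_data', 'r') as f:
        for line in f:  # iterate over the file line by line
            line = line.strip()
            if not line:
                continue  # skip blank lines
            if line.startswith("#"):
                # split on the first comma only; the comment may contain commas
                key, comment = line.split(',', 1)
                keyList.append(key[1:])
            else:  # it must be data
                dataList.append(line)

    for data in dataList:
        data_list = data.split(',')
        try:
            # a list comprehension converts eagerly, so ValueError is raised here
            x = [float(v) for v in data_list]
        except ValueError:
            # not numeric (e.g. a date): keep the raw string
            x = data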

Also note the other minor changes I've made to your code, which make it more Pythonic.
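
Putting these pieces together, a minimal sketch of a complete reader might look like the following. It assumes the layout described in the question (a `#key,comment` line followed by one or more data lines). The function name `parse_report` and the decision to keep non-numeric lines as plain strings are illustrative assumptions, not part of the original report format. It uses plain lists so that it runs without NumPy; `numpy.asarray` could be applied to the numeric entries afterwards.

```python
def parse_report(lines):
    """Parse '#key,comment' header lines followed by data lines.

    Hypothetical helper: returns {key: [row, ...]} where each row is a
    list of floats, or the raw string if the line is not all numeric.
    """
    report = {}
    key = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith('#'):
            # Split on the first comma only: the comment may contain commas.
            key = line[1:].split(',', 1)[0].strip()
            report[key] = []
        elif key is not None:
            try:
                row = [float(v) for v in line.split(',')]
            except ValueError:
                row = line  # e.g. a date: keep it as a string
            report[key].append(row)
    return report


# Sample lines modelled on the example in the question.
sample = [
    "#wavelength,'nm'\n",
    "400.0,410.0,420.0\n",
    "#date,'time file was written'\n",
    "2012-03-06 13:12:36.694597\n",
]
report = parse_report(sample)
print(report['wavelength'])
print(report['date'])
```

Because the function accepts any iterable of lines, an open file object can be passed to it directly, so the file is consumed line by line rather than being loaded into memory first. Note that a data line with a trailing comma would fail the float conversion and be kept as a string under this sketch.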

Answer 2

Write a Python program to create a file of elements of any data type (mixed data types, i.e. some elements may be of type int, some of type float and some of type string). Split this file into three files, each containing elements of the same data type (i.e. the first file with integers only, the second with floats only and the third with strings only). Take input from the user to create the file.

    # Collect raw entries from the user until they type 'end'.
    with open('MixedFile.txt', 'w') as f:
        while True:
            user = input("Enter Any Data Type Element :: ")
            if user == 'end':
                print('!!!!!!!! EXIT !!!!!!!!!!!!')
                break
            f.write(user + '\n')

    # Read the entries back, split on whitespace.
    with open('MixedFile.txt') as f:
        items = f.read().split()

    # Try int first, then float; anything else is kept as a string.
    with open('StringFile.txt', 'w') as fs, \
         open('FloatFile.txt', 'w') as ff, \
         open('NumberFile.txt', 'w') as fn:
        for item in items:
            try:
                int(item)
                fn.write(item + '\n')
            except ValueError:
                try:
                    float(item)
                    ff.write(item + '\n')
                except ValueError:
                    fs.write(item + '\n')

    print("reading................")
    for name in ('StringFile.txt', 'NumberFile.txt', 'FloatFile.txt'):
        with open(name) as f:
            print(f.read())
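
The int-then-float cascade above can also be isolated into a small function, which makes it easier to reuse and test without any file I/O. This is only a sketch; the name `classify` is a hypothetical helper, not something from the original answer.

```python
def classify(token):
    """Return 'int', 'float' or 'str' for a text token (hypothetical helper)."""
    # Order matters: every int-like string is also a valid float,
    # so int must be tried first.
    for name, convert in (('int', int), ('float', float)):
        try:
            convert(token)
            return name
        except ValueError:
            continue
    return 'str'


tokens = ['42', '3.14', 'hello', '-7', '1e3']
print([classify(t) for t in tokens])
```

One caveat with this approach is that `float` also accepts strings such as 'nan' and 'inf', so those would be classified as floats rather than strings.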

Answer 3

This might be a stupid suggestion, but could you just do an additional check

    if ',' in dataLines[i]:

before adding the line to your data list? Or, if not, write a regular expression to check for a comma-separated list of floating-point numbers?

    \d+(\.\d+)?(,\d+(\.\d+)?)*

might do the trick (allows integers too).
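
As a rough sketch of how such a check could be applied in Python, the snippet below uses `re.fullmatch` so that the whole line has to match. The pattern is an assumed variant of the one above that also tolerates an optional sign and whitespace around each number; the names `NUMBER`, `FLOAT_LIST` and `is_float_list` are hypothetical.

```python
import re

# Assumed pattern: optional sign, digits, optional fractional part,
# with optional whitespace around each number.
NUMBER = r'\s*[-+]?\d+(?:\.\d+)?\s*'
FLOAT_LIST = re.compile(NUMBER + r'(?:,' + NUMBER + r')*')


def is_float_list(line):
    """Return True if the whole line is a comma-separated list of numbers."""
    return FLOAT_LIST.fullmatch(line) is not None


print(is_float_list('400.0,410.0,420.0'))
print(is_float_list('2012-03-06 13:12:36.694597'))
```

This pattern does not recognise scientific notation such as 1e-3, so simply attempting `float()` and catching `ValueError` may be the more forgiving option if the reports can contain such values.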
