Parsing a .txt file in Python into a NumPy array

Question

I have a bunch of .txt files containing metrics in the following format:

|Jaccard: 0.6871114980646424 
|Dice: 0.8145418946558747 
|Volume Similarity: -0.0006615037672849326 
|False Positives: 0.18572742753126772 
|False Negatives: 0.185188604940396

I would like to read all of them (around 700) and store each value in a NumPy array, so that I can compute statistics such as the average Jaccard, average Dice, etc.

How could I do that?

Answer 1

This would be my approach. The result is a dictionary that holds all values of each metric in an array, e.g.

 {"|Jaccard" : array...,
....}

Code might look like this:

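import numpy as np
import os

pathtodir = "filedir"
d = {}
for file in os.listdir(pathtodir):
    # os.listdir returns bare file names, so join them with the directory
    with open(os.path.join(pathtodir, file), "r") as of:
        lines = of.readlines()
    for line in lines:
        if ": " not in line:
            continue  # skip blank lines
        k, v = line.split(": ", 1)
        # convert to float so that the arrays are numeric
        d.setdefault(k, []).append(float(v))

# turn each list of values into a NumPy array
for k in d:
    d[k] = np.array(d[k])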
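The question asks for statistics such as the average Jaccard, which the loop above stops short of. A minimal sketch of that last step follows. The helper name `summarize` and the sample values are assumptions made for illustration, and the conversion through `float()` is there so the sketch works whether the stored values are numbers or strings read straight from the files.

```python
import numpy as np

# Illustrative stand-in for the dictionary d built above (made-up values).
# Values may be floats or raw strings such as "0.5\n".
d = {
    "|Jaccard": np.array(["0.5\n", "0.7\n"]),
    "|Dice": np.array([0.8, 0.6]),
}

def summarize(metrics):
    """Return {metric name: (mean, standard deviation)} for each metric."""
    summary = {}
    for name, values in metrics.items():
        # float() tolerates surrounding whitespace and newlines
        numeric = np.array([float(v) for v in values])
        # drop the leading "|" from the key for readability
        summary[name.lstrip("|")] = (numeric.mean(), numeric.std())
    return summary

summary = summarize(d)
for name, (mean, std) in summary.items():
    print(f"{name}: mean={mean:.4f}, std={std:.4f}")
```

Note that `numeric.std()` is the population standard deviation (NumPy's default, `ddof=0`); pass `ddof=1` for the sample standard deviation.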

Answer 2

You could use genfromtxt() from numpy. See https://numpy.org/doc/1.18/reference/generated/numpy.genfromtxt.html. Use ':' as the delimiter and extract a string followed by a float.

data = np.genfromtxt(path, delimiter=":", dtype='S64,f4')

This parsed the file and produced the following data:

[(b'|Jaccard',  6.8711150e-01) (b'|Dice',  8.1454188e-01)
 (b'|Volume Similarity', -6.6150376e-04)
 (b'|False Positives',  1.8572743e-01)
 (b'|False Negatives',  1.8518861e-01)]
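The call above handles a single file. One way to extend it to all files might look like the sketch below. It assumes every file lists the same metrics in the same order; the function name `load_metric_files` is hypothetical, and the dtype is switched to 'U64,f8' (unicode names, double precision) as a deliberate change from the answer's 'S64,f4'.

```python
import numpy as np

def load_metric_files(paths):
    """Read each metrics file with genfromtxt and stack the values.

    Returns (names, values), where values has one row per file and one
    column per metric. Assumes identical metric order in every file.
    """
    names = None
    rows = []
    for path in paths:
        data = np.genfromtxt(path, delimiter=":", dtype="U64,f8",
                             encoding="utf-8")
        data = np.atleast_1d(data)  # a one-line file yields a 0-d array
        if names is None:
            # unnamed structured fields default to "f0", "f1"
            names = [str(n).strip().lstrip("|") for n in data["f0"]]
        rows.append(data["f1"])
    return names, np.vstack(rows)
```

With `names, values = load_metric_files(paths)`, the per-metric averages are `values.mean(axis=0)`, in the same order as `names`.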

Answer 3

I prefer to open each file and save its content in a pandas.DataFrame. The clear advantage over a numpy.array is that later statistics are easier to compute. Check this code:

import pandas as pd
import os

pathtodir = r'data'  # name of the subfolder where your files are stored
df = pd.DataFrame()
file_count = 0

for file in os.listdir(pathtodir):
    with open(os.path.join(pathtodir, file), 'r') as of:
        lines = of.readlines()
    for line in lines:
        header, value = line.split(':')
        value = float(value.replace(' ','').replace('\n', ''))
        if header not in df.columns:
            df[header] = ''
        df.at[file_count, header] = value
    file_count += 1

for column in df.columns:
    df[column] = df[column].astype(float)

With 4 example files, I get this DataFrame:

print(df.to_string())

    Jaccard      Dice  Volume Similarity  False Positives  False Negatives
0  0.687111  0.814542          -0.000662         0.185727         0.185189
1  0.345211  0.232542          -0.000455         0.678547         0.156752
2  0.623451  0.813345          -0.000625         0.132257         0.345519
3  0.346111  0.223454          -0.000343         0.453727         0.134586

And some statistics on the fly:

print(df.describe())

        Jaccard      Dice  Volume Similarity  False Positives  False Negatives
count  4.000000  4.000000           4.000000         4.000000         4.000000
mean   0.500471  0.520971          -0.000521         0.362565         0.205511
std    0.180639  0.338316           0.000149         0.253291         0.095609
min    0.345211  0.223454          -0.000662         0.132257         0.134586
25%    0.345886  0.230270          -0.000634         0.172360         0.151210
50%    0.484781  0.522944          -0.000540         0.319727         0.170970
75%    0.639366  0.813644          -0.000427         0.509932         0.225271
max    0.687111  0.814542          -0.000343         0.678547         0.345519
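With around 700 files, filling the DataFrame cell by cell can become slow. A possible alternative, sketched here as a variation and not as the answer's own method, is to parse each file into a dictionary and build the DataFrame in one step. The helper names `read_metrics` and `build_frame` are hypothetical, and stripping the leading "|" is an assumption based on the file format shown in the question.

```python
import pandas as pd

def read_metrics(path):
    """Parse one metrics file into a {metric name: float} dictionary."""
    record = {}
    with open(path, "r") as fh:
        for line in fh:
            if ":" not in line:
                continue  # skip blank or malformed lines
            name, value = line.split(":", 1)
            # float() ignores the surrounding spaces and the newline
            record[name.strip().lstrip("|")] = float(value)
    return record

def build_frame(paths):
    """Build a DataFrame with one row per file and one column per metric."""
    return pd.DataFrame([read_metrics(p) for p in paths])
```

After `df = build_frame(paths)`, a single statistic is available directly, for example `df["Jaccard"].mean()`, and `df.describe()` gives the full summary.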
