Parsing a .txt file in Python to a Numpy Array
I have a bunch of .txt files with metrics in the following format:
|Jaccard: 0.6871114980646424
|Dice: 0.8145418946558747
|Volume Similarity: -0.0006615037672849326
|False Positives: 0.18572742753126772
|False Negatives: 0.185188604940396
I would like to read them all (around 700) and store each value in a numpy array, so I can get statistics like average Jaccard, average Dice, etc.
How could I do that?
This would be my approach. The result is a dictionary that holds all values of each metric in an array, e.g.
{"|Jaccard" : array...,
....}
Code might look like this:
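import os

import numpy as np

pathtodir = "filedir"  # directory containing the .txt files
d = {}
for file in os.listdir(pathtodir):
    # os.listdir returns bare file names, so join them with the directory
    with open(os.path.join(pathtodir, file), "r") as of:
        for line in of:
            if ": " not in line:
                continue  # skip blank or malformed lines
            k, v = line.split(": ", 1)
            # convert to float so the arrays are numeric, not strings
            d.setdefault(k, []).append(float(v))
# turn each list of values into a numpy array
for k in d:
    d[k] = np.array(d[k])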
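Once you have a dictionary of that shape, the averages asked for in the question are one call per metric. This is only a sketch: `summarize` is a made-up helper name, and the sample values are the Jaccard and Dice numbers from the example table further down this page, typed in by hand.

```python
import numpy as np


def summarize(metrics):
    """Map each metric name to its mean and (population) standard deviation."""
    summary = {}
    for name, values in metrics.items():
        # dtype=float also coerces clean numeric strings, in case the
        # values were stored as text
        arr = np.asarray(values, dtype=float)
        summary[name] = {"mean": float(arr.mean()), "std": float(arr.std())}
    return summary


# Stand-in for the dictionary built by the loop above (illustrative values)
d = {
    "|Jaccard": np.array([0.687111, 0.345211, 0.623451, 0.346111]),
    "|Dice": np.array([0.814542, 0.232542, 0.813345, 0.223454]),
}

stats = summarize(d)
print(stats["|Jaccard"]["mean"])
```

Note that `np.std` defaults to the population standard deviation (`ddof=0`); pass `ddof=1` if you want the sample standard deviation instead.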
You could use genfromtxt() from numpy. See https://numpy.org/doc/1.18/reference/generated/numpy.genfromtxt.html . Use ':' as the delimiter and extract a string followed by a float.
data = np.genfromtxt(path, delimiter=":", dtype='S64,f4')
This parsed the file and produced the following data:
[(b'|Jaccard', 6.8711150e-01) (b'|Dice', 8.1454188e-01)
(b'|Volume Similarity', -6.6150376e-04)
(b'|False Positives', 1.8572743e-01)
(b'|False Negatives', 1.8518861e-01)]
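To go from one file to all of them, you could wrap the same genfromtxt() call in a loop and collect the float field per metric name. A rough sketch, with a few assumptions of mine: `load_metrics` is a hypothetical helper, I use 'f8' instead of 'f4' to keep double precision, and I rely on the default field names 'f0' and 'f1' that numpy assigns when the dtype has no names.

```python
import numpy as np


def load_metrics(paths):
    """Return {metric name: 1-D array}, with one value per file."""
    collected = {}
    for path in paths:
        data = np.genfromtxt(path, delimiter=":", dtype="S64,f8")
        # atleast_1d guards against files that contain a single line
        data = np.atleast_1d(data)
        for name, value in zip(data["f0"], data["f1"]):
            # the first field is read as bytes; decode it for readable keys
            key = name.decode() if isinstance(name, bytes) else str(name)
            collected.setdefault(key.strip(), []).append(float(value))
    return {key: np.array(values) for key, values in collected.items()}
```

You would call it with a list of paths, for example the result of `glob.glob` on your folder, and then take `.mean()` of whichever metric you need.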
I prefer to open each file and save its content in a pandas.DataFrame. The clear advantage over a numpy.array is that it is easier to compute statistics later. Check this code:
import os

import pandas as pd

pathtodir = r'data'  # name of the subfolder where your files are stored
df = pd.DataFrame()
file_count = 0
for file in os.listdir(pathtodir):
    with open(os.path.join(pathtodir, file), 'r') as of:
        for line in of:
            if ':' not in line:
                continue  # skip blank or malformed lines
            header, value = line.split(':', 1)
            value = float(value.strip())
            # create the column the first time this metric is seen
            if header not in df.columns:
                df[header] = ''
            # one row per file
            df.at[file_count, header] = value
    file_count += 1
# the columns were created holding strings, so convert them to float
for column in df.columns:
    df[column] = df[column].astype(float)
With 4 example files, I get this dataframe:
print(df.to_string())
    Jaccard      Dice  Volume Similarity  False Positives  False Negatives
0  0.687111  0.814542          -0.000662         0.185727         0.185189
1  0.345211  0.232542          -0.000455         0.678547         0.156752
2  0.623451  0.813345          -0.000625         0.132257         0.345519
3  0.346111  0.223454          -0.000343         0.453727         0.134586
And some statistics on the fly:
print(df.describe())
        Jaccard      Dice  Volume Similarity  False Positives  False Negatives
count  4.000000  4.000000           4.000000         4.000000         4.000000
mean   0.500471  0.520971          -0.000521         0.362565         0.205511
std    0.180639  0.338316           0.000149         0.253291         0.095609
min    0.345211  0.223454          -0.000662         0.132257         0.134586
25%    0.345886  0.230270          -0.000634         0.172360         0.151210
50%    0.484781  0.522944          -0.000540         0.319727         0.170970
75%    0.639366  0.813644          -0.000427         0.509932         0.225271
max    0.687111  0.814542          -0.000343         0.678547         0.345519
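If you only need a couple of numbers, or you still want a plain numpy array at the end, something along these lines should do. It is a small sketch: the frame below is just two columns of the example table above typed in by hand, not the result of reading real files.

```python
import pandas as pd

# Two columns copied from the example table above
df = pd.DataFrame({
    "Jaccard": [0.687111, 0.345211, 0.623451, 0.346111],
    "Dice": [0.814542, 0.232542, 0.813345, 0.223454],
})

means = df.mean()                   # Series indexed by column name
summary = df.agg(["mean", "std"])   # only the statistics you ask for
jaccard = df["Jaccard"].to_numpy()  # one column as a numpy array

print(means["Jaccard"], means["Dice"])
```

Here `std` is the sample standard deviation, which is why it matches the `std` row of `describe()`.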