Extracting data from multiple TXT files and creating a summary CSV file in Python

Question

I have a folder containing about 50 .txt files with data in the following format.

=== Predictions on test data ===

 inst#     actual  predicted error distribution (OFTd1_OF_Latency)
     1        1:S        2:R   +   0.125,*0.875 (73.84)

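I need to write a program that combines the following: my index number (i), the letter of the true class (R or S), the letter of the predicted class, and each of the distribution predictions (the decimals less than 1.0).

When it is finished, I would like the result to look like the following, but preferably as a .csv file.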

ID   True   Pred   S      R
1    S      R      0.125  0.875
2    R      R      0.105  0.895
3    S      S      0.945  0.055
.    .      .      .      .
.    .      .      .      .
.    .      .      .      .
n    S      S      0.900  0.100

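I'm a beginner and a bit fuzzy on how to parse all of this and then concatenate and append it. Here is what I was thinking, but feel free to suggest another direction if it would be easier.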

for i in range(1, n):
   s = str(i)
   readin = open('mydata/output/output'+s+'out','r')
   #The files are all named the same but with different numbers associated
   output = open("mydata/summary.csv", "a")
   storage = []
   for line in readin:
     #data extraction/concatenation here
     if line.startswith('1'):
        id = i
        true = # split at the ':' and take the letter after it
        pred = # split at the second ':' and take the letter after it
         #some have error '+'s and some don't so I'm not exactly sure what to do to get the distributions
        ds = # split at the ',' and take the string of 5 digits before it
        if pred == 'R':
           dr = #skip the character after the comma but take the have characters after
        else: 
           #take the five characters after the comma
        lineholder = id+' , '+true+' , '+pred+' , '+ds+' , '+dr
     else: continue
   output.write(lineholder)

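I think using indexing would be another option, but it could complicate things if the spacing is off in any of the files, and I haven't determined yet whether it is.

Thanks for your help!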

Answer 1

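First, if you want to work with CSV, you should use the csv module that ships with Python. More about this module here: https://docs.python.org/2.7/library/csv.html. I won't demonstrate how to use it, because it is quite simple.

As for reading the input data, here is my suggestion for breaking down each line of the data itself. I assume that the values on each data line of the input files are separated by whitespace, and that no value can itself contain whitespace: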
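Since the csv module is not demonstrated in this answer, here is a minimal sketch of writing the layout the question asks for. The rows are copied from the sample table in the question, and the in-memory buffer is only a stand-in for a real file; with an actual file in Python 3 you would open it with newline='' so the writer controls the line endings.

```python
import csv
import io

# Rows in the layout the question asks for: ID, True, Pred, S, R.
# The values are taken from the sample table in the question.
rows = [
    (1, "S", "R", "0.125", "0.875"),
    (2, "R", "R", "0.105", "0.895"),
]

# An in-memory buffer stands in for open("mydata/summary.csv", "w", newline="").
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["ID", "True", "Pred", "S", "R"])  # header row
writer.writerows(rows)                             # one CSV line per tuple

summary_text = buffer.getvalue()
print(summary_text)
```

The writer takes care of the separators (and of quoting, should a value ever contain a comma), so there is no need to build each line with string concatenation.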

def process_line(id_, line):
    pieces = line.split()  # list of whitespace-separated values
    true = pieces[1].split(':')[1]  # letter after the ':' in the "actual" column
    pred = pieces[2].split(':')[1]  # letter after the ':' in the "predicted" column
    if len(pieces) == 6:  # there was an error, so the '+' is present
        p4 = pieces[4]
    else:  # no '+', only spaces
        p4 = pieces[3]
    # p4 looks like '0.125,*0.875': the S value, a comma, then the R value.
    ds, dr = p4.split(',')
    # The '*' marks the predicted class, so skip it on whichever value carries it.
    if pred == 'R':
        dr = dr[1:]
    else:
        ds = ds[1:]
    # str() lets id_ be an integer such as the loop counter.
    return ' , '.join([str(id_), true, pred, ds, dr])

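What I mainly use here is the string split function: https://docs.python.org/2/library/stdtypes.html#str.split. In one place I also use the simple str[1:] syntax to skip the first character of a string (strings are sequences, after all, so we can use this slice syntax).

Keep in mind that my function does not handle any errors, or lines whose format differs from the one you posted as an example. If the values on each line are separated by tabs rather than spaces, you should replace this line: pieces = line.split() with pieces = line.split('\t').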
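To make the splitting behaviour concrete, the sketch below runs split() on the sample line from the question and on two variants. The line without the '+' marker and the tab-separated line are hypothetical, made up for illustration; only the first line comes from the question.

```python
# Sample line from the question (error marker '+' present).
with_error = "     1        1:S        2:R   +   0.125,*0.875 (73.84)"
# Hypothetical line without the '+' marker, invented for this example.
without_error = "     3        1:S        1:S       *0.945,0.055 (73.84)"
# Hypothetical tab-separated variant of the first line.
tabbed = "1\t1:S\t2:R\t+\t0.125,*0.875\t(73.84)"

pieces_with = with_error.split()
pieces_without = without_error.split()
print(len(pieces_with), pieces_with)        # 6 pieces when '+' is present
print(len(pieces_without), pieces_without)  # 5 pieces when it is absent

# With no argument, split() treats any run of whitespace, tabs included,
# as a single separator.
pieces_tabbed = tabbed.split()
# split('\t') splits on tabs only and would keep spaces inside the fields.
pieces_tab_only = tabbed.split("\t")

# Slicing with [1:] drops the leading '*' that marks the predicted class.
starred = pieces_with[4].split(",")[1]
print(starred, "->", starred[1:])
```

Because split() with no argument already treats tabs as whitespace, switching to split('\t') should only matter if some values can contain spaces.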

Answer 2

I think you can pull the pieces apart with the help of the re module and then combine them as strings, like this:

import re

# Read the file once; the with-block closes it automatically.
with open('sample.txt', 'r') as f:
    lines = f.readlines()

# Decimal numbers on each line, e.g. ['0.125', '0.875', '73.84']
strings = [re.findall(r'\d+\.\d+', line) for line in lines]
print(strings)

# index:letter pairs on each line, e.g. ['1:S', '2:R']
num = [re.findall(r'\w+:\w+', line) for line in lines]
print(num)

s = num + strings
print(s)  # [['1:S', '2:R'], ['0.125', '0.875', '73.84']] for a one-line file

This code was written with a single line in mind. You can also use it for multiple lines, but you will need a loop for that.

Contents of sample.txt:

1 1:S 2:R + 0.125,*0.875 (73.84)

2 1:S 2:R + 0.15,*0.85 (69.4)

When you run the code, the result will be: [['1:S', '2:R'], ['1:S', '2:R'], ['0.125', '0.875', '73.84'], ['0.15', '0.85', '69.4']]

Then just concatenate them.
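One way to do that concatenation is sketched below: a loop pairs up the labels and numbers found on each line and builds one row per line. The two input lines are the sample.txt contents shown above; the row layout (instance number, true letter, predicted letter, two probabilities) is an assumption based on the table in the question.

```python
import re

# The two lines from the sample.txt contents shown above.
lines = [
    "1 1:S 2:R + 0.125,*0.875 (73.84)",
    "2 1:S 2:R + 0.15,*0.85 (69.4)",
]

rows = []
for line in lines:
    labels = re.findall(r"\w+:\w+", line)    # e.g. ['1:S', '2:R']
    numbers = re.findall(r"\d+\.\d+", line)  # e.g. ['0.125', '0.875', '73.84']
    inst = line.split()[0]                   # instance number at the start
    # Keep only the letter after the ':' in each label.
    true, pred = (label.split(":")[1] for label in labels)
    # The first two numbers are the distribution; the third is ignored here.
    rows.append([inst, true, pred, numbers[0], numbers[1]])

for row in rows:
    print(",".join(row))
```

The regular expression for numbers skips the '*' marker on its own, so no extra slicing is needed in this variant.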

Answer 3

This uses regular expressions and the csv module.

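import re
import csv

# Python's re module has no POSIX [[:blank:]] class, so \s is used instead.
# Groups: true letter, predicted letter, first and second distribution value.
# An optional '+' and an optional '*' on either value are skipped.
matcher = re.compile(r'\s*1\s+\d+:(\w)\s+\d+:(\w)\s+(?:\+\s+)?\*?([\d.]+),\*?([\d.]+)\s')
filenametemplate = 'mydata/output/output%iout'

# n must be defined before this runs; range(1, n) visits files 1 to n-1.
with open('mydata/summary.csv', 'w', newline='') as summary:
    output = csv.writer(summary)
    for i in range(1, n):
        with open(filenametemplate % i) as infile:
            for line in infile:
                m = matcher.match(line)
                if m:
                    # csv writers provide writerow(), not write().
                    output.writerow([i] + list(m.groups()))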
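To check an approach like this without touching real data, the sketch below writes two small input files into a temporary directory and summarizes them. The summarize function, its parameters, and the second file's data line are hypothetical, introduced only for illustration; the file naming follows the output<number>out pattern used above, and the file number is used as the ID as in the question.

```python
import csv
import re
import tempfile
from pathlib import Path

# One prediction line: inst#, actual, predicted, optional '+', then two
# probabilities, either of which may carry a leading '*'.
LINE_RE = re.compile(
    r"\s*(\d+)\s+\d+:(\w)\s+\d+:(\w)\s+(?:\+\s+)?\*?([\d.]+),\*?([\d.]+)\s"
)


def summarize(input_dir, n_files, summary_path):
    """Write one CSV row per matching line in output1out .. output<n_files>out."""
    with open(summary_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["ID", "True", "Pred", "S", "R"])
        for i in range(1, n_files + 1):
            with open(Path(input_dir) / f"output{i}out") as source:
                for line in source:
                    m = LINE_RE.match(line)
                    if m:
                        # Drop the inst# group and use the file number as ID.
                        writer.writerow([i] + list(m.groups()[1:]))


header = (
    "=== Predictions on test data ===\n\n"
    " inst#     actual  predicted error distribution\n"
)

with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    # First data line is the sample from the question; the second is made up.
    (tmp / "output1out").write_text(
        header + "     1        1:S        2:R   +   0.125,*0.875 (73.84)\n"
    )
    (tmp / "output2out").write_text(
        header + "     1        2:R        2:R       0.105,*0.895 (73.84)\n"
    )
    summarize(tmp, 2, tmp / "summary.csv")
    summary_lines = (tmp / "summary.csv").read_text().splitlines()

print("\n".join(summary_lines))
```

Header and blank lines simply fail to match the pattern and are skipped, so no separate check for them is needed.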


Answer 1
0 Accepted 2014-07-25 09:32:36

Answer 2
0 2014-07-25 10:37:59

Answer 3
0 2014-07-25 10:44:54
