Dividing elements of a list to normalize data in Python

Question

I am trying to write a Python script that normalizes a dataset by dividing every value by the maximum value.

This is the script I have come up with so far:

#!/usr/bin/python

with open("infile") as f:
    cols = [float(row.split("\t")[2]) for row in f.readlines()]
    maxVal = max(cols)
    #print maxVal

    data = []
    with open('infile') as f2:
        for line in f2:                  
            items = line.split() # parse the columns
            tClass, feats, values = items[:3] # parse the columns
            #print items      
            normalizedData = float(values)/float(maxVal)
            #print normalizedData

            with open('outfile', 'wb') as f3:
                output = "\t".join([tClass +"\t"+ feats, str(normalizedData)])
                f3.write(output + "\n")

The goal is to take an input file (three tab-separated columns) such as:

lfr about-kind-of+n+n-the-info-n    3.743562
lfr about+n-a-j+n-a-dream-n 2.544614
lfr about+n-a-j+n-a-film-n  1.290925
lfr about+n-a-j+n-a-j-series-n  2.134124

Find maxVal in the third column: 3.743562 in this case
Divide every value in the third column by maxVal
Output the desired result shown below:

lfr about-kind-of+n+n-the-info-n    1
lfr about+n-a-j+n-a-dream-n 0.67973
lfr about+n-a-j+n-a-film-n  0.34483
lfr about+n-a-j+n-a-j-series-n  0.57007

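However, the current output is just a single value, which I assume is the first value in the input divided by the maximum. Any insight into what is wrong with my code: why is only one line written? Any ideas for a fix? Thanks in advance.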
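
For reference, the three steps listed above amount to the following when the values are already in memory. This is only a minimal sketch: `normalize` is a hypothetical helper name that does not appear in the script, and it assumes a non-empty list whose maximum is nonzero.

```python
def normalize(values):
    """Divide every value by the largest one (hypothetical helper)."""
    max_val = max(values)  # assumes a non-empty list with a nonzero maximum
    return [v / max_val for v in values]


# Third-column values from the sample input above.
sample = [3.743562, 2.544614, 1.290925, 2.134124]
normalized = normalize(sample)
print(normalized[0])  # the maximum itself maps to 1.0
```

The remaining question is only how to apply this to each row of the file and write every row out.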

Answer 1

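You need to open the output file once and keep writing to it as you process the input lines. It also gets easier if you use the csv module to handle both input and output: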

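import csv

# Python 3: csv files are opened in text mode with newline=''.
# First pass: find the maximum of the third column.
with open('infile', newline='') as inf:
    reader = csv.reader(inf, delimiter='\t')
    maxVal = max(float(row[2]) for row in reader)

# Second pass: open the output once, in write mode, and write every row.
with open('infile', newline='') as inf, open('outfile', 'w', newline='') as outf:
    reader = csv.reader(inf, delimiter='\t')
    writer = csv.writer(outf, delimiter='\t')
    for row in reader:
        tClass, feats, values = row[:3]

        normalizedData = float(values) / maxVal

        # write the normalized value, not the raw one
        writer.writerow([tClass, feats, normalizedData])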
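
The advice to open the output file once can be checked with a small experiment. The sketch below assumes nothing about the real data: it uses a scratch file in a temporary directory (a hypothetical path) and shows that reopening a file in "w" mode inside a loop truncates it each time, which is why the question's script leaves a single line behind.

```python
import os
import tempfile

# Hypothetical scratch file, used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "outfile")

# Reopening inside the loop: each open(..., "w") truncates the file.
for word in ["a", "b", "c"]:
    with open(path, "w") as out:
        out.write(word + "\n")
with open(path) as f:
    reopened = f.read()

# Opening once: every line written during the loop is kept.
with open(path, "w") as out:
    for word in ["a", "b", "c"]:
        out.write(word + "\n")
with open(path) as f:
    opened_once = f.read()

print(repr(reopened))     # only the last line survives
print(repr(opened_once))  # all three lines
```

Note that the surviving line is the one written last, not the first one.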

Answer 2

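As I understand your intent, the following does the job. (Minor corrections to the program flow.)

Also, instead of writing to the file continuously, I chose to store what needs to be written and then dump everything into the output file.

Update: it turns out that creating the list took about as long as the extra with statement did, so I got rid of it completely. The code now writes to the file continuously, without closing it every time.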

with open("in.txt") as f:
    cols = [float(row.split()[2]) for row in f]
    maxVal = max(cols)

# Open the output once; it stays open for the whole loop.
with open('in.txt') as f2, open('out.txt', 'w') as f3:
    for line in f2:
        items = line.split() # parse the columns
        tClass, feats, values = items[:3]
        normalizedData = float(values)/maxVal

        # join the fields first, then append the newline (no trailing tab)
        f3.write("\t".join([tClass, feats, str(normalizedData)]) + "\n")
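
The earlier variant this answer mentions, storing everything and dumping it in one go, might look roughly like the sketch below. It is not the answerer's original code: the scratch paths and the two sample rows are made up for illustration, and it assumes the whole file fits in memory.

```python
import os
import tempfile

# Hypothetical scratch files holding two rows of the sample data.
workdir = tempfile.mkdtemp()
in_path = os.path.join(workdir, "in.txt")
out_path = os.path.join(workdir, "out.txt")
with open(in_path, "w") as f:
    f.write("lfr\tabout-kind-of+n+n-the-info-n\t3.743562\n")
    f.write("lfr\tabout+n-a-j+n-a-film-n\t1.290925\n")

# Read and split every non-empty row once.
with open(in_path) as f:
    rows = [line.split() for line in f if line.strip()]
max_val = max(float(row[2]) for row in rows)

# Build every output line first, then write them all with one call.
lines = [
    "\t".join([row[0], row[1], str(float(row[2]) / max_val)]) + "\n"
    for row in rows
]
with open(out_path, "w") as out:
    out.writelines(lines)

with open(out_path) as f:
    result_lines = f.readlines()
print(result_lines[0], end="")
```

Keeping the rows in a list also avoids reading the input file a second time.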

Answer 3

#!/usr/bin/python

with open("lfr") as f:
    cols = [float(row.split("\t")[2]) for row in f.readlines()]
    maxVal = max(cols)
    #print maxVal

    data = []
    output1 = ''
    with open('lfr') as f2:
        for line in f2:                  
            items = line.split() # parse the columns
            tClass, feats, values = items[:3] # parse the columns
            #print items      
            normalizedData = float(values)/float(maxVal)
            output1 += tClass + "\t" + feats + "\t" + str(normalizedData) + "\n"

            with open('outfile', 'wb') as f3:
                output = output1
                f3.write(output + "\n")

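I have kept working on this as well; it seems I was not building the output variable by appending the result of each loop iteration. However, it seems a bit slow (about 2 seconds to process a 4 MB file). Can it be optimized?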
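
A likely cause of the slowdown is that the script above reopens outfile and rewrites the entire accumulated string on every iteration. The sketch below does not touch any file; it only counts characters, using a made-up row repeated 1000 times, to compare rewriting inside the loop with writing once after it.

```python
# Made-up data: 1000 identical rows, only to illustrate the growth.
lines = ["lfr\tfeat\t0.5\n"] * 1000

# Rewriting the accumulated string on every iteration (as in the script above).
chars_rewritten = 0
output = ""
for line in lines:
    output += line
    chars_rewritten += len(output)  # roughly what f3.write(output) emits each time

# Writing the accumulated string a single time, after the loop.
chars_once = len(output)

print(chars_rewritten, chars_once)
```

The first count grows quadratically with the number of rows and the second linearly, so moving the `with open('outfile', ...)` block out of the loop (dedenting it to run once after the loop) should remove most of the extra work.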

Answer 1
1 2013-12-01 10:28:18

Answer 2
1 accepted 2013-12-01 10:33:39

Answer 3
0 2013-12-01 10:48:05
