Structuring Python code for data analysis

Question

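I wrote code for a data analysis project, but it has become unwieldy, and I would like to find a better way to structure it so I can share it with others.

For brevity, I have something like the following: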

def process_raw_text(txt_file):
    # do stuff
    return token_text

def tag_text(token_text):
    # do stuff
    return tagged

def bio_tag(tagged):
    # do stuff
    return bio_tagged

def restructure(bio_tagged):
    # do stuff
    return(restructured)

print(restructured)

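Basically, I want the program to run all of the functions in order and print the output.

While researching ways to structure this, I read about classes like the following: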

class Calculator():

    def add(x, y):
        return x + y

    def subtract(x, y):
        return x - y

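This seems useful for structuring a project so that individual functions can be called separately, such as the add function via Calculator.add(x, y), but I'm not sure it is what I want.

Is there something I should look into for functions that run sequentially (to structure the data flow and improve readability)? Ideally, I would like all the functions to live inside "something I can call once" that then runs everything in it in order.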

Answer 1

Chain the functions together so that the output of each one becomes the input of the next:

def main():
    print(restructure(bio_tag(tag_text(process_raw_text(txt_file)))))

if __name__ == '__main__':
    main()

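@SvenMarnach made a good suggestion. A more general solution is to recognize that repeatedly using one output as the input to the next item in a sequence is exactly what the reduce function does. We want to start from some input, txt_file: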

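from functools import reduce

def main():
    pipeline = [process_raw_text, tag_text, bio_tag, restructure]
    # Each function is called on the result of the previous one
    print(reduce(lambda data, func: func(data), pipeline, txt_file))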
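
As a minimal, self-contained sketch of the reduce approach: the stage bodies below are hypothetical stand-ins (whitespace tokenizing, dummy tags) that take a string rather than a file, and the helper name run_pipeline is an assumption for illustration, not part of the original answer.

```python
from functools import reduce


# Hypothetical stand-ins for the real stages; each consumes the previous output.
def process_raw_text(text):
    return text.split()


def tag_text(tokens):
    return [(token, "X") for token in tokens]


def bio_tag(tagged):
    return [(token, tag, "O") for token, tag in tagged]


def restructure(bio_tagged):
    return {"tokens": [item[0] for item in bio_tagged]}


def run_pipeline(stages, initial):
    # Feed the result of each stage into the next one, left to right.
    return reduce(lambda data, stage: stage(data), stages, initial)


pipeline = [process_raw_text, tag_text, bio_tag, restructure]
result = run_pipeline(pipeline, "a small example")
print(result)
```

Keeping the stages in a list makes it easy to reorder, insert, or drop a step without touching the code that runs them.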

Answer 2

You can implement a simple dynamic pipeline using nothing but modules and functions.

my_module.py

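# Python identifiers cannot start with a digit, so the order goes after a prefix
def step_01_process_raw_text(txt_file):
    # do stuff
    return token_text

def step_02_tag_text(token_text):
    # do stuff
    return tagged

my_runner.py

import re

import my_module

if __name__ == '__main__':
    # Pick out the numbered step functions; sorting by name gives the run order
    funcs = sorted(name for name in vars(my_module) if re.match(r'step_\d+_', name))

    data = initial_data

    for name in funcs:
        data = getattr(my_module, name)(data)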
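
The discovery idea can be tried out in a single file. This is only a sketch under assumptions: it uses a hypothetical step_NN_ naming convention (identifiers cannot begin with a digit), builds a throwaway module object named steps in place of my_module, and sorts on the parsed number so that step 10 does not run before step 2.

```python
import re
import types

# Hypothetical stand-in for my_module, built in place so the sketch is one file.
steps = types.ModuleType("steps")
steps.step_02_upper = lambda data: data.upper()
steps.step_01_strip = lambda data: data.strip()
steps.helper = lambda data: data  # no step_NN_ prefix, so it is ignored


def discover_steps(module):
    """Return the module's step functions ordered by their numeric prefix."""
    pattern = re.compile(r"step_(\d+)_")
    found = []
    for name, obj in vars(module).items():
        match = pattern.match(name)
        if match and callable(obj):
            found.append((int(match.group(1)), obj))
    return [func for _, func in sorted(found, key=lambda pair: pair[0])]


def run(module, data):
    # Thread the data through every discovered step in order.
    for func in discover_steps(module):
        data = func(data)
    return data


result = run(steps, "  hello  ")
print(result)
```

Name-based discovery is convenient but implicit; an explicit list of functions is often easier for collaborators to follow.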

Answer 3

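Nothing stops you from creating a class (or a set of classes) that represents what you want to manage, with an implementation that calls the required functions in sequence.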

class DataAnalyzer():
    # ...
    def your_method(self, **kwargs):
        # call sequentially, or use the 'magic' proposed by others
        # but internally to your class and not visible to clients
        pass

The functions themselves can stay private to the module, since they look like implementation details.
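
One way this could look, as a rough sketch rather than the answerer's actual design: the run method, the stages argument, and the underscore-prefixed helpers _tokenize and _count are all hypothetical names chosen for illustration.

```python
# Module-private helpers (leading underscore marks them as implementation details).
def _tokenize(text):
    return text.split()


def _count(tokens):
    return len(tokens)


class DataAnalyzer:
    """Hypothetical facade that hides the individual stages from clients."""

    def __init__(self, stages=(_tokenize, _count)):
        self._stages = list(stages)

    def run(self, data):
        # Call each stage in order, passing along the intermediate result.
        for stage in self._stages:
            data = stage(data)
        return data


analyzer = DataAnalyzer()
result = analyzer.run("one two three")
print(result)
```

Clients only ever call run, so the stages can be renamed or rearranged without breaking code that depends on the class.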

3 answers

Answer 1: score 2, accepted, 2015-08-21 11:33:56

Answer 2: score 1, 2015-08-21 11:30:20

Answer 3: score 1, 2015-08-21 11:36:23
