
What is the most Pythonic way to modify the function of a function?

I have a function I am using to read in files of a particular format. My function looks like this:

import csv
from collections import namedtuple

def read_file(f, name, header=True):
    with open(f, mode="r") as infile:
        reader = csv.reader(infile, delimiter="\t")
        if header is True:
            next(reader)
        gene_data = namedtuple("Data", 'id, name, q, start, end, sym')
        for row in reader:
            row = gene_data(*row)
            yield row

I also have another type of file that I would like to read in with this function. However, the other file type needs a few slight parsing steps before I can use the read_file function. For example, trailing periods need to be stripped from column q and the characters atr need to be appended to the id column. Obviously, I could create a new function, or add some optional arguments to the existing function, but is there a simple way to modify this function so that it can be used to read in additional file type(s)? I was thinking of something along the lines of a decorator?
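Since the question asks about a decorator, here is one possible sketch of that approach (the names transform_records and read_special_file are made up for illustration): a decorator factory wraps the generator function and applies a per-record fix-up to everything it yields, using the namedtuple's _replace method.

```python
import csv
from functools import wraps
from collections import namedtuple

Data = namedtuple("Data", "id, name, q, start, end, sym")

def transform_records(transform):
    # Decorator factory: wraps a generator function and applies
    # `transform` to every record the wrapped generator yields.
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for record in func(*args, **kwargs):
                yield transform(record)
        return wrapper
    return decorator

def read_file(f, header=True):
    with open(f) as infile:
        reader = csv.reader(infile, delimiter="\t")
        if header:
            next(reader)
        for row in reader:
            yield Data(*row)

# Second file type: the same reader, plus the per-record fix-ups
# the question describes (append 'atr' to id, strip periods from q).
read_special_file = transform_records(
    lambda rec: rec._replace(id=rec.id + "atr", q=rec.q.rstrip("."))
)(read_file)
```

The plain read_file stays untouched, and each new file type costs one decorated alias rather than a fork of the parsing logic.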

IMHO, the most Pythonic way is to turn the function into a base class, split the file operations into methods, and override those methods in new classes derived from the base class.

Having such a monolithic function that takes a filename instead of an open file is in itself not very Pythonic. You are trying to implement a stream processor here (file stream -> line stream -> CSV record stream -> [transformer ->] data stream), so using a generator is actually a good idea. I'd slightly refactor this to be a bit more modular:

import csv
from collections import namedtuple

def csv_rows(infile, header):
    reader = csv.reader(infile, delimiter="\t")
    if header: next(reader)
    return reader

def data_sets(infile, header):
    gene_data = namedtuple("Data", 'id, name, q, start, end, sym')
    for row in csv_rows(infile, header):
        yield gene_data(*row)

def read_file_type1(infile, header=True):
    # for this file type, we only need to pass the caller the raw 
    # data objects
    return data_sets(infile, header)

def read_file_type2(infile, header=True):
    # for this file type, we have to pre-process the data sets 
    # before yielding them. A good way to express this is using a
    # generator expression (we could also add a filtering condition here)
    return (transform_data_set(x) for x in data_sets(infile, header))

# Usage sample:
with open("...", "r") as f:
    for obj in read_file_type1(f):
        print(obj)
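transform_data_set is left undefined above; for the preprocessing the question describes, it might look like this (a sketch using the namedtuple's _replace method):

```python
from collections import namedtuple

Data = namedtuple("Data", "id, name, q, start, end, sym")

def transform_data_set(rec):
    # Fix up one record for the second file type: append 'atr' to the
    # id column and strip trailing periods from the q column.
    return rec._replace(id=rec.id + "atr", q=rec.q.rstrip("."))
```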

As you can see, we have to pass the header argument all the way through the function chain. This is a strong hint that an object-oriented approach would be appropriate here. The fact that we obviously face a hierarchical type structure here (basic data file, type1, type2) supports this.

I suggest you create a row iterator like the following:

with MyFile('f') as f:
    for entry in f:
        foo(entry)

You can do this by implementing a class for your own files with the following traits: it is a context manager (__enter__/__exit__, so it works in a with statement) and an iterator (__iter__, so you can loop over its entries).

Next to it, you may create a function open_my_file(filename) that determines the file type and returns an appropriate file object to work with. This might be a slightly enterprise-style approach, but it is worth implementing if you're dealing with multiple file types.
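A sketch of that dispatcher might look like the following. The extension rule and the two reader classes here are placeholder assumptions; in practice you would return whatever reader objects you implement (sniffing the first bytes of the file would work too).

```python
class PlainFileReader:
    # Stand-in for a reader class with the traits described above.
    def __init__(self, filename):
        self.filename = filename

class SpecialFileReader(PlainFileReader):
    # Stand-in for the variant that pre-processes each row.
    pass

def open_my_file(filename):
    # Pick a reader type by file name; the ".special.tsv" suffix
    # is an assumed convention for this sketch.
    if filename.endswith(".special.tsv"):
        return SpecialFileReader(filename)
    return PlainFileReader(filename)
```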

The object-oriented way would be this:

class GeneDataReader:

    _GeneData = namedtuple('GeneData', 'id, name, q, start, end, sym')

    def __init__(self, filename, has_header=True):
        self._ignore_1st_row = has_header
        self._filename = filename        

    def __iter__(self):
        for row in self._tsv_by_row():
            yield self._GeneData(*self.preprocess_row(row))

    def _tsv_by_row(self):
        with open(self._filename, 'r') as f:
            reader = csv.reader(f, delimiter='\t')
            if self._ignore_1st_row: 
                next(reader)
            for row in reader:
                yield row 

    def preprocess_row(self, row):
        # does nothing.  override in derived classes
        return row

class SpecializedGeneDataReader(GeneDataReader):

    def preprocess_row(self, row):
        row[0] += 'atr'
        row[2] = row[2].rstrip('.')
        return row    

The simplest way would be to modify your currently working code with an extra argument.

def read_file(name, is_special=False, has_header=True):
    with open(name,'r') as infile:
        reader = csv.reader(infile, delimiter='\t')
        if has_header:
            next(reader)
        Data = namedtuple("Data", 'id, name, q, start, end, sym')
        for row in reader:
            if is_special:
                row[0] += 'atr'
                row[2] = row[2].rstrip('.')
            row = Data(*row)
            yield row

If you are looking for something less nested but still procedure based:

def tsv_by_row(name, has_header=True):
    with open(name, 'r') as infile:
        reader = csv.reader(infile, delimiter='\t')
        if has_header: next(reader)
        for row in reader:
            yield row

GeneData = namedtuple('GeneData', 'id, name, q, start, end, sym')

def gene_data_from_vanilla_file(name, has_header=True):
    for row in tsv_by_row(name, has_header):
        yield GeneData(*row)

def gene_data_from_special_file(name, has_header=True):
    for row in tsv_by_row(name, has_header):
        row[0] += 'atr'
        row[2] = row[2].rstrip('.')
        yield GeneData(*row)

How about passing a callback function to read_file()?

In the spirit of Niklas B.'s answer:

import csv, functools
from collections import namedtuple

def consumer(func):
    @functools.wraps(func)
    def start(*args, **kwargs):
        g = func(*args, **kwargs)
        next(g)
        return g
    return start

def csv_rows(infile, header, dest):
    reader = csv.reader(infile, delimiter='\t')
    if header: next(reader)
    for line in reader:
        dest.send(line)

@consumer
def data_sets(dest):
    gene_data = namedtuple("Data", 'id, name, q, start, end, sym')
    while 1:
        row = (yield)
        dest.send(gene_data(*row))

def read_file_1(fn, header=True):
    results, sink = getsink()
    csv_rows(fn, header, data_sets(sink))
    return results

def getsink():
    r = []
    @consumer
    def _sink():
        while 1:
            x = (yield)
            r.append(x)
    return (r, _sink())

@consumer
def transform_data_sets(dest):
    while True:
        data = (yield)
        dest.send(data[::-1]) # or whatever

def read_file_2(fn, header=True):
    results, sink = getsink()
    csv_rows(fn, header, data_sets(transform_data_sets(sink)))
    return results
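The machinery above hinges on the @consumer priming trick: a coroutine must be advanced to its first yield before it can accept send(). A minimal, self-contained demo of just that pattern (Python 3 syntax; collector is a made-up sink for illustration):

```python
import functools

def consumer(func):
    @functools.wraps(func)
    def start(*args, **kwargs):
        g = func(*args, **kwargs)
        next(g)  # prime the coroutine so it can accept send()
        return g
    return start

@consumer
def collector(out):
    # Simplest possible sink: append every value sent in to a list.
    while True:
        item = (yield)
        out.append(item)

results = []
sink = collector(results)
for value in ("a", "b", "c"):
    sink.send(value)
```

Without the decorator, the first sink.send(...) would raise TypeError: can't send non-None value to a just-started generator.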
