How do you test more complicated functions?

Question

I'm a total amateur/hobbyist developer trying to learn more about testing the software I write. While I understand the core concept of testing, as the functions get more complicated, I feel as though it's a rabbit hole of varations, outcomes, conditions etc. For example...

The function below reads files from a directory into a Pandas DataFrame. A few columns adjustments are made before the data is passed to a different function that ultimately imports the data to our database.

I've already coded a test for the convert_date_string function. But what about this entire function as as whole - how do I write a test for it? In my mind, much of the Pandas library is already tested - thus making sure core functionality there works with my setup seems like a waste. But, maybe it isn't. Or, maybe this is a refactoring question to break this down into smaller parts?

Anyway, here is the code... any insight would be appreciated!

def process_file(import_id=None):
    all_files = glob.glob(config.IMPORT_DIRECTORY + "*.txt")

    if len(all_files) == 0:
        return []

    import_data = (pd.read_csv(f, sep='~', encoding='latin-1',
                               warn_bad_lines=True, error_bad_lines=False,
                               low_memory=False) for f in all_files)

    data = pd.concat(import_data, ignore_index=True, sort=False)
    data.columns = [col.lower() for col in data.columns]
    data = data.where((pd.notnull(data)), None)

    data['import_id'] = import_id
    data['date'] = data['date'].apply(lambda x: convert_date_string(x))

    insert_data_into_database(data=data, table='sales')
    return all_files

Answer 1

In general, I wouldn't go down the road of testing pandas or any other dependencies. The way I see it, it is important to make sure that a package that i use is well developed and well supported, then making tests for it will be redundant. Pandas is a very well supported package.

As to your question about the specific function and interest in testing in general, I will highly recommend checking out the Hypothesis python package (you'r in luck - its currently only for python). It provides mock data and generates edge cases for testing purposes.

an example from their docs:

from hypothesis import given
from hypothesis.strategies import text

@given(text())
def test_decode_inverts_encode(s):
    assert decode(encode(s)) == s

here you tell it that the function needs to receive text as input, and the package will run it multiple times with different variables that answer the criteria. It will also try all kind of of edge cases.

It can do much more once implemented.

Answer 2

There are mainly two kind of tests - proper unit tests, and integration tests.

Unit tests, as the name implies, test "units" of your program (functions, classes...) in isolation (without considering how they interact with other units). This of course require those units can be tested in isolation. For example, a pure function (a function that compute a results from it's inputs, where the result depends only on the inputs and will always be the same for the same inputs, and which doesn't have any side effect) is very easy to test, while a function that reads data from a hardcoded path on your filesystem, makes http requests to a hardcoded url and updates a database (whose connection data are also hardcoded) is almost impossible to test in isolation (and actually almost impossible to test).

So the first point is to write your code with testability in mind: favour small, focused units with a single clear responsability and as few dependencies as possible (and preferably taking their dependencies as arguments so you can pass a mock instead). This is of course a bit of a platonic ideal, but it's a worthy goal still. As a last resort, when you cannot get rid of dependencies or parameterize them, you can use a package like mock that will replace your dependencies with bogus objects having a similar interface.

Integration testing is about testing whole subsystems from a much higher level - for example for a website project, you may want to test that if you submit the "contact" form an email is sent to a given address and that the data are also stored in the database. You obviously want to do so with a disposable test database and a disposable test mailbox.

The function you posted is possibly doing a bit too much - it reads files, builds a panda dataframe, applies some processing, and stores thing in a database. You may want to try and factor it into more functions - one to get the files list, one to collect data from the files, one to process the data etc, you already have the one storing the data in the database - and rewrite your "process_files" (which is actually doing more than processing) to call those functions. This will make it easier to test each part in isolation. Once done with this, you can use mock to test the "process_file" functions and check that it calls the other functions with the expected arguments, or run it against a test directory and a test database and check the results in the database.

How do you test more complicated functions?

Question

2 answers

solution1
0 2018-08-23 14:46:26

solution2
0 ACCPTED 2018-08-23 15:27:25

How do you test more complicated functions?

Question

2 answers

solution1 0 2018-08-23 14:46:26

solution2 0 ACCPTED 2018-08-23 15:27:25

solution1
0 2018-08-23 14:46:26

solution2
0 ACCPTED 2018-08-23 15:27:25