A very brief story about me, Python, and testing...
I like to write tests. Even this slide deck has tests!
I mostly do web development, but I've tried to keep this talk general.
I didn't create these things, but I've done a lot of work on them.
A GitHub recommendation engine!
Find the projects you ought to know about...
It's been done already. Oh well.
def similarity(watched1, watched2):
    """
    Return similarity score for two users.

    Users represented as iterables of watched repo names.

    Score is Jaccard index (intersection / union).

    """
    intersection = 0
    for repo in watched1:
        if repo in watched2:
            intersection += 1
    union = len(watched1) + len(watched2) - intersection
    return intersection / union
Similarity score, 0 to 1.
Jaccard index: intersection over union.
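For example, with watched repos {'a', 'b'} and {'b', 'c', 'd'}: the intersection is {'b'} (size 1) and the union is {'a', 'b', 'c', 'd'} (size 4), so the similarity score is 1/4 = 0.25.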
Terrible implementation: buggy and slow. Software!
Now of course we want to make sure it works!
>>> similarity(['a', 'b'], ['b', 'c', 'd'])
0.25
>>> similarity(['a', 'b', 'c'], ['b', 'c', 'd'])
0.5
>>> similarity(['a', 'b', 'c'], ['d'])
0.0
So far, so good!
But here's a bug...
>>> similarity(['a', 'a', 'b'], ['b'])
0.3333333333333333
The Jaccard index is a set metric, and our bad implementation doesn't handle duplicates correctly: the score should be 1/2, but we got 1/3.
Fortunately, Python's got an excellent built-in set data structure, so let's rewrite to use that instead and fix this bug!
def similarity(watched1, watched2):
    """
    Return similarity score for two users.

    Users represented as iterables of watched repo names.

    Score is Jaccard index (intersection / union).

    """
    watched1, watched2 = set(watched1), set(watched2)
    intersection = watched1.intersection(watched2)
    union = watched1.union(watched2)
    return len(intersection) / len(union)
>>> similarity(['a', 'a', 'b'], ['b'])
0.5
Great, works!
But we totally rewrote it, better make sure we didn't break anything...
>>> similarity({'a', 'b'}, {'b', 'c', 'd'})
0.25
>>> similarity({'a', 'b', 'c'}, {'b', 'c', 'd'})
0.5
>>> similarity({'a', 'b', 'c'}, {'d'})
0.0
All good!
We know how to handle boring repetitive tasks: we write software to automate them!
from gitrecs import similarity

assert similarity({'a', 'b'}, {'b', 'c', 'd'}) == 0.25
assert similarity(['a', 'a'], ['a', 'b']) == 0.5
Better! Easily repeatable tests.
Hmm, another bug.
from gitrecs import similarity

assert similarity({}, {}) == 0.0
assert similarity({'a', 'b'}, {'b', 'c', 'd'}) == 0.25
assert similarity(['a', 'a'], ['a', 'b']) == 0.5
Traceback (most recent call last):
  File "test_gitrecs.py", line 3, in <module>
    assert similarity({}, {}) == 0.0
  File "/home/carljm/gitrecs.py", line 14, in similarity
    return len(intersection) / len(union)
ZeroDivisionError: division by zero
We can fix the bug, but we have a problem with our tests: because the first one failed, none of the others ran.
It'd be better if every test ran every time, pass or fail, so we could get a more complete picture of what's broken and what isn't.
from gitrecs import similarity


def test_empty():
    assert similarity({}, {}) == 0.0

def test_sets():
    assert similarity({'a', 'b'}, {'b', 'c', 'd'}) == 0.25

def test_list_with_dupes():
    assert similarity(['a', 'a'], ['a', 'b']) == 0.5


if __name__ == '__main__':
    for func in test_empty, test_sets, test_list_with_dupes:
        try:
            func()
        except Exception as e:
            print("{} FAILED: {}".format(func.__name__, e))
        else:
            print("{} passed.".format(func.__name__))
Some code to run each test, catch any exceptions, and report whether the test passed or failed.
Fortunately, we don't have to do this ourselves; there are test runners to do this for us!
test_empty FAILED: division by zero
test_sets passed.
test_list_with_dupes passed.
from gitrecs import similarity


def test_empty():
    assert similarity({}, {}) == 0.0

def test_sets():
    assert similarity({'a', 'b'}, {'b', 'c', 'd'}) == 0.25

def test_list_with_dupes():
    assert similarity(['a', 'a'], ['a', 'b']) == 0.5
One of these runners is pytest; we can install it and cut our test file down to just the tests themselves, no test-running boilerplate at all.
$ py.test
=================== test session starts ===================
platform linux -- Python 3.3.0 -- pytest-2.3.4
collected 3 items

test_gitrecs.py F..

======================== FAILURES =========================
_______________________ test_empty ________________________

    def test_empty():
>       assert similarity({}, {}) == 0.0

test_gitrecs.py:4:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

    def similarity(watched1, watched2):
        intersection = watched1.intersection(watched2)
        union = watched1.union(watched2)
>       return len(intersection) / len(union)
E       ZeroDivisionError: division by zero

gitrecs.py:14: ZeroDivisionError
=========== 1 failed, 2 passed in 0.02 seconds ============
Run py.test: it automatically finds our tests (because they live in a file whose name begins with "test", and each test function's name begins with "test") and runs them in isolation, so even if one fails, the rest still run. (It can also run just a single test file or directory.)
It shows us the test file it found, shows a dot for each passed test and an F for each failed one. And we get some nice helpful debugging output around the failure too.
def similarity(watched1, watched2):
    """
    Return similarity score for two users.

    Users represented as iterables of watched repos.

    Score is Jaccard index (intersection / union).

    """
    watched1, watched2 = set(watched1), set(watched2)
    intersection = watched1.intersection(watched2)
    union = watched1.union(watched2)
    if not union:
        return 0.0
    return len(intersection) / len(union)
$ py.test
=================== test session starts ===================
platform linux -- Python 3.3.0 -- pytest-2.3.4
collected 3 items

test_gitrecs.py ...

================ 3 passed in 0.02 seconds =================
Not only can we ship this code with some confidence that it works now, but also some confidence that if we change the implementation in the future and reintroduce any of these bugs, we'll catch it as soon as we run the tests.
A brief synopsis and digression
If all these choices are overwhelming, don't worry about it. They're all fine; just pick one and run with it.
from unittest import TestCase

from gitrecs import similarity


class TestSimilarity(TestCase):
    def test_empty(self):
        score = similarity({}, {})
        self.assertEqual(score, 0.0)

    def test_sets(self):
        score = similarity({'a'}, {'a', 'b'})
        self.assertEqual(score, 0.5)

    def test_list_with_dupes(self):
        score = similarity(['a', 'a'], ['a', 'b'])
        self.assertEqual(score, 0.5)
Note the use of methods on self (assertEqual and friends) rather than simple asserts.
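As a hypothetical extension (this class and these test names are not part of the original test file), the other assert methods follow the same pattern; for example, assertAlmostEqual avoids brittle exact float comparisons, and assertRaises (used as a context manager) checks that an exception is raised:

from unittest import TestCase

from gitrecs import similarity


class TestSimilarityExtras(TestCase):
    def test_float_score(self):
        # assertAlmostEqual compares to 7 decimal places by default.
        score = similarity({'a'}, {'a', 'b', 'c'})
        self.assertAlmostEqual(score, 1 / 3)

    def test_rejects_non_iterable(self):
        # assertRaises as a context manager: set(42) raises TypeError.
        with self.assertRaises(TypeError):
            similarity(42, {'a'})

These run under the standard-library runner (python -m unittest), and pytest will collect TestCase classes too.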
class GithubUser:
    def get_watched_repos(self):
        """Return this user's set of watched repos."""
        # ... GitHub API querying happens here ...


def similarity(user1, user2):
    """Return similarity score for given users."""
    watched1 = user1.get_watched_repos()
    watched2 = user2.get_watched_repos()
    # ... same Jaccard index code ...
You may have been thinking, of course tests are easy to write when you're testing nice simple functions like that similarity function.
Here's a secret: that wasn't the first version of similarity that I wrote. The first version looked more like this.
Imagine writing tests for this similarity function.
class FakeGithubUser:
    def __init__(self, watched):
        self.watched = watched

    def get_watched_repos(self):
        return self.watched


def test_empty():
    assert similarity(
        FakeGithubUser({}), FakeGithubUser({})
    ) == 0.0

def test_sets():
    assert similarity(
        FakeGithubUser({'a'}), FakeGithubUser({'a', 'b'})
    ) == 0.5
We take advantage of duck typing and create a fake replacement for GithubUser that doesn't go out and query the GitHub API; it just returns whatever we tell it to.
This is a fine technique for testing code whose collaborator is critical to its purpose. But whenever you have to do this, ask yourself whether the collaborator is essential to what you want to test, or whether the design of your code is making testing harder than it should be.
In this case, the collaborator is an avoidable distraction. What we really want to test is the similarity calculation; GithubUser is irrelevant. We can extract a similarity function that operates just on sets of repos so it doesn't need to know anything about the GithubUser class, and then our tests become much simpler.
Function (or class, or module - whatever the system under test is)
In this case, similarity is harder to test if it knows about GithubUser, because we have to set up a GithubUser (or a fake one) to feed to it for every test. And it's also more fragile, because if the name of the get_watched_repos method changes, it will break.
It knows more than it needs to know to do its job! By narrowing its vision of the world, we make it both easier to test and easier to maintain.
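A sketch of what that extraction might look like (the wrapper name similarity_for_users is illustrative, not from the original code): the Jaccard calculation operates on plain sets, and a thin wrapper is the only code that knows about GithubUser.

def similarity(watched1, watched2):
    """Return Jaccard similarity of two iterables of repo names."""
    watched1, watched2 = set(watched1), set(watched2)
    union = watched1 | watched2
    if not union:
        return 0.0
    return len(watched1 & watched2) / len(union)


def similarity_for_users(user1, user2):
    """Thin wrapper: the only place that needs to know about GithubUser."""
    return similarity(user1.get_watched_repos(), user2.get_watched_repos())

Now the calculation can be tested with plain sets, and only the one-line wrapper (if we test it at all) needs a fake GithubUser.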
Unit.
Integration.
Making testing a habit instead of a chore.
There's too much religion around this question. Tests will help you no matter when you write them.
With test-first, you formulate clearly what you want your code to do, then you write the code and get the satisfaction of seeing the test pass. The psychology of test-last is all wrong: first you get it working, then writing tests feels like a chore.
def recommend_repos_for(username):
    my_watched = get_watched_repos(username)
    ...


def get_watched_repos(username):
    # stub
    return {'pypa/pip', 'pypa/virtualenv'}
When you need to write a bunch of exploratory code to figure out the problem space.
To write tests, you have to have some idea of the shape of the code you need to write. Sometimes for an unfamiliar problem space you have to write the code first to learn about what kind of shape it needs.
A failing test is always, always step one. Writing it is often part of finding and understanding the bug; if you haven't written a failing test, you don't have the bug identified yet.
Definition of legacy code: code without tests.
--branch: record not only which lines were and weren't executed, but also whether all branches of a conditional were taken.
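To illustrate (a hypothetical helper, not from the talk): the single test below executes every line of normalize_watched, so plain line coverage reports 100%, but the branch where repos is not None (so the if body is skipped) is never exercised, and only --branch reports that.

def normalize_watched(repos):
    """Hypothetical helper: treat None as 'watches nothing'."""
    if repos is None:
        repos = []
    return set(repos)


def test_normalize_none():
    # Executes every line above, but never takes the branch where the
    # `if` condition is false, so --branch flags a partial branch.
    assert normalize_watched(None) == set()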
coverage run is the most flexible way to measure coverage (it can run any Python script); there are also plugins for py.test and nose that integrate coverage more closely with the test runner.
coverage html generates an HTML report.
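Putting these together, a typical command sequence might look roughly like this (assuming pytest is installed so it can be run as a module; exact invocations may differ in your setup):

$ coverage run --branch -m pytest   # run the test suite under coverage
$ coverage report                   # quick text summary in the terminal
$ coverage html                     # detailed report, written to htmlcov/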
100% coverage only tells you that every line of code in your program was executed by your tests. That's a good minimum baseline, but it doesn't tell you whether those lines are actually doing the right thing! It is a useful way to make sure you have at least one test exercising every case your code handles (including error cases).
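For example, a deliberately weak test like this one (hypothetical, not from the talk) can reach 100% coverage of similarity while never checking that the score is right:

def test_sets_weak():
    # Runs every line of similarity(), so coverage is satisfied,
    # but this assertion would not catch a wrong score.
    assert similarity({'a'}, {'a', 'b'}) is not None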
Didn't have time to cover everything in depth, but here are a few more testing tools you should check out:
It's not easy: test code is real code and requires discipline, engineering, and investment to make it correct and maintainable. But it's worth it!