When TDD gets hard

Holly K Cummins
Apr 14, 2021

Test-driven development (TDD) is a core IBM Garage practice. It provides the foundation for the other practices such as continuous delivery, DevOps, and automation.

When you work in a test-driven style, before writing any code you have to figure out what success looks like for the change you are making. Next, you work out how to validate success in an automated way. Then, you implement that automation. Finally, you code the thing.

The actual implementation is almost always the easy part. The first part of the process — figuring out what success looks like — is often the hardest part. Working out how to automate success-checking is the second hardest. Basically, TDD is hard! It needs skill, and it needs practice.

The good news is that TDD rewards the effort. Once you get over the hurdle of working incrementally and writing fine-grained tests (hard), you’ll find the implementation slots into place. Your tests will improve the clarity of your code, help with debugging, support future refactoring, and help prevent regressions.

TDD isn’t just for greenfield development. It’s also a great bug-fixing tool. In the IBM Garage we have an informal rule that we should never fix a bug without first having a failing test. Instead of adding a console.log, write a test. If something is broken, think of the steps you might take to diagnose the issue (“Is the server up? Is it available on this port? Does it respond on this endpoint? Is the response JSON?”) and write tests to answer the questions.
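
For instance, a minimal pytest-and-requests sketch of those diagnostic questions might look like this (the base URL and the /health endpoint are placeholders for whatever your service actually exposes):

```python
# A minimal sketch of the diagnostic questions as tests, using pytest and
# requests. BASE_URL and the /health endpoint are placeholders; substitute
# whatever your service really exposes.
import requests

BASE_URL = "http://localhost:8080"  # assumed local dev server


def test_server_responds_on_port():
    # "Is the server up? Is it available on this port?"
    response = requests.get(f"{BASE_URL}/health", timeout=5)
    assert response.status_code == 200


def test_endpoint_returns_json():
    # "Does it respond on this endpoint? Is the response JSON?"
    response = requests.get(f"{BASE_URL}/health", timeout=5)
    assert response.headers["Content-Type"].startswith("application/json")
    assert response.json() is not None
```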

Does TDD make sense in all contexts?

Did I already mention TDD is hard? Many developers like the idea of TDD in theory, but stumble in practice. It can be hard to make the jump from a contrived example to actual messy code with dependencies and deadlines. I’m not sure we TDD fans offer enough support at the messy end of the spectrum.

Here are some of the areas where I’ve seen devs with good intentions get stuck and fall off the TDD wagon.

TDD is fine for web apps, but this isn’t a web app, it’s AI/infrastructure/deployment of [some product] with [some config]

When you reframe TDD as ‘validating success’ rather than ‘writing a unit test,’ it should become clear that TDD is broadly applicable. Here are questions to ask (and try to turn into “yes”):

  • Can I slice my implementation of this into very small steps that build on each other?
  • How would I know each step worked?
  • Can I automate checking that it worked?

Some of my IBM Garage colleagues have had excellent results using TDD for data science and AI projects. For example, when developing Watson Assistant conversation workflows, it’s a common problem for dialog flows to regress because a new intent or too many examples on a single intent “siphons up” content which is supposed to be handled by other intents. These problems should be detected with automation, in a DevOps pipeline, well before the Assistant workspace gets deployed to live users. When developing, it works well to write down the expected conversation flows in tests first, see the tests fail, then update the Watson data and confirm that the tests pass.
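
As a rough sketch (not the exact tests my colleagues wrote), a conversation-flow regression test using the ibm-watson Python SDK could look something like this; the environment variables, the utterance, and the expected intent name are placeholders:

```python
# A rough sketch of a conversation-flow regression test using the ibm-watson
# Python SDK (Assistant v2 API). The environment variables, utterance, and
# expected intent name are placeholders invented for illustration.
import os

from ibm_watson import AssistantV2
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator


def test_greeting_is_still_routed_to_the_greeting_intent():
    assistant = AssistantV2(
        version="2021-06-14",
        authenticator=IAMAuthenticator(os.environ["WATSON_APIKEY"]),
    )
    assistant.set_service_url(os.environ["WATSON_URL"])
    assistant_id = os.environ["WATSON_ASSISTANT_ID"]

    session_id = assistant.create_session(
        assistant_id=assistant_id
    ).get_result()["session_id"]

    response = assistant.message(
        assistant_id=assistant_id,
        session_id=session_id,
        input={"message_type": "text", "text": "hello there"},
    ).get_result()

    # If a newer intent has "siphoned up" this utterance, the assertion fails
    # long before the change reaches live users.
    top_intent = response["output"]["intents"][0]["intent"]
    assert top_intent == "greeting"
```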

Data models lend themselves well to being exercised in an automated way. Is the model returning a number? Is the number within the range we’d expect? If we’re predicting house prices, and our model says a four-bedroom house with original period features in downtown Toronto costs $100, we should catch that before deploying the model to production! Is the accuracy of the model within acceptable bounds for our test data set?
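
A sketch of what those sanity checks might look like, assuming a hypothetical load_model() helper with a predict() interface; the feature names, price bounds, and accuracy threshold are all invented:

```python
# A minimal sketch of model sanity checks. The load_model() helper, the
# feature names, the price bounds, and the accuracy threshold are all
# hypothetical; substitute whatever your project actually uses.
import pytest

from house_prices.model import load_model  # hypothetical module


@pytest.fixture(scope="module")
def model():
    return load_model()


def test_prediction_is_a_plausible_number(model):
    # Four-bedroom house with original period features in downtown Toronto.
    price = model.predict(
        {"bedrooms": 4, "neighborhood": "downtown_toronto", "period_features": True}
    )
    assert isinstance(price, (int, float))
    assert 100_000 < price < 10_000_000  # catch the $100 house before production


def test_accuracy_is_within_acceptable_bounds(model):
    accuracy = model.accuracy_on_holdout()  # hypothetical evaluation helper
    assert accuracy >= 0.8
```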

My Garage colleagues are also exploring what TDD looks like for infrastructure. Infrastructure is traditionally harder to validate in an automated way, but if infrastructure is code, we need to learn how to test that code.
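
One simple starting point is to test the infrastructure definitions themselves before they are applied. For example, a sketch that checks a Kubernetes deployment manifest against a basic policy (the file path and the policy rules are assumptions made up for illustration):

```python
# A sketch of testing infrastructure-as-code before it is applied: parse a
# Kubernetes deployment manifest and check it against a basic policy. The
# file path and the policy rules are assumptions made up for illustration.
import yaml


def test_deployment_manifest_meets_basic_policy():
    with open("k8s/deployment.yaml") as f:
        deployment = yaml.safe_load(f)

    spec = deployment["spec"]
    assert spec["replicas"] >= 2, "expect at least two replicas for availability"

    for container in spec["template"]["spec"]["containers"]:
        limits = container["resources"]["limits"]
        assert "cpu" in limits and "memory" in limits, "resource limits must be set"
```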

This is untestable

We should try and test everything, but… some frameworks and products, particularly older ones, just aren’t testable. There is no way to validate in an automated way whether they’re working as intended. Sometimes, this is a good opportunity to feed feature requests back to the product team, or to reconsider your choice of technology stack. Other times, we just have to live with it.

You might be able to make some progress using ‘ugly’ validating techniques like screen scraping. If a person can tell something’s working right, can a computer do the same steps? (And if there’s no way for a person to know things are working right, there should definitely be some feedback to product management!)
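
For example, an ‘ugly but useful’ check might fetch a page and look for the same cue a person would (the URL and the marker text below are placeholders):

```python
# An 'ugly but useful' screen-scraping style check: fetch the page and look
# for the same cue a person would. The URL and marker text are placeholders.
import requests


def test_home_page_renders_the_welcome_banner():
    response = requests.get("http://localhost:8080/", timeout=10)
    assert response.status_code == 200
    assert "Welcome" in response.text  # the cue a human tester would look for
```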

On the other hand, we need to be pragmatic; if something isn’t cleanly and quickly verifiable, it might not be worth the expense of wrestling it into a set of tests. I’ve seen pairs spend half a day or a day trying to test something, perhaps writing elaborate mocks, for code that was unlikely to go wrong. Before going down that road, ask yourself some questions: How confident are you that you’ll get things right the first time? How likely is it that a change later on could cause a regression? Can you move the tricky parts of logic out into an easily testable unit, and leave a thin layer of dumb code in the untestable part? TDD is a means to an end, not a goal in itself.
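
A sketch of that last idea: pull the tricky logic into a pure function that is easy to test-drive, and leave only dumb glue in the hard-to-test layer (the pricing rules and the framework request object here are invented):

```python
# A sketch of the split. discount_for() holds the tricky (invented) business
# rules as a pure function that is easy to test-drive; handle_checkout_request()
# is the thin, dumb layer that stays in the hard-to-test framework code.

def discount_for(order_total: float, loyalty_years: int) -> float:
    if order_total <= 0:
        return 0.0
    rate = 0.05 if loyalty_years >= 2 else 0.0
    if order_total > 1000:
        rate += 0.05
    return round(order_total * rate, 2)


def handle_checkout_request(request):
    # Untested glue: just unpacks the (hypothetical) framework request object.
    total = float(request.params["total"])
    years = int(request.params["loyalty_years"])
    return {"discount": discount_for(total, years)}


def test_loyal_customers_get_five_percent():
    assert discount_for(200.0, loyalty_years=3) == 10.0


def test_big_orders_get_an_extra_five_percent():
    assert discount_for(2000.0, loyalty_years=3) == 200.0
```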

We don’t have time for this

Even with a pragmatic approach, expect TDD (or testing at all) to take time. I usually estimate that half the development time is spent on TDD. That sounds like a lot, but it’s less when you factor in that the development itself is faster. Much of the low-level technical design is done in the ‘test’ phase, not during the dev phase, and the tests make debugging not-quite-working code much easier.

TDD can also save a lot of time in manual testing. Often developers end up doing the same manual steps over and over again to validate code as they write it. For example, “let’s load this page and type this in and then click on this link.” Automated tests reduce this toil.
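
A sketch of automating that loop with a browser-automation tool like Playwright; the URL, the selectors, and the expected text are placeholders for whatever your app actually renders:

```python
# A sketch of automating the "load this page, type this in, click this link"
# loop with Playwright. The URL, selectors, and expected text are placeholders.
from playwright.sync_api import sync_playwright


def test_searching_for_widgets_shows_results():
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto("http://localhost:8080/")
        page.fill("#search-box", "blue widgets")
        page.click("#search-button")
        assert "blue widgets" in page.inner_text("#results")
        browser.close()
```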

A solid suite of tests also saves a lot of time hunting down regressions in existing functionality, and it reduces the chances of regressions escaping to the field and causing embarrassment.

Despite all these benefits, if a project is under time pressure testing is often the first thing to go. This can give the illusion of a faster pace, but the ‘saved’ time must be paid back later with interest. The path from ‘feature-complete’ to ‘in-production’ might be a rocky one.

On the other hand, some code will never get to production and was never intended to get there. If code is genuinely ‘throw away,’ investing in automated tests probably isn’t worth it.

We have SonarQube

Ok, so this one isn’t always bad! Some customers may mandate code coverage metrics as part of their development governance. However, be aware that SonarQube is not a mandatory part of software development, and, like other code coverage tooling, it can drive unhelpful behaviors.

Writing the tests before the implementation and working in tiny batches usually ensures good coverage, so experienced TDD practitioners often skip coverage tooling altogether. For teams learning the technique, it can be useful for spotting slip-ups: “oops, looks like we got over-excited and forgot all the tests when we wrote this function, lesson learned.”

It’s important that the coverage metric stays a tool rather than becoming a goal. Otherwise teams spend effort writing low-value tests to boost the coverage numbers, rather than writing meaningful validations of correct behavior. For example, tests that exercise tricky logic with multiple input parameters might be crucial in guiding the implementation to the correct behavior and preventing regressions, even though they don’t change the coverage metrics. Tests for getters and setters can inflate coverage numbers but they have almost no chance of catching regressions.
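
To make that concrete, here is an invented example: the parametrized test exercises the boundary behavior of some tricky pricing logic and would catch real regressions, while the getter-style test at the end mostly inflates the coverage number.

```python
# An invented example. shipping_cost() stands in for genuinely tricky logic;
# the parametrized test pins down its boundary behavior, while the getter-style
# test at the bottom mostly inflates the coverage number.
import pytest


def shipping_cost(weight_kg: float, international: bool) -> float:
    base = 14.99 if international else 4.99
    if weight_kg > 20:                      # heavy-parcel surcharge boundary
        base += 35.0 if international else 15.0
    return round(base, 2)


@pytest.mark.parametrize(
    "weight_kg, international, expected",
    [
        (0.5, False, 4.99),
        (0.5, True, 14.99),
        (25.0, False, 19.99),
        (25.0, True, 49.99),
    ],
)
def test_shipping_cost_boundaries(weight_kg, international, expected):
    # High value: guides the implementation and guards the boundaries.
    assert shipping_cost(weight_kg, international) == expected


class Order:
    def __init__(self, customer_name: str):
        self.customer_name = customer_name


def test_getter():
    # Low value: raises the coverage number, almost never catches a regression.
    assert Order("Ada").customer_name == "Ada"
```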

Things that look like tests but don’t test (also known as “testing the mock”)

When something depends on a complex external library or system, it’s not practical to include that system in the tested unit. Instead, it’s stubbed out with a mock. Mocks can be useful for enabling cheap fine-grained testing, but there are some cautions.

Mocks bake your assumptions about the external service into the mock. Since incorrect assumptions are a large source of bugs, testing against a mock may give a false sense of confidence. For HTTP and message-based external services, consider using a Pact contract test instead, which allows assumptions to be expressed in a contract that can be validated against the external service. Both the consumer and the provider test against the exact same set of assumptions, without the expense of integration testing.
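
As a rough sketch of a consumer-side Pact test in Python (the service names, endpoint, and payload are invented, and the details of starting and stopping the mock service vary between pact-python versions):

```python
# A rough sketch of a consumer-side contract test with pact-python. The
# service names, endpoint, and payload are invented; setup details vary
# between pact-python versions.
import atexit

import requests
from pact import Consumer, Provider

pact = Consumer("OrderWebClient").has_pact_with(Provider("OrderService"), port=1234)
pact.start_service()
atexit.register(pact.stop_service)


def test_fetching_a_shipped_order():
    (pact
     .given("an order with id 1 exists")
     .upon_receiving("a request for order 1")
     .with_request("get", "/orders/1")
     .will_respond_with(200, body={"id": 1, "status": "shipped"}))

    with pact:  # verifies the consumer really made the described request
        response = requests.get("http://localhost:1234/orders/1")

    assert response.json()["status"] == "shipped"
```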

My team often doesn’t bother testing light wrappers around external services (such as database access). If the wrapper is thin it’s unlikely to have bugs, writing a mock is expensive, and any bugs are likely to be in the assumptions about how the external service works, so a mock wouldn’t catch them anyway.

Sometimes we see a more serious problem with mocks. Because mocks have behaviors, it’s tempting to try and validate those behaviors. A delivery team can get overexcited and end up writing a suite of tests that exercise the mock. It feels like testing, but it’s not valuable. Make sure that what a test is validating is the implementation code, not any supporting mocks or test infrastructure.
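
Here is an invented illustration of the anti-pattern and the fix, using unittest.mock: the first test only exercises the mock itself, while the second exercises the implementation and uses the mock purely as a stand-in for a hypothetical payment gateway.

```python
# An invented illustration using unittest.mock. The first test only exercises
# the Mock object itself; the second exercises checkout(), with the mock
# standing in for the (hypothetical) payment gateway.
from unittest.mock import Mock


def checkout(cart_total: float, gateway) -> str:
    """Implementation under test: applies a 2% surcharge, then charges the gateway."""
    amount = round(cart_total * 1.02, 2)
    return gateway.charge(amount)


def test_the_mock_not_the_code():
    # BAD: this only proves that Mock returns what we told it to return.
    gateway = Mock()
    gateway.charge.return_value = "ok"
    assert gateway.charge(10) == "ok"


def test_checkout_applies_the_surcharge():
    # GOOD: the assertion is about checkout(), not about the mock.
    gateway = Mock()
    gateway.charge.return_value = "ok"
    assert checkout(100.0, gateway) == "ok"
    gateway.charge.assert_called_once_with(102.0)
```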

The tests all pass and production is broken

Frankly, this happens. Tests aren’t perfect, even TDD tests. I’ve sometimes found myself with a beautiful green test dashboard, and a totally non-functional application. This is a sad moment for the team, but a good learning opportunity. It’s a good time to practice TDDGBF (Test Driven DebuGging and Bug Fixing): write a failing test to show the error that is in production, and then make the test pass by fixing the implementation.

Once the issue is fixed, take a moment to reflect on why the break escaped to production. Could earlier testing have caught the issue? Did we forget to test-drive some implementation code, or was it something that we wouldn’t expect unit tests to catch? Often production failures happen because of integration issues; either the system is not wired together correctly or components aren’t interacting properly. Consider adding some lightweight smoke tests at the end of the deployment pipeline to catch trivial wiring issues. Interaction problems are more fundamental, and warrant proper test-driving. This can be a good time to look into contract tests, which work well as part of a TDD workflow. The contract framework acts as a handy mock for the service consumer, is a cheap behavior validation for the service provider, and ensures components continue to inter-operate correctly.

Every time we change the code, we have to fix the tests

This is bad. The first reason it’s bad is that the tests should be a skeleton that supports the code through refactoring. Many people advise being strict and changing either the tests or the code, but not both at the same time. If changing the code breaks the tests, you’re automatically forced to change both. While your red tests are being rewritten, they’re not helping you catch implementation bugs.

The second reason it’s bad is that it’s expensive. Tests are supposed to speed development, not slow it down by requiring constant rework.

Sometimes tests are so coupled to the internals of the code that they’re almost like a second copy of the implementation. This gives high coverage metrics, but slows development and isn’t great for catching problems. Try and limit the unit tests to testing just the ‘externals’ of each unit. If a test needs constant maintenance, consider if it’s actually adding value and finding real bugs — if not, you are allowed to just delete it.
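
As an invented illustration: the first test below pins the implementation to an internal helper and breaks whenever that helper is renamed or inlined, while the second only checks the observable behavior of the public function.

```python
# An invented illustration. register() is the public 'external'; _normalise()
# is an internal detail. The first test couples to the internal and breaks on
# refactors; the second only checks observable behavior.
from unittest.mock import patch


def _normalise(name: str) -> str:
    return " ".join(name.split()).title()


def register(name: str) -> dict:
    return {"display_name": _normalise(name)}


def test_brittle_couples_to_internals():
    # BRITTLE: fails if _normalise is renamed or inlined, even though the
    # behavior of register() hasn't changed.
    with patch(__name__ + "._normalise", return_value="Ada Lovelace") as normalise:
        register("  ada   lovelace ")
        normalise.assert_called_once()


def test_robust_checks_externals_only():
    # ROBUST: only asserts on the observable result of the public function.
    assert register("  ada   lovelace ")["display_name"] == "Ada Lovelace"
```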

TDD requires highly skilled developers. It takes time and practice to get good at it and there can be some frustration along the way. However, after you’ve developed the habits, it’s an incredibly rewarding way of developing systems.

To learn more about IBM Garage, visit ibm.com/garage.
