Keep an Engineering Case File
Work in a large enough technical organization, and eventually, people will start making the same kind of mistakes. Maybe not the exact same mistakes, but after a while, you notice that they begin to rhyme. The ones that seem to keep popping up on my professional radar are radiated emissions, audio interference, and ESD.
It’s frustrating to see similar issues pop up. After a while, you start to wonder: Why aren’t these people learning?! But if you think about it for a minute, it’s not hard to understand why they aren’t.
For one thing, not everyone has perfect memories of why project issues occur, especially if they didn’t work on them firsthand. You can’t expect everyone on an engineering team to have perfect access to everyone else’s thoughts. The inevitable passing of time only makes this worse. I can barely remember the details of bugs I worked on six months ago. More than a year? Good luck.
For another, it’s quite reasonable to expect that the person who worked to solve this class of issue last time, isn’t working on this project. In a healthy career progression, I’d assume that she learned how to design those kinds of issues out of her electronics, gained seniority, and then moved on to her next gig - either within the org, or outside of it. In either case, her departure conveniently takes all of that knowledge with her into a new role.
So, how do you make sure that tribal knowledge doesn’t turn over with the tribe members? You make a case file - a list of similar issues, with some documentation of how they were resolved. A file like this is a nice dossier on the types of tricky issues your design team keeps running into. Each individual entry is its own little case study of a type of engineering challenge you’ve overcome.
A case entry is of maximum use if you can succinctly answer a few key questions about each individual issue:
- What was the problem?
- How was it noticed?
- What are the symptoms?
- What did you do to debug the problem?
- Did you change anything that made it better?
- Did you change anything that made the problem worse?
- What was the root cause?
- What did you change in the design to fix it?
This doesn’t have to be a long, detailed, exhaustive report. In fact, I think it’s better if you keep it short. You want this information to be succinct and digestible. This serves two purposes:
- It becomes engaging training material for new employees. You can quickly educate new hires on the specific kinds of issues your team has tripped over, and how you’ve recovered. Keeping it short helps keep their attention; a parable about a design foible (rather than an epic) prevents boring your new hires to death during onboarding. (More experienced hires may even see this, think “I know a way to solve this problem!” and try to start their tenure with you on a high note by offering a process improvement. But, let’s be honest - very few managers are going to have the courage to let the brand new guy upset the working order.)
- It’s a much quicker, handier source of debug ideas when someone refers to it in desperation, trying to crack a new bug they’ve never seen before. Getting to the point helps prime the pump for debug ideas, without making the recipient feel their time is being wasted.
I think this is an extremely useful practice, either on the level of an individual contributor, or on a managerial level. If you’re an IC, learning about your colleagues’ bugs is basically getting free training from your peers. Follow the bugs filed against their projects. Ask them questions about their debug steps. Read their bug reports. You never know when you could be called upon to fix a similar problem - either at this job, or the next!
This is a powerful practice for an individual, but it multiplies drastically in effectiveness when adopted by a manager. When a manager takes time to document bugs and their solutions, she isn’t doing this just for herself. She’s building high level collateral to keep her team sharp. Imagine, for a moment, that you find out that your design is failing a test that you’ve never even heard of before. (Inconceivable to some, but I’ve read enough about multi-thousand-page requirement docs for military contractors, that I have believe it’s a thing that happens.) You bring this up to your manager in your weekly 1:1, and she replies “Huh - that sounds a lot like something we’ve seen before. Check JIRA issues 123, 456, and 789 - they might give you some ideas of how to attack this.” You now have three bugs’ worth of debug info to refer to in your hunt.
Since we’re talking about case files kept by managers, a quick aside on brevity: if you’re an engineering manager who’s keeping a case file, keep it quiet from your local Quality guy. That’s just asking for someone to “contribute” to a useful process in a way that makes it completely miserable for all participants. A case study isn’t an 8D report - you’re not using it as a process hammer to whack another team over the head. Quite the contrary. The whole point of this file is to teach people how to make technical decisions that are good for the business. Done well, a case file starts to serve as a source of institutional memory. Better still: this memory lives in digital form - not meatspace, which suffers from conditions like bit rot and career advancement.
When you keep track of a certain class of problem for long enough, yet more patterns begin to emerge. Specifically, you start to notice repeated flaws, and, more importantly, repeated fixes. Ideally, this flows into design patterns that you should implement to prevent this class of problems. That’s a good move, and a time saver, and the ultimate goal of this exercise. All the same, I’m the sort of annoyingly curious personality who always wants the why behind rules. And that’s what the case file is for: it’s the “why” behind your design guidelines.
Here’s an example of my personal case file for audio interference and degradation issues:
- Project A experienced a big audio “pop” during product boot after the CPU was redesigned.
- The pop was easily visible on a ‘scope with the new CPU design, and not apparent on the old CPU design.
- The pop vanished when we hard-enabled the mute circuit with resistors, instead of being controlled by the original sequence control hardware.
- Root cause: the new CPU design wasn’t configuring the DAC before the mute circuit was disabled.
- Fix: move the DAC software configuration routine to an earlier stage of the software boot up.
- Project B experienced a noticeable hiss and crackle when the amplifier was unmuted, and no music was playing.
- Noise was visible when an Audio Precision was used to view the FFT of the output. (Thank God for Audio Precision’s filter banks.)
- The noise decreased in intensity when the wifi radio was shut off. It spiked in intensity when the radio stress test was run. The noise vanished completely when the 3.3V rail was rerouted from under the amp output filter using wire.
- Root cause: current drawn on the 3.3V rail was coupling into the amp output filter.
- Fix: reroute the 3.3V rail!
- Project C experienced a pop during boot up, even with the mute circuit enabled.
- A small pop was noted at the audio line output, and was traced back up the audio chain to the DAC.
- Probing the DAC output revealed a fast step transient at the output.
- Root case: A review of the mute component specs showed higher than anticipated parasitic capacitance in FETs used to achieve muting.
- Fix: Add a T filter of series resistors and a single capacitor between the DAC and mute circuit, to attenuate the fastest part of the signal before it reaches the mute FETs.
I’m the kind of person who loves collecting and cataloging information, so I keep a personal case file in my Evernote backlog. You could just as easily make up a case file system with any number of other tools. A list of closed Jira issues, a Confluence page, or even a Google Slide deck could do the job quite well.
Do you keep a file like this? Do you find it useful? Is there anything I’m not recording that I should be? Let me know at firstname.lastname@example.org.