Redirection of Organizational Pain
2024-04-04
Working at a large tech company means often working with people who are not within the same org/sub-org as you. In fact, being able to communicate and coordinate across organizational boundaries should be considered a core competency for any engineer. If you're lucky, the priorities of the orgs you're working with will be aligned with your own. There's a kind of zen to this state. Everyone is rowing in the same direction. You'll feel wonderfully efficient.
More often than not though, at least some of your priorities won't be aligned. Sure, you're all working towards the same, large end-goal. But what needs to be worked on right now will almost certainly be different. I don't think this is necessarily a failure mode.
Ensuring Responsibility
You just got paged. The alert is low-priority, so it's not too big of a deal. Still, it's not great that the alert is firing. You do a little digging and realize that one of the services your service depends on has recently deployed a breaking change. You reach out to their on-call, agree that there is indeed a problem on their end, and they promise to fix things shortly.
Except they don't. Or maybe they just can't. At least not right away. Maybe it's not even their fault. Maybe one of their dependencies recently changed and they now find themselves backed into a corner. Maybe they're understaffed. Either way, this problem isn't getting fixed any time soon and your alert is still firing. The team agrees to look into the problem, but makes no firm commitment on a full fix. They've got bigger problems at the moment, i.e. priorities that don't align with yours.
Three months later you still have a broken alert and other teams are complaining that you're technically violating your SLAs.
The solution here is surprisingly simple: change who gets the alert. This is not a problem you're capable of fixing, so what good does it do for you to receive the pain? If you yourself have customers that are now experiencing issues, make it super clear where the problem actually lies, through either improved error messages or other automation. Redirect the pain to those who are actually empowered to fix the problem[1].
Given a big enough issue, I guarantee this will get sorted out quickly. And if it doesn't, well that's fine, you're not getting paged anymore.
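To make "change who gets the alert" a bit more concrete, here's a minimal sketch of what the redirection could look like. Everything in it is made up for illustration: the service names, the rotation names, the `Alert` type, and both functions are hypothetical, and a real paging system will have its own routing configuration. The idea is just that an alert whose root cause is a known upstream dependency pages that dependency's owners, and the error message your own customers see says where the problem lies.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical ownership map: upstream dependency -> on-call rotation.
DEPENDENCY_OWNERS = {
    "billing-api": "billing-oncall",
    "auth-service": "identity-oncall",
}

# Hypothetical rotation for the team that owns the alerting service.
DEFAULT_ROTATION = "my-team-oncall"


@dataclass
class Alert:
    name: str
    # Set only when the root cause has been traced to an upstream dependency.
    failing_dependency: Optional[str] = None


def route_alert(alert: Alert) -> str:
    """Return the rotation that should be paged for this alert."""
    if alert.failing_dependency is None:
        return DEFAULT_ROTATION
    # Unknown dependencies fall back to us rather than paging nobody.
    return DEPENDENCY_OWNERS.get(alert.failing_dependency, DEFAULT_ROTATION)


def customer_error_message(alert: Alert) -> str:
    """Build an error message that says where the problem actually lies."""
    if alert.failing_dependency is None:
        return f"{alert.name}: internal error, our on-call has been notified."
    return (
        f"{alert.name}: upstream dependency '{alert.failing_dependency}' "
        f"is failing; routed to {route_alert(alert)}."
    )
```

Note the fallback: if nobody is known to own the failing dependency, the page still lands on your own rotation. Redirecting pain shouldn't mean the pain disappears into a void.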
This is not to say you should wantonly shirk responsibility for issues that your team ostensibly owns. It's just an acknowledgement that you're not always in the appropriate position to fix an issue. In such cases, it's actually healthier for the company at large to have the pain redirected to the right place. Can you imagine a human who, should they put their hand on a hot surface, felt a burning sensation in their foot? Surely, evolution would quickly weed out such a creature.
A Spirit of Cooperation
In any large company, there are going to be competing incentives and priorities. That's just the game. A good engineer only resorts to the solution described above after all other attempts to remediate the issue have failed. Note that in the previous scenario, the first thing the alerted engineer did was reach out to their organizational counterpart. Only once all attempts to persuade and cajole have failed should you resort to the redirection of pain. But at that point, use it without remorse, confident that you're actually doing what's best for your company at large.
Footnotes
[1]: Theoretically, there is another solution: take ownership of the other team's project. Then all of a sudden you can fix the problem! This is a pretty drastic measure though, so attempt any land grabs with extreme caution.