Friday, December 7, 2012

Production Mistakes

Once upon a time I had a very bad scare. One of my responsibilities was to maintain a vendor-created application that handled the flow of commodity trades on an exchange into our risk management system. That particular aspect of my job was not my favorite part, but someone had to do it, right?

This vendor application was primarily a configuration-driven monstrosity. Messages are in XML, and the conversion from the format the exchange provided us to the schema our back-end systems required was done primarily with a tag-mapping approach. For example, if the exchange provides us with an XML message and the counterparty in the deal is represented as an ID, say 12345, the application maps that number to the mnemonic used by our system, say BGKG. Simple, right?
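
To make that concrete, here is a purely illustrative before-and-after sketch; the element names and values are made up, since I can't reproduce the exchange's or the vendor's actual schemas here:

```xml
<!-- Inbound message from the exchange (hypothetical element names) -->
<trade>
  <counterparty>12345</counterparty>
</trade>

<!-- Same message after the tag mapping, in our back-end schema (also hypothetical) -->
<trade>
  <counterparty>BGKG</counterparty>
</trade>
```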

Yes, it is very simple. It also doesn't scale for a damn. There were around 500 of these mappings in production. And that's just the tag-mapping portion. That doesn't take into account any business-logic-driven mappings, XSL files that mangle the XML, etc. This quickly becomes a pain point for maintenance, especially when there are multiple places for logic to hide on any given mapping. The logic could live in the XSL outside the application, in the message logic built into the application, in the templates for each message route, and in several other places. On top of this, there was nothing preventing logic affecting the same element(s) from being spread across ALL of these locations!

Now that you have more background than you care about, I will actually get to the scary part. When I first started working on this, the traders who actually execute the trades on the exchange each had an ID assigned by the exchange. The means of mapping this to our system ID was to keep a list of all the traders' exchange IDs in an XSL file, test for a match, and replace the exchange ID with the local system ID. This is ugly for several reasons. First, it's long, tedious, and error-prone to maintain this list. I don't know when it had last been updated before I saw it, but I know that they never removed anyone from that list, ever. Even if someone left the company, they stayed there. Second, it means that the Risk Management group (the business unit owner of the application) was dependent on IT development (i.e., me) to make a code change and promote it from the development environment through the QA environment to the production environment every time they wanted to add a trader to the system. This can become painful for a company that grows through acquisition on a regular basis, by the way. I could go on, but you get the point, and I'm sure you can come up with your own reasons I haven't thought of as to why this is a bad way to handle it.
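
For anyone who hasn't seen this pattern, the old XSL looked something like the sketch below. The element name, IDs, and mnemonics are invented for illustration; the real file was just a much, much longer version of the same xsl:choose.

```xml
<!-- Hypothetical sketch of the hard-coded trader lookup; IDs and mnemonics are made up -->
<xsl:template match="trader">
  <trader>
    <xsl:choose>
      <xsl:when test=". = 'X1001'">JSMITH</xsl:when>
      <xsl:when test=". = 'X1002'">MJONES</xsl:when>
      <!-- ...hundreds more entries, none ever removed... -->
      <xsl:otherwise><xsl:value-of select="."/></xsl:otherwise>
    </xsl:choose>
  </trader>
</xsl:template>
```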

I wanted to empower the business users of this application to be able to manage their own traders in the system. They do it for all the other aspects and this is supposed to be configurable by the business users anyway. Configuration changes don't have to go through the same channels as development and I don't want a call while on vacation to run something like this through development to production because they can't do it themselves.

I started by creating a simple DB schema that holds lookup tables mapping exchange IDs to our system IDs (and a few other things). Nothing complicated, but once that kind of mapping is in a database, I can stand a web front end up on it and make that available to the users. Tada!! Now they can configure to their hearts' content without bothering me.
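
The table itself was about as simple as they come. Something along these lines, though the table and column names here are illustrative, not the real schema:

```sql
-- Illustrative only; the real table and column names were different
CREATE TABLE trader_mapping (
    exchange_id  VARCHAR(20) NOT NULL,
    system_id    VARCHAR(20) NOT NULL,
    PRIMARY KEY (exchange_id)
);
```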

Now, life being what it is, while I had converted the mappings to use the database and promoted that change to production several months earlier, I had not set up the web front end yet.

We were testing changes in QA that would allow us to trade financial instruments we had never traded before. Obviously we needed to make sure that the new trade types worked as expected, but also to do some regression testing and make sure that these changes hadn't broken existing trades. We were getting some odd results from the new trades in QA, so I went investigating. After discounting all the obvious answers, I went back to basics and looked at the fundamentals again. It turns out that the data source being used for the table lookups was not providing the right answers because it was pointed at the wrong database. It was still looking at the development environment.
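
For what it's worth, the kind of thing I'm talking about is an application-server data source definition. Sketching from memory and assuming a JBoss-style *-ds.xml (the JNDI name, host, and driver here are hypothetical, and this vendor app may well have used something else), the smoking gun was a connection URL that never got changed during promotion:

```xml
<!-- Hypothetical *-ds.xml; JNDI name, host, driver, and credentials are made up -->
<datasources>
  <local-tx-datasource>
    <jndi-name>TraderMappingDS</jndi-name>
    <!-- This environment should have pointed at its own DB host; it still said dev -->
    <connection-url>jdbc:oracle:thin:@dev-db-host:1521:DEV</connection-url>
    <driver-class>oracle.jdbc.driver.OracleDriver</driver-class>
    <user-name>app_user</user-name>
    <password>...</password>
  </local-tx-datasource>
</datasources>
```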

I had a sudden cold chill, and after the shivers stopped I dared to look at the configuration in production and saw that yes, indeed, production was pointed to the development environment. I thought I might have a heart attack. Now, kids, this is emphatically NOT anything you ever want to have happen to you. Bear in mind that this industry was heavily regulated under SOX. Any idea what would happen if a SOX auditor got wind of something like that? Phrases like 'career-changing learning experience' spring immediately to mind.

Now, the good news is that all's well that ends well. Everything was fine, we got it all changed without a hiccup, and life went on. The part that scares me is what might have happened. I mean, this is development, man. If I had taken it into my head at any point during those intervening months to blow away the entire database, much less wipe a table (both of which I can do, will do, and are perfectly valid things to do in development), what kind of damage would have been done? Fortunately, not a lot, because recovery would have been simple (thank GOD), but how long would it have taken for us to figure out what happened? It took me a couple of days of poking around to finally decide I should go back to basics and check the data source configuration at the application server level.

My point here is that mistake made me want to crap my pants. It was amateur-hour kind of stuff that I should never have let happen. Even if I wasn't the one actually handling the deployment, I sit over the shoulder of the SA and DBAs for almost every production deployment I make, and I should have seen it, thought about it, or something! But despite all that, it was good for me in the end. Like most everyone who has ever done something as boneheaded as that, you can be sure I will be a damned sight more careful with my deployment instructions and double-check both my and the deployment engineers' work. It also reminded me that while things like vendor applications are a fact of life sometimes and you don't have an actual build process for them, there is nothing preventing you from creating a deployment build for one. I might not have to compile code, but I can damned sure set up the build server to check out a tokenized version of any configuration files that have changed and have those tokens replaced with the proper environment settings for whatever environment is being promoted to. You can also set up such a build so that it breaks on deployment if you forgot to put in the environment's DB password, for example. You really don't store passwords in your source code repository, right? Right??
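
As a sketch of what I mean, assuming an Ant-based build and @TOKEN@-style placeholders (neither of which is required, it's just what I'd reach for), the deployment build can filter per-environment values into the tokenized configs and break loudly if anything is missing or left unreplaced:

```xml
<!-- Hypothetical Ant target; property names and paths are illustrative -->
<target name="prepare-config">
  <!-- ${env} would be dev, qa, or prod; its values live in env/${env}.properties -->
  <property file="env/${env}.properties"/>
  <!-- Break the build early if the per-environment password was never supplied -->
  <fail unless="db.password" message="db.password is not set for environment '${env}'"/>
  <copy todir="build/config" overwrite="true">
    <fileset dir="config-templates"/>
    <filterset>
      <filter token="DB_URL" value="${db.url}"/>
      <filter token="DB_PASSWORD" value="${db.password}"/>
    </filterset>
  </copy>
  <!-- And fail if any @TOKEN@ placeholder survived the filtering -->
  <fail message="Unreplaced tokens found in build/config">
    <condition>
      <resourcecount when="greater" count="0">
        <fileset dir="build/config">
          <containsregexp expression="@[A-Z_]+@"/>
        </fileset>
      </resourcecount>
    </condition>
  </fail>
</target>
```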

The only real defenses against mistakes like these are to be disciplined and diligent in your pursuit of the perfect build process to automate these tedious things, and/or to document them. I suck at documentation, but better documentation would have saved much time and headache here. Obviously, I was not diligent or disciplined enough in my build process, mostly because until that event happened, I really didn't think of it as a build process. My only defense is that working with vendor products is not something I have had a lot of experience with. I mean, I can create and deploy a JBoss server with whatever configuration files it needs, but that was always in the context of writing actual code. That put my brain in a completely different mindset than the one I was in when thinking about how to configure this vendor application in a more useful fashion.

Ah well, live and learn.
