an obsession with first principles

Data Architecture in Complex Projects

Posted: Wednesday Feb 22nd | Author: JohnO | Filed under: Design, Programming

This post is an attempt at a more comprehensive version of a thought I expressed in a previous post. It’s also an attempt to think in depth about something besides today’s annoying politics.

Every project starts off as an MVP. Which means it starts off small in scope, and as simple as it can possibly be. The path of least resistance for every project is a normalized relational database accessed via an OOP ORM. The problem is that this works very well. When projects are simple and small this is undeniably the fastest and most bug-free approach. Some projects grow in size, but remain simple. There are just more UI elements, more records to be tracked. But in a real way, the project remains a simple implementation of simple business rules where there are few considerations, few conditional code paths, and little interdependence between elements of the system.

There are many different kinds of complexity in business systems.

Historical data is the complexity where normalized relational data fails the hardest. In these systems there is a core set of data around which all of the business rules operate. When you have complex business rules and unexpected results you’re trying to explain, and you do not have a clear record of the operations performed on the historical data, you are going to have a really hard time explaining, understanding, and ultimately fixing those results. Normalized relational data always retains the current value(s) of the data set while discarding previous values, and you have to fight hard to keep previous values around.
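A minimal sketch of the alternative (all names here are illustrative, not from any real project): append every operation instead of overwriting in place, and derive the current value from the record.

```python
from datetime import datetime, timezone

# Hypothetical account balance. A normalized table would keep only the
# current value; here we append every operation and derive the value,
# so the history that explains an unexpected result is never discarded.
history = []

def apply_operation(op, amount):
    history.append({
        "op": op,
        "amount": amount,
        "at": datetime.now(timezone.utc).isoformat(),
    })

def current_balance():
    # The present state is computed, never stored destructively.
    total = 0
    for entry in history:
        total += entry["amount"] if entry["op"] == "credit" else -entry["amount"]
    return total

apply_operation("credit", 100)
apply_operation("debit", 30)
# current_balance() is 70, and every step that produced it is still on record
```

When a number looks wrong, the answer to "how did it get that way?" is sitting in `history` instead of being gone.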

The complexity of multi-user interfaces is where OOP ORMs fail the hardest. Once more than one user is operating on an object/record/dataset, you instantly create a problem where the last update wins and clobbers previous changes. These issues can create serious problems and unexpected results in your dataset without any clear path to resolution. Your solutions are inherently limited, and then further limited by the constraints of the rest of your architecture. Getting a fully eventually-consistent backend is highly unlikely because of the time and effort you’d need to complete it.
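One of those inherently limited remedies is optimistic locking: each record carries a version, and a stale write is rejected instead of silently clobbering the other user’s change. A sketch, with made-up names:

```python
# Each record carries a version number. A writer must present the version
# it originally read; if the record moved on in the meantime, the write
# is rejected rather than applied last-wins.

class StaleWriteError(Exception):
    pass

record = {"value": "draft A", "version": 1}

def save(rec, new_value, expected_version):
    if rec["version"] != expected_version:
        raise StaleWriteError("record changed since you read it")
    rec["value"] = new_value
    rec["version"] += 1

save(record, "draft B", expected_version=1)      # user 1: succeeds
try:
    save(record, "draft C", expected_version=1)  # user 2 also read v1: rejected
    conflict = False
except StaleWriteError:
    conflict = True
```

This doesn’t resolve the conflict for you; it only makes it visible, which is already more than last-update-wins gives you.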

A smaller complexity that does exist is around reporting. Every business process needs reporting. It’s never the first thing you write; it’s always the last. The performance needs of your operational system always come first. Optimizing your relational data, indexes, etc. for daily operations means it’s never optimized for reporting.

Principles That Prevent Problems

Keep the core data your business rules revolve around in your language’s native types (for me in Python that means lists, dictionaries, tuples, strings, ints, and Decimals). Doing this means you can store it in all the various places you’ll need to. If you have to put it in a relational database, you can store it as a JSON-encoded string, which you can then parse with standard libraries back into your language’s native types. Or, if your database supports native JSON storage, even easier. If you have a document store available it will go there without a problem as well. Try to avoid storing your base data set as a set of related records in a database: it means that all of your main business rules are then *forced* to run through your ORM layer code. In a complex system this is very limiting, as I outlined above.
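The round trip is cheap with the standard library. One wrinkle worth showing (data here is invented): JSON has no native Decimal type, so Decimals need explicit handling on the way in and out.

```python
import json
from decimal import Decimal

# Core data kept as plain Python types. The whole structure can go into a
# TEXT/JSON column, a document store, or a cache without an ORM in the way.
order = {
    "id": 17,
    "lines": [{"sku": "A-1", "qty": 2, "price": "9.99"}],
    "total": Decimal("19.98"),
}

def encode(data):
    # Serialize Decimal as a string so it survives the round trip exactly
    # (floats would silently lose precision).
    return json.dumps(data, default=str)

def decode(raw):
    doc = json.loads(raw)
    doc["total"] = Decimal(doc["total"])
    return doc

stored = encode(order)       # what actually lands in the database column
restored = decode(stored)    # back to native types, no ORM required
```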

When you have recorded all inputs and operations and still have the original data set, you can always replay those operations. So if a serious business rule changes, you can easily run simulations to see what the outcome of that change will be. And you don’t have to write extra/repeated code or migrations to change your data: you’ve already written it, in one place, and tested it. This architecture pattern allows you to write more functional code, which is easier to write, read, and test. It also allows you to move more easily toward micro-services/distributed servers when your load demands it.
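Replay is just a pure function over the recorded operations, so simulating a rule change means calling the same function with different parameters. A toy sketch (the overdraft fee is an invented rule):

```python
# The recorded inputs: everything needed to re-derive state from scratch.
operations = [("deposit", 100), ("deposit", 50), ("withdraw", 180)]

def run(ops, overdraft_fee):
    # A pure function: same operations in, same state out. The business
    # rule under consideration is a parameter, not a migration.
    balance = 0
    for kind, amount in ops:
        if kind == "deposit":
            balance += amount
        else:
            balance -= amount
            if balance < 0:
                balance -= overdraft_fee
    return balance

today = run(operations, overdraft_fee=0)        # current rules
simulated = run(operations, overdraft_fee=25)   # what would the change do?
```

No second code path, no data migration: the simulation reuses the one tested implementation.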

Creating a second system for reporting, where data is ported into it (either in batches or in real time) and then reported on, is a necessary step. It also creates a good pattern: if your reporting data is ever found to be wrong or buggy, you can always regenerate it, because its existence is based on a record kept elsewhere.
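Concretely, the reporting store is a pure projection of the source records (names and data invented here), so "fix the bug, drop the report, rebuild" is always available:

```python
from collections import defaultdict

# Source-of-truth records, living in the operational system.
sales = [
    {"region": "east", "amount": 120},
    {"region": "west", "amount": 80},
    {"region": "east", "amount": 40},
]

def rebuild_report(records):
    # The reporting side is derived, never authoritative: if it is ever
    # found to be wrong, throw it away and regenerate from the source.
    totals = defaultdict(int)
    for rec in records:
        totals[rec["region"]] += rec["amount"]
    return dict(totals)

report = rebuild_report(sales)
```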


RDBMS and Data

Posted: Wednesday Mar 23rd | Author: JohnO | Filed under: Programming

I am coming to the realization that certain patterns and systems need a non-standard idea. Well, it’s not entirely non-standard; it’s becoming quite standard: the document store. The patterns I am currently dealing with are interesting. Normally you would think they are perfect for RDBMS operations, and lots of the characteristics of the system really are suited to it. Those wouldn’t change. However, when it reaches a level of interdependence where each piece of data can’t actually contain its own logic, because it has no meaning without other pieces of data, this is where the RDBMS falls down for me. You have to fight against it to get good performance, all because there is one central piece of logic that needs all the pieces of data. And that means multiple JOINs over many different indexes.

This central black-box logic that you continually use to determine the state of your data will be run a lot, from a lot of different places, with a lot of different parameters, because the business needs dictate it. This makes dealing with the limits of an RDBMS problematic. It would make much more sense to have a simple data structure holding all this info. Sure, that can get stored in a record of the RDBMS, but it is still flat there. The issue isn’t that the data is dense; there are lots of flags and details, but it’s relatively small. The issue is that there are lots of records. Iterating over them in memory is by far the better choice. I only wish I had seen it sooner.
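The in-memory version of that central logic is just a pass over flat documents as plain dicts, instead of expressing the same question as multi-JOIN SQL. A sketch with invented data and names:

```python
# Flat documents pulled once from the store. Each record is small --
# a handful of flags and details -- there are just a lot of them.
documents = [
    {"id": 1, "flags": {"active": True},  "weight": 3},
    {"id": 2, "flags": {"active": False}, "weight": 5},
    {"id": 3, "flags": {"active": True},  "weight": 2},
]

def central_logic(docs, minimum_weight):
    # The "black box" that needs every piece of data at once. In SQL this
    # is JOINs across several indexes; here it is a plain loop, callable
    # from anywhere with any parameters.
    return [d["id"] for d in docs
            if d["flags"]["active"] and d["weight"] >= minimum_weight]

state = central_logic(documents, minimum_weight=2)
```

The same function runs unchanged from a request handler, a batch job, or a simulation, with whatever parameters the business asks for.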

There is one more issue that moving it out of the relational system and into a document store solves: historicity, both in terms of data and in terms of code. When you want to change the “present” object you are otherwise stuck changing all the past data that references it, which can have obviously negative side-effects. You can instead branch on the presence of a key within the data store (e.g. new versions and releases will introduce new keys) rather than having to run a data migration to update the new, now non-historic, RDBMS field accordingly.
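A tiny sketch of branching on key presence (the "discount" key is an invented example of a field a new release introduces):

```python
# A document written before the release simply lacks the new key;
# a document written after it carries the key. No migration touches
# the historic records.
old_doc = {"version": 1, "price": 100}
new_doc = {"version": 2, "price": 100, "discount": 10}

def effective_price(doc):
    # Absence of the key means the old behavior; presence means the new.
    return doc["price"] - doc.get("discount", 0)
```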

I hope that in the future I get a project, and that I have enough foresight, to see this pattern coming and make the appropriate decisions in the beginning to more easily facilitate this kind of architecture.


Client Side Thoughts – All the Rage These Days

Posted: Saturday Jan 28th | Author: JohnO | Filed under: Programming

A programming post! I know, it’s a bit unusual considering all that’s happening in the world. But this one is worth it. All the rage these days is a gorgeous, smooth, animating, fully javascript-enhanced experience for web applications. And I’ve been building them. The really interesting part is the choice of technologies to use, particularly on the backend. We’ve stayed with the tried and true django, which, out of necessity, requires that certain decisions be made. And django is a wonderful platform I truly couldn’t imagine working without.

Lots of the current javascript enhancements are going on with “slimmer” backends (Mongo, Couch, Cassandra, etc). Which means that a lot of the traditional things a backend framework does are not done. So their front-end has to pick up the slack. In many cases this means defining models, and url routes. For those of us using beefy backends this definition is already done, and to replicate it on the front end would be ridiculous. So I feel that the beefy backends of the world are being left out by the adventures in javascript. It should not be so.

I’m already templating on the client side with @getify’s HandlebarJS, which brings along a nice Promise pattern implementation. He has also written a nice Gate pattern implementation that I use. Of course I’m using jQuery, along with ajaxForm to handle form posts asynchronously. And now we are using Plupload to deal with uploading media (ajaxForm does this with an iframe hack that is not very pretty). I’ve written some nice utilities for dumping django model data into JSON, a Command pattern implementation, and a Databinding implementation on the client side as well.

The trick I would like to work on is packaging all this goodness into a unified package that takes advantage of django in a much cleaner way. Getting class information and URL route information onto the client side is the first step. Wrapping the JSON returned from the server with this class info and these URL routes, tied together with the Databinding interface, gets you a model layer with instance caches, while that model layer is defined in django. With a couple more nice additions, a lot of maintenance/housekeeping code won’t need to be written.
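A rough sketch of the shape of that payload (no real django here; the function, model name, and URL pattern are all made up): wrap the serialized instances with their model identity and route, so the client-side model layer is driven entirely by what the server already knows.

```python
import json

def dump_instances(model_name, url_pattern, rows):
    # Bundle the class info and URL route alongside the data, so the
    # client never has to redefine models or routes on its own.
    return json.dumps({
        "model": model_name,
        "url": url_pattern,   # e.g. resolved server-side from the urlconf
        "objects": rows,      # plain dicts, one per instance
    })

payload = dump_instances(
    "blog.Post",
    "/api/posts/{id}/",
    [{"id": 1, "title": "Hello"}],
)
decoded = json.loads(payload)   # what the client-side model layer consumes
```

Everything the client needs to cache instances and build request URLs rides along with the data itself.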

The next thing that has jumped into my mind is pushState. But I haven’t played with it enough to know how to implement it in a “framework” type of way.

This client should be written to be entirely decoupled from the actual backend. You are dealing in URLs, JSON blobs, DOM elements, forms, events, and async calls. The fact that django provides them is inconsequential. You could start with a django backend because of all its benefits in getting up quickly, and you can migrate over to sharded DBs or non-relational DBs or other data sources (e.g. memcache) as you need to scale.

This project’s first philosophical belief is to get up and running without duplicating your work on the client side (D.R.Y.). Its second would be to work purely with native types as much as possible (strings, JSON, DOM). Its third would be modularity/loose coupling: since it’s built on native types as much as possible, you can use the parts you want, and only the parts you want.


Versioning and Releasing

Posted: Thursday Jan 6th | Author: JohnO | Filed under: Leadership, Management, Programming

OK, I wrote about this a while ago. I still see lots of people wondering about just how to do this. The local Boston django group has about four people who submitted ideas around project management and deployment.

In the abstract you really need three things: 1) a distributed version control system (I prefer git), 2) a database migration system (in the rails world, rake, in the django world south, in the php world you can use manual sql files for both up/down migrations with a simple DB table keeping track – it is essentially what rake and south do, only without an ORM), and 3) a system that gets you onto all of your servers to run the necessary commands/scripts (again, in the rails world capistrano, in the django world fabric, in the php world make some shell scripts).
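The "simple DB table keeping track" approach from point 2 is small enough to sketch in full (table and column names are illustrative; sqlite stands in for whatever database you use):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE schema_migrations (version INTEGER PRIMARY KEY)")

# Numbered up-migrations; in the php-style setup these would live in
# plain .sql files on disk.
migrations = {
    1: "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)",
    2: "ALTER TABLE users ADD COLUMN email TEXT",
}

def applied_versions(c):
    return {row[0] for row in c.execute("SELECT version FROM schema_migrations")}

def migrate(c):
    # Apply, in order, only the migrations this database has not seen.
    # This is essentially what rake and south do, minus the ORM.
    for version in sorted(migrations):
        if version not in applied_versions(c):
            c.execute(migrations[version])
            c.execute("INSERT INTO schema_migrations VALUES (?)", (version,))

migrate(conn)   # applies 1 and 2
migrate(conn)   # idempotent: the tracking table says nothing is pending
```

Down-migrations are the same idea in reverse: a second set of statements plus a `DELETE` from the tracking table.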

I have found it easiest to create feature branches in git for new features, with your main production server running off of master. To write new features you just create a branch off master. If these features need QA or user acceptance testing, you can create a point-release branch to push to those servers only the features you need at any time (or, if it is not that busy, just push the feature branch).

(Aside: I always work in a local branch. So if I am working on master features, I create a master_local. If I am working on feature_a, which is a remote branch, I create feature_a_local. This enables you to rebase upstream changes from master/feature_a without messing up the distributed history. It also prevents the useless merge commits you get when a pull causes a merge.)

I am against a release branch. I am fine with QA/user acceptance testing branches. This is the difference: if the QA branch gets messed up, fine, create another one. If the branch for the client’s preview gets messed up, create another one. (If you’re squashing your commits it makes it even easier to keep track of things.) But once something is ready for release you just don’t want to touch it. Only things you are absolutely positively sure of get on master. Nothing else. That is an easy rule to enforce. Master is production. You mess with master, you’re in trouble. With point-release branches you get the endless wondering of “wait, which version is production, and which is QA, and which is the client’s?”. It is fine for a QA or client branch to get messed up because someone merged to the wrong point branch. But when you have master checked out and you type `git merge` you had better be doing the right thing. It is a lot harder to confuse `master` with `1.0.11.34` vs `1.0.12.1`.

Some people worry about reverting. With git it is easy to revert. If you are pushing lots of things you should be squashing your releases down to one commit (that is, when you merge your feature branch into master, squash it first). So if your release goes poorly you can migrate off whatever you need (or don’t), and then git checkout the commit before the release.

Some people worry about atomicity. Git is built from a filesystem perspective so there is no lag time where any files are out of version with one another. In any case, once you’re making DB changes in addition to your code pushes – any atomicity (in milliseconds) is ruined since the DB takes seconds (in the case of large indexes, perhaps even minutes) to get up to date. If you’re worried about atomicity stop the webserver gracefully. If you’re worried about down-time – you shouldn’t be. If you gracefully take apache down, and seconds later bring it back up it is unlikely any user will ever notice. If you’re worried about it – release when your system is least in use. (In the case of rails and django you have to stop the webserver when releasing to ensure that your updated source files get re-compiled.)

And I will add one more tip, for client caching. Browsers ought to cache static content aggressively; as a developer it is your job to deliver new application code to them. You will want to add a version to the querystring of all the CSS and JS that you send out in your application. So when you make a change to a statically served file, you change the version on that querystring (in a settings file) so that no user is left with a stale copy cached.
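The whole trick is one constant and one helper (both names invented here): templates render every static URL through the helper, and bumping the constant in settings busts every client cache at once.

```python
# One version constant, kept in a settings file. Bump it on any change
# to statically served CSS/JS.
STATIC_VERSION = "1.0.12"

def static_url(path, version=STATIC_VERSION):
    # Append the version to the querystring so a bump forces browsers
    # to re-fetch, while unchanged versions stay cached aggressively.
    return f"{path}?v={version}"

css = static_url("/static/site.css")   # "/static/site.css?v=1.0.12"
```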

All in all I really don’t see the huge headache in this area. I hope that we (Active Frequency) can put together a presentation on the specifics of our django deployment process (fabric, south, git/mercurial). (We also use virtualenv to sandbox our pypi and django version dependencies.)


HTML Forms

Posted: Tuesday Sep 21st | Author: JohnO | Filed under: Programming

OK, I must rant on this. With the rise of incredibly interactive pages, HTML forms are becoming a serious problem for me. Maybe I am just doing it wrong, but the fact that forms must contain their inputs is starting to get to me. This becomes especially true when tools such as the jQuery plugin AjaxForm are used to make life easier. Browsers mangle the POST data when you put a form inside a form. Meanwhile, CSS layout is still heavily dependent on source order. It creates a maelstrom effect when you want a specific look, some data is interactive, and some is ultimately POSTed normally.

Proposal

<form name="formname"></form>
<input type="..." belongs="formname">

And the browser treats it as if it were contained normally. Forms rarely need to be containers: they are rarely styled, they are purely interactive. Thoughts? Anyone?


Managing Dependency

Posted: Wednesday Aug 18th | Author: JohnO | Filed under: Programming

Huzzah, a post about programming! I am consistently miffed, nay pissed off, when more dependencies are introduced. I do not mean using libraries to Get Things Done. I mean writing what amounts to pseudo-code: writing structs* (in whatever language) that get parsed and turned into objects and executable code. Unless you’re writing in a language where code is data, do not do this. Pseudo-code will never be flexible enough to do what you will need it to. Unless you are purposefully handcuffing the developer, don’t do it.

This is why I love python and django. You can actually execute statements in the class-space; in many other languages you cannot. Hallelujah! I can instantiate objects right there, instead of having to do it later in another function while a struct* sits in its place. The only dependency in django is the python language itself. You may, or may not, use the other django libraries, and it is often beneficial to do so. But when it comes down to needing to alter how something gets done, you are not dependent on the ordering of the framework itself, since it can always be overridden at the python level.
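A small illustration of what "executing statements in the class-space" buys you (the `Field` class here is a stand-in of my own, not django's, but the declarative pattern is the same one django's Model and Form classes rely on):

```python
class Field:
    def __init__(self, label):
        self.label = label

class ContactForm:
    # These statements run when the class statement itself executes:
    # real instances exist immediately, instead of a struct sitting
    # here waiting to be parsed into objects by some later function.
    name = Field("Your name")
    email = Field("Email address")
    fields = (name, email)   # can even reference them in the same body
```

Nothing had to be deferred or interpreted: by the time `ContactForm` is defined, its fields are live objects you can inspect, override, or subclass away.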

*I don’t mean a C struct; I mean any semantically ordered piece of meta-data that your code relies upon. Generally these take the form of arrays, maps, and hashes.


Technology Woes

Posted: Thursday Jun 17th | Author: JohnO | Filed under: Programming

I have to get this rant out; explosion is imminent. First the lesser offender: Ticket: 13265. I absolutely love how Django rolled in their transaction support, through the middleware setting and the decorator. I think it is absolutely perfect from a design standpoint. I, however, lost a lot of hair today over its implementation. It hides perfectly good (read: the bad one you actually want) exceptions from you. I had no idea why it was failing. Turns out, of course, it is still my fault. But don’t hide it from me, guys. Show me the error of my ways through the correct stack trace!

Update: Ticket: 6623 is out there (mine is a dupe). You can see no one has touched it in over a year. It was originally slated for release 1.0; we’re on 1.2. I don’t want to tell you how this makes me feel.

This one is far, far worse, which means it has to do with IE (7 and 8; why bother checking 6, honestly). And javascript. And form elements. Let yourself be warned: if you ever, at any time, try to manipulate the “checked” status of a radio button or checkbox, and that node is not attached to the DOM at any point in its animated and exciting life, that “checked” status you so desperately needed falls away into the ether. Thank you, IE. Once again, thank you.