an obsession with first principles

Data Architecture in Complex Projects

Posted: Wednesday Feb 22nd | Author: JohnO | Filed under: Design, Programming

This post is an attempt at a more comprehensive version of a thought I expressed in a previous post. It’s also an attempt to think in depth about something besides today’s annoying politics.

Every project starts off as an MVP, which means it starts off small in scope and as simple as it can possibly be. The path of least resistance for every project is a normalized relational database accessed via an OOP ORM. The problem is that this works very well. When projects are simple and small this is undeniably the fastest and most bug-free approach. Some projects grow in size but remain simple: there are just more UI elements, more records to be tracked. In a real way, the project remains a simple implementation of simple business rules, with few considerations, few conditional code paths, and little interdependence between elements of the system.

There are many different kinds of complexity in business systems.

Historical data is the complexity where normalized relational data fails the hardest. In these systems there is a core set of data around which all of the business rules operate. When you have complex business rules and unexpected results you’re trying to explain, and you do not have a clear record of the operations performed on that data over time, you are going to have a very hard time explaining, understanding, and ultimately fixing those results. Normalized relational data retains only the current value(s) of the data set while discarding previous values, and you have to fight hard to keep them.
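The alternative can be sketched as an append-only log of operations, with the current value derived by replay rather than stored as the only copy. This is a minimal illustration, not a production event store; the entity IDs and operation names are made up for the example.

```python
import json
from datetime import datetime, timezone

# Each change is appended as an event; nothing is ever overwritten.
# In a real system this list would be an append-only table or log.
events = []

def record_event(entity_id, operation, payload):
    events.append({
        "entity_id": entity_id,
        "operation": operation,
        "payload": payload,
        "at": datetime.now(timezone.utc).isoformat(),
    })

def current_value(entity_id):
    # Derive present state by replaying every recorded operation,
    # so the full history is always available for inspection.
    value = 0
    for e in events:
        if e["entity_id"] == entity_id:
            if e["operation"] == "set":
                value = e["payload"]
            elif e["operation"] == "add":
                value += e["payload"]
    return value

record_event("acct-1", "set", 100)
record_event("acct-1", "add", 25)
print(current_value("acct-1"))  # 125, with both steps still on record
```

When a result looks wrong, the explanation is sitting right there in the event list, instead of having been destroyed by the last UPDATE.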

The complexity of multi-user interfaces is where OOP ORMs fail the hardest. Once more than one user is operating on an object/record/dataset, you instantly create a problem where the last update wins and crushes previous changes. These issues can create serious problems and unexpected results in your dataset without any clear path of resolution. Your solutions are inherently limited, and then further limited by the constraints of the rest of your architecture. Getting a fully eventually-consistent backend is highly unlikely because of the time and effort you’ll need to complete it.
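One of the limited mitigations available is optimistic locking: each record carries a version number, and a write only succeeds if the version hasn’t moved since the record was read. The sketch below uses a plain dictionary as the store and invented names, not any particular ORM’s API, just to show the shape of the problem and the check.

```python
# Optimistic locking sketch: detect, rather than silently lose,
# a conflicting concurrent write.

class StaleWriteError(Exception):
    pass

store = {"order-7": {"status": "new", "version": 1}}

def update(record_id, changes, expected_version):
    record = store[record_id]
    if record["version"] != expected_version:
        # Someone else wrote since this caller read the record.
        raise StaleWriteError(f"{record_id} changed underneath you")
    record.update(changes)
    record["version"] += 1

# User A and User B both read the record at version 1.
update("order-7", {"status": "paid"}, expected_version=1)       # A's write lands
try:
    update("order-7", {"status": "cancelled"}, expected_version=1)  # B is stale
except StaleWriteError:
    print("conflict detected; B must re-read and retry")
```

Note this only *detects* the conflict; deciding what to do with B’s change is still a business-rule question, which is exactly why these problems have no clear path of resolution.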

A smaller complexity is reporting. Every business process needs reporting, but it’s never the first thing you write; it’s always the last. The performance needs of your operational system always come first, and optimizing your relational data, indexes, etc. for daily operations means it’s never optimized for reporting.

Principles That Prevent Problems

Keep the core data your business rules rely on in your language’s native types (for me, in Python, that means lists, dictionaries, tuples, strings, ints, and Decimals). Doing this means you can store it in all the various places you’ll need to. If you have to put it in a relational database, you can store it as a JSON-encoded string, which you can then parse with standard libraries back into your language’s native types. Or, if your database supports native JSON storage, even easier. If you have a document store available, it will go there without a problem as well. Try to avoid storing your base data set as a set of related records in a database. That forces all of your main business rules to run through your ORM layer code, which in a complex system is very limiting, as I outlined above.
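A sketch of that round trip, using the standard library’s `json` module. One wrinkle worth knowing: `Decimal` isn’t JSON-native, so it has to be encoded (here as a tagged string) and restored on load, and tuples come back as lists. The record shape and field names are illustrative.

```python
import json
from decimal import Decimal

order = {
    "id": 42,
    "lines": [("widget", 3), ("gadget", 1)],
    "total": Decimal("19.99"),
}

def encode(obj):
    # json.dumps calls this for types it can't serialize natively.
    if isinstance(obj, Decimal):
        return {"__decimal__": str(obj)}
    raise TypeError(obj)

def decode(obj):
    # json.loads calls this for every decoded JSON object.
    if "__decimal__" in obj:
        return Decimal(obj["__decimal__"])
    return obj

stored = json.dumps(order, default=encode)   # goes into a TEXT/JSON column
loaded = json.loads(stored, object_hook=decode)

print(loaded["total"] == Decimal("19.99"))   # True: no float rounding
print(loaded["lines"])                       # [['widget', 3], ['gadget', 1]]
```

Because the data set survives this trip intact, the same dict can be handed to a relational column, a document store, or a message queue without rewriting the business rules for each backend.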

When you have recorded all inputs and operations and still have the original data set, you can always replay those operations. So if a serious business rule changes, you can easily run simulations to see what the outcome of that change will be. And you don’t have to write extra or repeated code or migrations to change your data: you’ve already written it, in one place, and tested it. This architecture pattern allows you to write more functional code, which is easier to write, read, and test. It also allows you to move easily towards micro-services and distributed servers when your load demands it.
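The replay idea can be made concrete with a small sketch. The fee rules and recorded amounts below are invented for illustration; the point is that the same replay loop runs production, and also runs a proposed rule change over the real historical inputs to preview its effect.

```python
from decimal import Decimal

# Inputs captured from the live system (illustrative values).
recorded_inputs = [Decimal("100"), Decimal("250"), Decimal("80")]

def current_fee(amount):
    # The rule in production today: a flat 2%.
    return amount * Decimal("0.02")

def proposed_fee(amount):
    # A proposed change: flat fee under 100, percentage at or above.
    return Decimal("1.50") if amount < 100 else amount * Decimal("0.02")

def replay(inputs, rule):
    # One replay function serves production, testing, and simulation.
    return sum(rule(a) for a in inputs)

print(replay(recorded_inputs, current_fee))   # Decimal('8.60')
print(replay(recorded_inputs, proposed_fee))  # Decimal('8.50')
```

Because `replay` takes the rule as a plain function, the simulation needed no migration, no second code path, and no copy of the data.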

Creating a second system for reporting, into which data is ported (either in batches or in real time), is a necessary step. It also creates a good pattern: if your reporting data is ever found to be wrong or buggy, you can always regenerate it, because its existence is based on a record kept elsewhere.
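A minimal sketch of that regeneration property, with invented event data: the reporting table is a pure function of the source-of-truth events, so it can be dropped and rebuilt at any time rather than patched in place.

```python
# Source of truth: the recorded events (illustrative values).
events = [
    {"day": "2023-02-20", "amount": 10},
    {"day": "2023-02-20", "amount": 5},
    {"day": "2023-02-21", "amount": 7},
]

def rebuild_daily_totals(events):
    # Derive the reporting table from scratch; safe to rerun after
    # a bug fix, since it never depends on its own previous output.
    totals = {}
    for e in events:
        totals[e["day"]] = totals.get(e["day"], 0) + e["amount"]
    return totals

report = rebuild_daily_totals(events)
print(report)  # {'2023-02-20': 15, '2023-02-21': 7}
```

A buggy aggregation here costs you a rebuild, not a data-recovery exercise, and the reporting schema is free to be optimized for queries rather than for daily operations.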