mevans: Problems with Storing Tree Data Structures in a Relational Database

Lets say we have data which lends itself to being stored as a tree. There are about nodes including the root, children and leaves or so. Not bad. When I arrived on the scene, the data was being stored as a flat data structure, which gave me the willies. Here is the table schema:

 
CREATE TABLE OrgChart(
  Business nvarchar(255),
  OperatingSegment nvarchar(255),
  ReportingSegment nvarchar(255),
  Company nvarchar(255)
)

My knee-jerk problems complaints against storing the data like this:

No constraints! - Anything can be entered for any of the fields so we might end up with "texas" and "TX" for state.
Data is repeated all over the place! - Why would I want to list "United States" for each record?
No primary key!? - You are searching across nvarchars for each query?!
What if I want to be able to persist more data about one of the properties, such as add a description field for a given reporting segment. If I were to follow the same pattern, I would just add another column to the OrgChart table named ReportingSegmentDescription. You can just use you imagination on how disgusting that would be.
The data is LENDING ITSELF TO BEING STORED AS A TREE. It is a hierarchy for pity's sake. Fix it! Fix it!

So. Long story short, I created a much more elegant solution that I thought would solve all these problems and give us the flexibility to expand the data in every direction! Observe:

 
CREATE TABLE OrganizationType(
 Id int not null identity primary key,
 Name varchar(255) not null
 Description varchar(255) not null
)

CREATE TABLE Organization(
 Id int not null identity primary key ,
 Name varchar(255) not null,
 Description varchar(255)
 ParentId int,
 OrganizationTypeId int not null,
 CONSTRAINT FK_OrgType FOREIGN KEY OrganizationTypeId REFERENCES OrganizationType(Id)
)

As you can see this more or less solves all of our problems, doesn't it? We can do all sorts of things with the data in this tree form that would have been highly impractical before, right? This is better, right!? RIGHT!? Sadly, no. Not right. You can imagine my disappointment when I experienced how dreadfully slow reading the tree structure was. Think about it, if I want to read the entire tree using NHibernate, I have to recursively execute queries against the database for each node. Where the original less-elegant data structure was able to pull down the data and build a tree virtually instantly, the new more elegant solution took a good five seconds. FIVE SECONDS. Now, there are likely several things I could do to improve performance such as:

Don't use NHibernate.
Create a Stored Procedure that utilizes a CTE to recursively perform the tree traversal in SQL But.. think about it, if I do this, what data is the sproc going to return? THE FLAT DATA TABLE WE ORIGINALLY HAD...

The reality is that while trees are incredible and give us great power, relational database were NOT designed to work with or store them. Relational database are based in set theory. They deal in collections. They are designed for large flat data structures. That's the point. If you need to store hierarchical data, FLATTEN IT. That is what the database is for.

Now, we are left with a dissatisfaction in the following two options:
1) An ugly flat table that is crying out for elegance and flexibility, or An elegant, but terribly non-performant tree structure.
2) Are these our only options?

Well, yes. And no. Why not use both?

Some possible options:

Store the data as a tree as we have discussed. Do inserts directly to the Organization table and taylor a stored procedure with a CTE to your reading-needs (such as reading the entire tree).
Keep both schemas (with a few modifications) and use SQL triggers to sync them. Perform inserts and updates against the tree schema and perform reads against the flat structure.

I love data structures and making this mistake was indeed a humbling experience. I hate wasting time, but sometimes that is the cost of (re)learning something. Here are some lessons learned from this experience:

Again: Relational DBs are built for flat data structures. Work with the database, not against it.
Just because the current solution doesn't seem elegant (to you) doesn't mean it needs to be changed. Remember the simple axiom: "If it ain't broke, don't fix it".
Our flat table solved the original problem of storing the data and allowing it to be read back quickly. This is 90% of what was needed. Just because something could possibly be useful or because it would be "so much better" is never a good enough reason to initiate a refactor. Don't feel the need to retro-actively over-architect.

This post on stack overflow was very insightful and thought provoking: http://stackoverflow.com/questions/3362669/what-is-the-best-way-to-store-a-tree-structure-in-a-relational-db

Happy Coding!

mevans

Pages

Sunday, September 8, 2013

Problems with Storing Tree Data Structures in a Relational Database

No comments:

Post a Comment

Blog Archive