Anthony T. Goh, Software Developer.

View Original

Breaking away from SQL when first using DynamoDB

DynamoDB is a NoSQL Database that has some very interesting applications. It is a database - in the sense that it is used to store data - and it is accessed through queries that can pick through the thousands or hundreds of thousands of data points to return the information you need.

However, do not mistake DynamoDB as a Relational Database like an SQL database. If you treat DynamoDB tables like a simple Key-Column SQL database you will not only be loosing out on a vast majority of the advantages that Dynamo offers, but also grossly misusing the system and potentially run the costs up for your product and slow down its response times.

If you’ve ever had any classes or done any significant work with SQL databases, it can be very easy to get tripped up and try and force Dynamo into the same patterns - which is entirely the wrong direction to go.

DynamoDB and an RDS differences

SQL Databases were designed for one thing: Efficient storage. In the early days of databasing and the internet, storage was expensive. Speed was not an issue - if you went from 3 days to get a response via fax (or even longer via mail) going to 5-10 minutes was blazingly fast in comparison. In contrast storage was expensive - harddrives were small and to store hundreds of thousands of entries it was important that it was very efficient. SQL accomplishes this very well. But anyone whose ever had to write a complex SQL query on an extremely large database knows it can take forever to bring back results.

Today, storage is cheap. You can get petabytes of storage for what amounts to pennies. Speed is what matters. No one wants to wait 10 minutes for their account data to load on their app. No one wants to wait an hour to get results. They want that information available as soon as they click the button. DynamoDb aims to be fast - very fast - but it can only do so if the data is stored in ways that take advantage of its power.

Scaling

Everyone talks about we need our applications to be able to Scale. Big word, big connotations. It used to be that Scaling was about being able to go from a handful of users to thousands or more and handle the stress: Stress in this case being ability to hold all that data and to handle that number of requests. SQL Databases scale Vertically to handle the data influx - they stay as one server, one connection point, one processor iterating over the data to find what you need. When storage was expensive, when additional servers was expensive, this was fine.

Now, we need to be able to scale to handle millions of simultaneous connections and do so fast. A single server can’t handle this, so we have to scale Horizontally. That is to say more servers that can simultaneously accept connections and process the same data and return the same results, all the way pretending to be a single sever. SQL databases do not handle this process, sometimes known as creating Shards, well. Custom applications and scripts have been written all over the place in order to replicate data changes among multiple SQL servers so that all the data stays up to date - but there is always lag time, and always a risk of information mismatch.

This is where DynamoDB shines - it was designed from the ground up to Shard automatically and with precision. It scales horizontally with amazing efficiency and does so without overly difficult procedures to prevent data mismatch.

Looking Up Data: Query vs Scan

In SQL there is only the Query. The query can directly look up an entry select select [specific key] from [table] or you can dynamically search the records for some qualifications select [some qualifications] from [table] where [filter qualifications]

In DynamoDB there are two types of look up. Query and Scan.

Query acts much like a typical SQL query on the surface - Assuming you have a proper schema in your Dynamo you can do effectively the same look ups as above. But in order to do so you need to know at least the Partition Key and some information about the Sort Key; more on this in the next section. For now, understand that if you have a proper schema in your Dynamo, and you know at least part of the information you are looking up, you can do very quick queries -

Scan is much like it sounds: It scans the entire DynamoDB item by item ( an O(n) function, where n is the size of the table ) and it can be filtered. When many devs first come to Dynamo from RDS they use Scan for everything - if you have a less than optimal structure to your Dynamo Schema then Scan is often the only way to find information you need. If you don’t have a Sort Key for instance.

Scan effectively mimics an RDS -in speed as well. Meaning every time you have to use Scan you are not taking advantage of the speed that Queries can offer. It is sometimes necessary, but its costs and time can really add up.

To really understand the differences here, we have to understand the Keys structure for DynamoDB and how it differs from SQL.

DynamoDB Structures

Keys

In an SQL DB you have one Primary Key. This is how the hash map will identify your data set for that particular item. It must be unique to the table. You link this key to other columns in other tables (the Relationship of an RDS) and there are rules for propagating data changes in one table to others. Because it is a Hash Map it is instantaneously (O(1) look-up-able with the exact key value.

In DynamoDB there are two keys, that together form the unique key for the item. The Partition Key (PK) and the Sort Key (SK). If you have just a PK and no SK, you have a Primary Key - that functions in effect exactly like a Primary Key from SQL, just without the relational aspect.

Put both keys in your table and you have a Composite Primary Key - just like you would in SQL. Each Composite Primary Key (combination of the PK and SK) must be unique; Dynamo’s default is that if there is an attempt to add an item with a duplicate Primary Key (PK+SK) it will simply update the existing one. (Useful - you don’t have to search to see if it exists before updating, just add)

Partition Key

The Partition Key (PK) handles data much like a Primary Key in SQL. It is a hash map of the item, and is stored in an attribute called the Hash Attribute (more on attributes in the next section) and querying against it is a O(1) action - if you don’t have a sort key. (it's O(n) if you have a sort key, but your n should be relatively small so in reality its no real different than O(1) ) - If you have a Sort Key then querying against the PK and SK combined is exactly the same - it is an near instantaneous look up and returns that data instantly.

However, you cannot query dynamic operations against just a PK - you cannot do searches or filters directly against the data in the table in a Query if you do not have a sort key - no greater than or less than or the like. Only direct equals. The Hash Attribute is not sorted in the internal data structures of Dynamo - each hash attribute is put into the system as it comes in - as such if all you have is a PK and no Sort Key, you have to use a Scan and iterate across the entire table to discover the information you want. O(n) isn’t a horrible operation, but the more Scans you have to do the more intensive it becomes.

Sort Key

The Sort Key (SK) is stored in an attribute of the item known as the Range Attribute. This attribute is special in that when Dynamo stores the information, it keeps it sorted in its internal storage - allowing quick and fast look ups. Sort Keys cannot exist without a Partition Key (more on that soon), but when paired with a PK they allow fast and efficient dynamic look ups. You can use Greater Than, Less Than, Equals (and combinations therein) as well as Between and Begins With.

Attributes

Attributes are often seen as Columns by former SQL devs. And they serve a similar function - they hold the data for a given key. The difference is that Attributes are item based (excepting the Hash and Range Attributes) and not table - meaning every item in the table can have completely different attributes and there is no issue.

The only two attributes that are required are the Hash Attribute and the Range Attribute (assuming your table as a Sort Key - other than that all attributes can be dynamic to the item involved.

Each Attribute can have a data type for more efficient storage - such as String, Number, Boolean, Binary, Map, and others. Even the Hash and Range Attributes can have these data types, though in reality given the more advanced ways of setting up a DynamoDB Schema, string is the most useful.

Indexes

The last obvious structure of a DynamoDB an Index. Indexes are ways to organize data using alternate PK and SKs - allowing the quick look up of the same data in different ways. There are Local Indexes and Global - Local exist simultaneously in your table but cannot be accessed in a different region. Global are accessible outside your region but are in effect shards - and this can have some data lag issues.

You cannot set up a Local Index after the table has been created - they are part of the base of the table and must be created at creation time. As such, it is actually pretty useless to create Local Index for most situations - and Global will do the same job and give you regional Active-Active Failover strategy implementation.

Index’s are best to create for data that needs to be looked up often but is cross item attributes. You can consider Index’s as a version of Relationships, allowing fast look ups akin to an SQL query like select [col].[table1], [col2].[table2], [col3].[table2] from [table1], [table2] - the key is that you have to know your query before hand to set up your Global Indexes correctly.

How DynamoDB Stores Data for super fast access

Recall that one of the main selling points of DynamoDB is super fast access. Recall also that this is achieved through Horizontal Scaling, or creating Shards of the table for simultaneous access. A Shard is a direct copy of the table on a different instance - accessed in parallel to other shards. DynamoDB automatically Shards for every request - this is the same functionality AWS uses in other systems for dynamic scale growth to meet demand - create another instance of the same structure and have it be called at the same time so no request has to wait in line.

Sharding has a major downside in most situations - Data has to propagate through the shards, and this means there can be lag time, where two simultaneous requests may update the same data with different items. Dynamo has a lot of protections against this in the background, but with a proper Schema and taking it into account in your architecture it can be a relatively minor problem.

Partitions for Super Fast Access

When you have hundreds of thousands of items in your DynamoDB, even a sorted Range Attribute takes longer and longer to process. This is where the Partition Key comes into play. It is specifically called Partition Key because it is literally a partition of data. Because any item in the table can have the same PK as long as it has a unique SK attached to it, DynamoDB can organize data into chunks - partitions. When making a Query against a particular PK and against its SK DynamoDB only calls up the data in that partion - a much much smaller data set.

Changing How You View DynamoDB

If you’ve been following up to this point, you’ll notice that not once has a Row been mentioned. This is intentional - Rows are key to SQL databases, and how you store data - but each row has to have the same structure as the one above and below it. Items in DynamoDB do not (attributes instead of columns).

Here is the tricky part. Divorcing your preconceived notions of SQL and Relational Databases from Dynamo. Throw it out (for now). It is not a database of tables. It does not have relations. It does not have rows. It is not constrained to these forms.

The first step is to not realize that PK and SK do not always have to be the same category of data for every item. They are not columns. They are attribute. If you generically name your PK and SK … PK/SK then you an use whatever value you want in each attribute. Employee Name / Department / Office Location all can be in PKs. Sure, this means you have multiple items with the same information in them - but remember Dynamo is not about saving space. It is about speed of access. You may want to look up all the employees at a given office - and so you use the PK of Office and you get all the items with that PK and unique SKs. Maybe you want to do it by department - there will be duplicate information from the same look up as office space, but its a different set - and its fast.

There are two ways to think about DynamoDB that may help divorce the SQL notions.

File Structure

Think of DynamoDB not as a database, but a Directory Structure. Each item is a Document and the attributes are data inside that document. The Sort Key is the name on the document, and the Partition Key is the directory in which it is found.

Directories…

Documents…

… and Information.

If you have a DynamoDB Table of Employees you may have a Partition Key of Department, and a sort key of Employee Name. You can translate the above metaphor into the following design for DynamoDB Schema. In this case, you will be able to quickly look up all the employees of a given Department just by retrieving all the items of the PK of the department name. Just like if you opened the Sales Directory you would have a quick view of all the employees, and could go into individual ones for more detail.

Advanced Schema’s and Utilizing Global Index

Twisting the Department/Employee schema you can take it a more dynamic step. You have a Partition Key of Employee (either name or ID number). Inside you have several different Sort Keys. You have one item with a PK of Name and the value in the SK is different. One document in this PK has a sort key of Department, and its attributes include the department name and any relevant information there in.

Another SK may have a value of Compensation, and its attributes are different from the Department document. If you continue to imagine these as different documents in the directory titled ‘[Employee Name]’ you can start to see the best practices for DynamoDB Schemas.

An example of what DynamoDB may look like in this way of organizing…

If you take this a step further - you have both a set of documents under Department PK and Employee Name PK (all in the same Table!) - in the Department documents there is an SK whose value is an employees name. In the Employee documents, you have an SK whose value is Department, and it has a department name as an attribute. You also have an SK with a value of Compensation and an attribute of their pay.

Expanding on what the Dynamo Table may look like…

Let us say you want to look up all the departments who has an employee in a paygrade between 60k-70k a year. Well, you could query each department individually and sort scan each employee in that department for their compensation, then filter out the ones you want and store that data in a local variable until done searching every department. You could do the same by scanning every Employee-Compensation and filtering out the ones you need.

This is slow, and with more records becomes even more slow, and risks multiple reads (ie $$$) for the same search.

Or, you create an Index. If you have an Index that has a Partition Key taking from the Compensation and a Sort Key of Department, you can do a query against the PK of 60k-70k… and every single department with an employee with that compensation will immediately be returned as a list.

Global Index


You can obviously start to see some issues here, where if you have more than one employee in the same pay grade in the same department what do you do? You start adding tags to your keys. This also helps sort the items and allows it to be much more clear in human readable terms. If you use the Object Oriented Programming example, think of this as adding the Type to the values

Using Prefixes

Its important that when you do this you still have an attribute with the value after your qualifier so you can access it directly and use it in Indexes.

This example starts to break down at this point - its not the best of scenarios for use here. But it helps to illustrate some of the more advanced ways of setting up Dynamo Schemas.

Object Oriented Structure

The other way to think of it is as Data Objects. When you do a lot of OOP development, you eventually get to the point where you have so many objects that are related that you create another object to pass them around. This new meta object is just a collection of other objects so you only have to pass one parameter to your function instead of many. Its cleaner code and it makes organization much easier.

DynamoDB items can be thought of as Objects as well. The Primary Key is the name of the instantiation of the meta object, and each sort key is a different object attribute inside with its various attributes.

So to expand, if you have an Object called Animal: this has the type of animal (Dog Cat) and the vital statistics. You have another object called ‘Owner Info’ - with the animals owners information within it. You have several animal objects and at least one owner info object per account, and you want to be able to quickly pass them all together - so you create an Account object and add a list of Animals and Owner Information as attributes.

Structs and Meta objects as a way to think about Dynamo Structures

This translates into Dynamo very quickly: Each Object in the meta object is a Sort Key value, and each Meta Object instantiation is a Partition Key - With the above advanced example, we can see how we can have multiple of the same types of object in each Sort Key by using tags:

Class names become prefixes

You can then for instance search for all the animals on a given account by querying the PK ACCOUNT#[Number] and Sort Key that Begins With ANIMAL#

Key to Planning this Out:

It becomes obvious now that in order to have a good DynamoDB Table, you need to really plan out the possible situations you need to use it for. Adding new Keys or changing the schema is incredibly intensive when you hit a large number of entries.

The key is knowing what Information you have in your table, and how you need to access it. You have to think through as many of your potential access patterns before hand.

Amusingly, the most efficient way to do this still can reside in the Relationship Diagrams used to set up schema for SQL databases. You can look at the above example and immediately see One to Many relationships ( Department : Employee is a 1:N ), Many to One (Depending on the attributes, you could easily have Vacation Days, Compensation be Many to One with Employee.)

Once you have planned, researched, and understood the access patterns you will need for your data then you can try out a few Schema’s that give you more information.