Whenever we take on an assignment for any product, we need to select a database strategy. There are several factors to consider when deciding between a traditional RDBMS, a NoSQL database, or a schema-less database. There is also the question of whether we need a Big Data solution. Though a Big Data framework is also a NoSQL, schema-less solution, its approach to processing and storing data is quite different.
I would like to share our experiences and, along the way, examine how we reason when we face such a decision. Let me first recap a well-known computer science theorem: Brewer’s theorem, popularly known as the CAP theorem.
It states that it is impossible for a distributed data store to simultaneously provide more than two out of the following three guarantees:
1. Consistency — Every read receives the most recent write or an error
2. Availability — Every request receives a (non-error) response — without a guarantee that it contains the most recent write
3. Partition tolerance — The system continues to operate despite an arbitrary number of messages being dropped (or delayed) by the network between nodes
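To make the three guarantees concrete, here is a minimal Python sketch of two replicas that lose contact with each other. The names `Replica`, `TinyStore`, and the `mode` flag are hypothetical and invented for illustration; they do not come from any real database. It is a toy model under simplified assumptions (two nodes, synchronous replication when the network is healthy), not a real replication protocol.

```python
class Replica:
    """One node holding a copy of the data."""

    def __init__(self, name):
        self.name = name
        self.data = {}


class TinyStore:
    """Two replicas; `mode` picks what to give up while partitioned.

    mode="CP": refuse writes that cannot reach both replicas
               (stays consistent, but is not always available).
    mode="AP": accept writes on the reachable replica only
               (stays available, but replicas may disagree).
    """

    def __init__(self, mode):
        self.mode = mode
        self.a = Replica("a")
        self.b = Replica("b")
        self.partitioned = False  # True = messages between a and b are lost

    def write(self, replica, key, value):
        other = self.b if replica is self.a else self.a
        if self.partitioned:
            if self.mode == "CP":
                raise RuntimeError("unavailable: cannot reach peer replica")
            replica.data[key] = value  # AP: peer keeps the stale value
            return
        replica.data[key] = value
        other.data[key] = value  # healthy network: copy to the peer at once

    def read(self, replica, key):
        return replica.data.get(key)
```

With `mode="CP"` a write during a partition raises an error, so no reader ever sees conflicting values. With `mode="AP"` the write succeeds, but a read from the other replica returns the older value until the partition heals.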
Let us try to understand these terms in plain, everyday language.
To start with
Assume that your organization’s operations team runs a Slack-bot-like service for internal operations maintenance. Its main job is to save a personalized data point whenever a request comes in from an employee, and to notify employees when an action item is being missed.
Consider the following scenario:
1. Notify me to fill in JIRA if I have not updated it by 6 PM
2. Remind me to commit the source code to Bitbucket every day
3. Warn me if I go 3 days without raising a pull request to the production branch
Assume that you started such operations with one Slack bot, and that behind it there is a person who does the actual work of collecting the data and sending reminders to individuals.
The service seems to be consistent and available when you are a 5–7 member company.
When do you scale up?
As the company grows, you realize that it is hard to stay available and keep providing consistent updates with a single person, so you decide to increase the number of people serving requests.
There comes a question of “Bad Service”
You have now introduced a distributed service. There is a chance of miscommunication between the people receiving the requests, and as a result employees might not get the consistent data they expect.
While fixing consistency issue…
The bad service can be addressed by passing every update on to the other person whenever input is received from an employee. The person who receives a request can do the computation, but he or she must update the other person before the employee can be served consistently. Now, what happens if one of them is not available on a particular day? They cannot update each other.
What could be your smart solution?
The people behind your bot are asked to coordinate among themselves. If another member (person X) is not available when some data is updated, the member who took the update posts a Slack message to person X describing it. The next day, when person X resumes work, he or she first applies the updates from Slack and only then starts to serve requests.
However, what if person X fails to apply those messages when resuming work? This is a fine example of why partition tolerance matters: the messages between the two people play the role of the network, and the service has to keep working correctly even when some of them are lost or delayed.
Hope we can now better understand the issues and terms.
We consider the following factors when deciding on a database solution and architecture.
Traditional RDBMS solutions provide consistency and availability. RDBMS is modeled around schemas and tables to organize and structure data in a combination of columns and rows.
You might need to write large, complex search queries, but developers are usually happy to do that in a familiar, well-understood system. Consistency and availability are big gains.
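As a small illustration of the schema-and-tables model, here is a sketch using Python’s built-in `sqlite3` module. The `employees` and `reminders` tables and the sample rows are hypothetical, loosely based on the reminder-bot example earlier in this article; they are not taken from any of our production systems.

```python
import sqlite3

# In-memory database, so the example leaves nothing behind on disk.
conn = sqlite3.connect(":memory:")

# The schema is declared up front: every row must fit these columns.
conn.execute("CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE reminders ("
    " id INTEGER PRIMARY KEY,"
    " employee_id INTEGER NOT NULL REFERENCES employees(id),"
    " message TEXT NOT NULL)"
)

# `with conn` wraps the inserts in one transaction: all commit or none do.
with conn:
    conn.execute("INSERT INTO employees (id, name) VALUES (1, 'Asha')")
    conn.execute("INSERT INTO employees (id, name) VALUES (2, 'Ravi')")
    conn.executemany(
        "INSERT INTO reminders (employee_id, message) VALUES (?, ?)",
        [
            (1, "Fill JIRA by 6 PM"),
            (1, "Commit code daily"),
            (2, "Raise pull request"),
        ],
    )

# Data lives in two tables, so reading it back needs a join.
rows = conn.execute(
    "SELECT e.name, r.message FROM employees e"
    " JOIN reminders r ON r.employee_id = e.id"
    " WHERE e.name = ? ORDER BY r.id",
    ("Asha",),
).fetchall()
```

The join is where the "big search queries" come from: related data is normalized into separate tables and recombined at query time.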
These systems often run into bottlenecks with scalability and data replication when handling large amounts of data/data sets.
We designed our “laundry basket” solution on an RDBMS. Although it is a complex system with a franchise model, we had clear inputs and understood the system very well.
Next comes the choice among NoSQL databases. Here we can choose between a key-value database, which offers simplicity, and a document database, which offers flexibility.
The choice between key-value and document databases comes down to your data and application needs. If you usually retrieve data by key or ID value and don’t need to support complex queries, a key-value database is a good option. If you need some search capability beyond key lookup, a key-value database that supports searching may be sufficient. If you have different types of entities and need complex querying, choose a document database.
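The difference in access patterns can be sketched with plain Python data structures. This is only an analogy: `kv_store`, `kv_get`, `documents`, and `find` are hypothetical names, and the linear scan in `find` stands in for the indexed queries a real document database would run.

```python
# Key-value style: values are opaque blobs, retrieved only by their key.
kv_store = {
    "user:1": b'{"name": "Asha", "team": "ops"}',
    "user:2": b'{"name": "Ravi", "team": "dev"}',
}


def kv_get(key):
    """Return the stored blob for `key`, or None if the key is absent."""
    return kv_store.get(key)


# Document style: the store understands the fields inside each document,
# and different entity types can live side by side.
documents = [
    {"_id": 1, "type": "user", "name": "Asha", "team": "ops"},
    {"_id": 2, "type": "user", "name": "Ravi", "team": "dev"},
    {"_id": 3, "type": "reminder", "owner": 1, "text": "Fill JIRA by 6 PM"},
]


def find(collection, **criteria):
    """Return documents whose fields equal every given criterion."""
    return [
        doc
        for doc in collection
        if all(doc.get(field) == value for field, value in criteria.items())
    ]
```

For example, `kv_get("user:1")` is all the key-value store can answer, while `find(documents, type="user", team="ops")` filters on fields inside the documents.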
A document database stores data in collections, where the different fields of a document can be fetched in a single query, whereas an RDBMS that spreads the same data across multiple tables may need joins or multiple queries. In MongoDB, for example, the data is stored as Binary JSON (BSON) and is readily available for ad-hoc queries, indexing, replication, and MapReduce aggregation.
Database Sharding can also be applied to allow distribution across multiple systems for horizontal scalability as needed.
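The idea behind sharding can be sketched as hash-based routing: each key is hashed, and the hash decides which shard holds it. The names `shard_for` and `ShardedStore` below are hypothetical, and in-memory dictionaries stand in for separate database servers.

```python
import hashlib


def shard_for(key, num_shards):
    """Map a string key to a shard index using a stable hash.

    hashlib is used instead of the built-in hash(), because hash() of a
    string is salted per process and would route keys differently on
    every run.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


class ShardedStore:
    """Spreads keys across several independent shards (dicts here)."""

    def __init__(self, num_shards):
        self.shards = [{} for _ in range(num_shards)]

    def put(self, key, value):
        self.shards[shard_for(key, len(self.shards))][key] = value

    def get(self, key):
        return self.shards[shard_for(key, len(self.shards))].get(key)
```

One design caveat: with simple modulo routing, changing the number of shards moves most keys to a different shard. Real systems typically avoid this with range-based partitioning or consistent hashing.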
MongoDB and DynamoDB are built with a slightly different focus. Both scale across multiple nodes easily, but MongoDB favors consistency while DynamoDB favors availability. In the MongoDB replication model, a group of database nodes host the same data set and are defined as a replica set. One of the nodes in the set acts as the primary and the others are secondary nodes. The primary node is used for all write operations and, by default, all read operations as well. This means that, as long as reads go to the primary, clients see a consistent view of the data. Replication is used to provide redundancy, so that the system can recover from hardware failure or service interruptions.
DynamoDB reads are eventually consistent by default, although strongly consistent reads can be requested. In an eventually consistent system, a write can be acknowledged without waiting for every node to come into agreement. The system incrementally copies changes between nodes, meaning that they will eventually be in sync.
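Eventual consistency can be illustrated with a toy model in which a write is acknowledged after reaching one replica and is copied to the others later. The class `EventuallyConsistentStore` and its methods are hypothetical; this is not how DynamoDB is implemented internally, only a sketch of the behavior a client can observe.

```python
from collections import deque


class EventuallyConsistentStore:
    """Replicas that converge only after queued updates are delivered."""

    def __init__(self, num_replicas):
        self.replicas = [{} for _ in range(num_replicas)]
        self.pending = deque()  # replication messages not yet delivered

    def write(self, replica_id, key, value):
        """Apply the write on one replica and queue copies for the rest."""
        self.replicas[replica_id][key] = value
        for other in range(len(self.replicas)):
            if other != replica_id:
                self.pending.append((other, key, value))

    def read(self, replica_id, key):
        """Read from a single replica; the result may be stale."""
        return self.replicas[replica_id].get(key)

    def sync(self):
        """Deliver every queued message, in the order it was queued."""
        while self.pending:
            target, key, value = self.pending.popleft()
            self.replicas[target][key] = value
```

Between `write` and `sync`, a read from another replica still returns the old value (or nothing); after `sync`, all replicas agree. The sketch deliberately ignores conflict resolution for concurrent writes to different replicas, which real systems have to handle.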
If your solution involves payments, trading in financial data, or online commerce, you might want to ensure that all clients have a consistent view of the data. In such cases, MongoDB is preferred over DynamoDB. In other solutions, the high availability offered by DynamoDB might be more important, even if some clients are seeing data that is slightly out of date.
When to go for Big Data — Hadoop
Hadoop is a framework comprising a software ecosystem. The primary components of Hadoop are the Hadoop Distributed File System (HDFS) and MapReduce. Secondary components are a collection of other Apache products, including Hive (for querying data), Pig (for analyzing large data sets), HBase (a column-oriented database), Oozie (for scheduling Hadoop jobs), Sqoop (for interfacing with other systems such as BI, analytics, or RDBMS), and Flume (for aggregating and preprocessing data). Like MongoDB, Hadoop’s HBase database accomplishes horizontal scalability through database sharding.
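The MapReduce programming model itself is simple enough to sketch in a few lines of plain Python. The example below is a single-process word count; the function names are hypothetical and nothing here uses the Hadoop API. In a real cluster the map and reduce functions run in parallel on many machines, and the framework performs the shuffle step over the network.

```python
from collections import defaultdict


def map_phase(record):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values of one key into a single result."""
    return key, sum(values)


def word_count(records):
    """Run map, shuffle, and reduce over an iterable of text lines."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
```

Because each map call sees only one record and each reduce call sees only one key, the work can be split across machines without the functions needing to know about each other.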
One of our real estate solutions is intended to bring transparency to consumers and agents, giving them the data and tools they need to navigate the real estate marketplace. Intelligent, data-driven decisions were key to this.
The real estate solution covers homes for sale, homes for rent, and homes not currently on the market. The database is built from a range of disparate sources, incorporating streams of county records, tax data, listings of homes for sale, listings of rental properties, and mortgage information. The transactions, listings, and attributes are overlaid with a nested geographic hierarchy from neighborhoods and census tracts to cities and states.
With so much data to store and process, we adopted Hadoop. We are using multiple clusters to deliver personalized recommendations to customers based on sophisticated data science models that analyze more than a terabyte of data daily. That data is drawn from new listings, public records, and user behavior, all of which is then cross-referenced with search criteria to alert customers quickly when new properties become available.
Deciding on the best option for a solution takes careful consideration and research. If you are looking for a solution for long-running batch analytics while still being able to query data as needed, then Hadoop is the better option. If you need to process real-time data with low latency, MongoDB on its own is a good fit.