In previous posts on this blog we have covered powerful and “flashy” concepts such as Industry 4.0, artificial intelligence, machine learning, predictive analytics, etc. I use the term flashy because of how overused these words are in the market. That overuse has helped these technologies become widely adopted and, to some extent, a necessity, accelerating their development, which is positive for the innovation ecosystem overall. However, this acceleration can be a double-edged sword: the wearing out of the words can lead to underestimating the complexity of implementing the technology properly, and steer a very promising project toward failure. Today, we will be talking about big data architecture.
We have talked about industries and companies collecting their data and making intelligent decisions based on the insights gained through analysis. When the amount of data you are collecting and processing is large enough, we begin to talk about big data, and hence “big data architecture” refers to the plan, the path, or the blueprint for a big data solution. This solution should follow the business needs of the organization, always considering the context, the technological requirements, and many other aspects that should be reviewed as a preliminary step.
But how big does my dataset have to be before I should consider deploying a big data solution? As always, it depends: not only on the size of your data, but on what operations you want to perform, whether your solution needs to scale, and how well it should do so. At Uptime Analytics we handle a fair amount of data, mostly sensor data from industrial equipment, and although there are many variables stored every second, we store and process at most 2 GB of data per day per data source. Common sense would say we don't need a big data architecture; however, the operations we perform on the data and, most importantly, the need to scale to hundreds and thousands of data sources say otherwise.
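To make that scaling argument concrete, here is a rough back-of-the-envelope sketch in Python; the ~2 GB per day per source figure is ours, while the data-source counts are purely illustrative:

```python
# Rough sizing sketch: how daily and yearly volume grows with the number of
# data sources. The 2 GB/day/source figure is from our case; source counts
# are hypothetical.
GB_PER_SOURCE_PER_DAY = 2

for n_sources in (1, 10, 100, 1_000):
    daily_gb = n_sources * GB_PER_SOURCE_PER_DAY
    yearly_tb = daily_gb * 365 / 1_000
    print(f"{n_sources:>5} sources -> {daily_gb:>5} GB/day, ~{yearly_tb:7.1f} TB/year")
```

At a single source the numbers look harmless; at a thousand sources you are already in the hundreds of terabytes per year, which is where the architectural decisions start to matter.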
It would be very time consuming and difficult for a new organization to develop and implement a big data application from scratch; however, the National Institute of Standards and Technology (NIST) has created an excellent big data reference architecture. This architecture aims to be an open standard that every organization can use for its benefit. It greatly improves the understanding of the various components, processes, and systems present in a big data solution, and it provides a common language for all the interested parties. The NIST big data reference architecture is vendor-neutral, so it can be used by any organization that wants to develop a big data architecture; it is shown below.
It represents a big data solution composed of five functional roles (colored boxes) connected by interoperability interfaces. These roles are embedded in two layers (fabrics) that are always present (Security and Privacy, Management) and thus affect all the roles and sub-roles.
The first functional role is the “System Orchestrator”, defined as the automated arrangement and coordination of computer services and middleware. Orchestration ensures that all the components in the infrastructure work together in a timely manner; to do this, it makes use of workflows and automation. The “Data Provider” is responsible for introducing new data into the system so it can be accessed and transformed. This can be any type of data, such as transactions, sensor readings, worksheets, etc.
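As a loose illustration of what “workflows and automation” means here, the toy Python sketch below just runs a fixed sequence of steps in order; the step names are made up, and real orchestrators (Apache Airflow, for example) add scheduling, retries, and monitoring on top of this basic idea:

```python
# Toy orchestration: an ordered workflow of steps that the "orchestrator"
# runs automatically. Step names and bodies are placeholders.
def ingest():
    print("pulling new data from the data provider")

def transform():
    print("preparing and curating the data")

def publish():
    print("exposing results to the data consumer")

WORKFLOW = [ingest, transform, publish]   # the orchestrator's plan

def run(workflow):
    for step in workflow:                 # coordination: each service runs in order
        step()

run(WORKFLOW)
```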
The third functional role is the “Big Data Application Provider”. This role contains the business logic and the core functionality needed to transform the data into the desired results and, as can be seen in the figure, it encapsulates five sub-roles:
- Collection
- Preparation/Curation
- Analytics
- Visualization
- Access
Although the names of the sub-roles make them self-explanatory, their implementation can vary greatly depending on the business needs.
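As a rough illustration of how these sub-roles chain together, here is a toy Python pipeline; the data and the logic in each stage are made up, and only the stage names come from the reference architecture:

```python
# Toy pipeline mapping the five Application Provider sub-roles to functions.
raw_readings = [{"sensor": "pump_1", "temp_c": 71.3},
                {"sensor": "pump_1", "temp_c": None}]   # hypothetical sensor data

def collect(source):                 # Collection: bring the data into the system
    return list(source)

def prepare(records):                # Preparation/Curation: drop incomplete rows
    return [r for r in records if r["temp_c"] is not None]

def analyze(records):                # Analytics: a trivial aggregate as a stand-in
    temps = [r["temp_c"] for r in records]
    return {"avg_temp_c": sum(temps) / len(temps)}

def visualize(result):               # Visualization: here, just a formatted string
    return f"Average temperature: {result['avg_temp_c']:.1f} C"

def access(view):                    # Access: expose the result to a consumer
    print(view)

access(visualize(analyze(prepare(collect(raw_readings)))))
```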
The “Big Data Framework Provider” holds the resources the application provider uses and supplies the core infrastructure of the solution; it could be interpreted as the “hardware” of the solution. As can be seen, it encloses three sub-roles:
- Infrastructure
- Platforms
- Processing
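To make the layering a bit more tangible, the snippet below pairs each sub-role with generic examples of what typically lives there; these are illustrative categories, not technologies prescribed by NIST:

```python
# Illustrative mapping of the Framework Provider sub-roles to the kind of
# components usually found in each layer (examples, not prescriptions).
FRAMEWORK_PROVIDER = {
    "infrastructure": ["on-premise servers", "cloud VMs", "object storage"],
    "platforms":      ["distributed file systems", "message queues", "databases"],
    "processing":     ["batch engines", "stream processors", "query engines"],
}

for sub_role, examples in FRAMEWORK_PROVIDER.items():
    print(f"{sub_role:>15}: {', '.join(examples)}")
```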
The last role is the “Data Consumer”, which can be viewed as an end user or as another system that feeds on the responses and results the solution provides. This role normally searches for and retrieves the information, downloads it, visualizes it, makes reports, and analyzes it. We have only scratched the surface of the reference architecture; for more accurate and in-depth definitions, I highly recommend visiting the NIST documentation (https://bigdatawg.nist.gov/V3_output_docs.php).
Although we have this excellent reference architecture, there are always challenges and issues that can get out of hand if not addressed properly and promptly. Here are the most relevant ones we have encountered:
- Multiple data sources: At first it might seem very easy to build, test, and troubleshoot a data source, but as the number of clients and data sources grows it only becomes harder and harder. Standardization on the data-source side is key (see the sketch after this list).
- Data quality: More often than not, clients will assume the data they are storing is pristine and ready to use; most of the time it is not. It is crucial to detect these issues in the early stages and implement a standardized health check (also sketched after this list).
- Scaling: If we do not design our architecture to scale, it might hold up and perform with a few clients or data sources, but when it starts to receive more data the performance will degrade, and the costs of supporting the infrastructure might exceed the profits.
- Security: Easier said than done. Securing the data at every stage of the process while following data governance agreements, especially when you pull data from such a variety of components, is an arduous task.
- Choosing the right tool: In an era with a wide variety of technologies, each one tailored to do a specific task better, you need to pick the right one, and the one that brings the least uncertainty.
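As promised in the first two items, here is a minimal Python sketch of what standardizing a data source and its health check could look like; the record fields, thresholds, and sample values are all hypothetical:

```python
# Minimal sketch: a standardized reading format plus a shared health check
# applied to every data source. Fields and thresholds are made up.
from dataclasses import dataclass

@dataclass
class Reading:
    source_id: str
    timestamp: float        # unix seconds
    value: float

def health_check(readings, min_value=-50.0, max_value=200.0):
    """Return a list of issues found in a batch of standardized readings."""
    issues = []
    if not readings:
        issues.append("empty batch")
    last_ts = None
    for r in readings:
        if not (min_value <= r.value <= max_value):
            issues.append(f"{r.source_id}: value {r.value} out of range")
        if last_ts is not None and r.timestamp < last_ts:
            issues.append(f"{r.source_id}: timestamps out of order")
        last_ts = r.timestamp
    return issues

batch = [Reading("pump_1", 1_700_000_000, 71.3),
         Reading("pump_1", 1_700_000_001, 999.0)]   # deliberately bad value
print(health_check(batch))   # -> flags the out-of-range reading
```

Because every source lands in the same format, the same check (and the same troubleshooting playbook) can be reused as the number of clients grows.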
As we have seen throughout the article, designing and deploying a proper big data architecture is a laborious task; however, it bears good results. The first is the peace of mind that, no matter the amount of data, your solution will hold up and perform. The second is that it can bring savings to the table: at first it may look like an investment, but once you reach a certain volume of data you will see how the costs of the alternatives increase. Finally, encouraging common standards and providing a common language gives you consistent methods to solve comparable problems or to expand your solution.
At Uptime Analytics we strive to stay competitive and up to date with the different technologies that emerge in this innovative ecosystem, testing them thoroughly and keeping and supporting the ones that bring value to our customers, and hence to us, so the industry keeps moving forward.
Written by: Camilo Albarracin, Software Engineer at Uptime Analytics