Solve Big Data Evolution Challenges with Blockchain-ified Data
Big Data means Big Business
The early and mid-2000s saw an exponential surge in data production. This rise, driven by advances in wireless technologies, automation and internet speeds, gave birth to a tremendously fast-growing technology vertical: big data.
Big data, due to its sheer size, had huge computation needs, which gave us ZooKeeper, Bigtable, GFS and Cassandra, and then Hadoop, arguably the most famous open-source project after Tomcat.
By the early 2010s we had a plethora of startups like MongoDB, DataStax, Elasticsearch, Cloudera, Hortonworks, etc., that adopted the open-source tools and made them enterprise ready. In the past seven years, the big data ecosystem and its adopters have matured and slowly moved towards much more advanced use cases around Deep Learning and AI.
As a matter of fact, it is safe to say that big data and the open-source community are transforming every enterprise backend on this planet. For example, we have helped many enterprises transform “data warehouses” of relational databases into “data lakes.” Every enterprise is somewhere on this transformation path already. Just to put this in perspective, more than $100 billion is being spent on this transformational journey, including hardware, software and services.
But not everything is merry. There are some serious and complex challenges around the next phase of the evolution of big data. These include control, authenticity and monetization, to name a few:
First Challenge: Control over the infrastructure itself in a multi-tenant environment:
- Let’s assume a scenario in which you are a multinational enterprise, which produces data in multiple geographies, governed by multiple regulations. Now, how would you share data around the planet? It’s nearly unavoidable that you will have multiple copies of your data, so how would you know which one is most up to date, most complete, most clean? How do you achieve simple procedural and role reconciliation, with a system admin role at each of your regional offices?
- Now think a little broader: you’re an industry consortium. The challenge above becomes even more difficult, especially because many of the consortium participants are competing with each other.
- Broader still is the challenge of creating one single shared source of truth for all the data being generated, and exposing Data as a Utility, similar to the Internet.
Second Challenge: How well can you trust the data itself?
- Imagine that you collected some data. How would you prove you were the originator? Or what if you got data from “others”? How would you know its true origin?
- The one universal truth, to which every system admin will testify, is “CRASH”. In the current technical landscape, crashes and malicious behavior are not only frequent and normal; they are inevitable. So what? It means that you will have machine crashes, programming errors, hacks, etc. In fact, a new term came into existence with the evolution of IoT systems: Zombie IoT toasters, which, without your noticing, keep pumping garbage into your systems. That means that all of the fancy modeling your highly skilled data scientist ran on the GPUs just produced a bunch of garbage.
Third Challenge: What’s the value of your data, and how do you monetize it?
- So you produced a bunch of data, which is of immense importance for a lot of valuable work in your industry, and everyone wants to use it. Great, right? Of course, but take a step back and consider how you transfer the rights of the data, or buy the rights from others?
- The point above leads us to a larger, more universal question: the dream of having a Universal Data Marketplace.
So how do we address the challenges above?
Big data is currently closed and restricted, with many access, authenticity and monetization challenges. But wait: as the saying often attributed to Plato goes, “Necessity is the mother of invention.” The above necessity gave birth (kind of) to a new tool for big data: blockchain technology.
While a full discussion of blockchain is out of scope for this blog, I will give a quick introduction. Blockchain recently saw a huge upsurge in popularity because of Bitcoin, which is built on it. Technically, you can call a blockchain a database, but it is not a normal database; it is the next generation of information decentralization. I would not hesitate to call it the only database that you could call Blue Ocean, with beneficial attributes such as decentralized and shared control, immutability and audit trails, and native assets and exchanges.
Just for the sake of terminology, a blue ocean means a new, uncontested market space that makes existing competitors irrelevant and creates new consumer value, often while decreasing costs.
However, blockchains have terrible scalability and don’t even have a query language. But even then, the blue ocean benefits have proved enough to capture global imagination.
The good news is, it’s relatively easy to marry Hadoop’s scalability and blockchains to create something which can be loosely called a blockchain database, which provides the best of both worlds.
A simple NoSQL database like MongoDB can very easily enable queryability and schematic representation for blockchains. This unlocks highly interesting potential applications in big data. Examples include shared control over infrastructure, audit trails on data, and even the possibility of a universal data exchange.
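To make the idea concrete, here is a minimal sketch of documents that are both hash-linked (the blockchain part) and queryable by field (the database part). It uses an in-memory Python list as a stand-in for a real document store such as MongoDB; the class name `ChainedStore`, its methods and the sample sensor fields are all hypothetical, chosen only for illustration.

```python
import hashlib
import json


def block_hash(body: dict) -> str:
    """Deterministic SHA-256 over the canonical JSON form of a block body."""
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class ChainedStore:
    """Toy append-only document store (hypothetical stand-in for a database node)."""

    GENESIS = "0" * 64  # assumed placeholder hash for the first block

    def __init__(self):
        self.blocks = []

    def append(self, payload: dict) -> dict:
        """Add a document, linking it to the hash of the previous block."""
        prev = self.blocks[-1]["hash"] if self.blocks else self.GENESIS
        body = {"index": len(self.blocks), "prev_hash": prev, "payload": payload}
        block = dict(body, hash=block_hash(body))
        self.blocks.append(block)
        return block

    def find(self, **criteria) -> list:
        """Return blocks whose payload matches every key/value given."""
        return [
            b for b in self.blocks
            if all(b["payload"].get(k) == v for k, v in criteria.items())
        ]

    def is_intact(self) -> bool:
        """Recompute every hash and link; any edit to history breaks the chain."""
        prev = self.GENESIS
        for b in self.blocks:
            body = {"index": b["index"], "prev_hash": b["prev_hash"],
                    "payload": b["payload"]}
            if b["prev_hash"] != prev or b["hash"] != block_hash(body):
                return False
            prev = b["hash"]
        return True


store = ChainedStore()
store.append({"sensor": "s1", "region": "EU", "reading": 21.5})
store.append({"sensor": "s2", "region": "US", "reading": 19.0})
store.append({"sensor": "s3", "region": "EU", "reading": 22.1})

eu_blocks = store.find(region="EU")   # query by field, like a document database
chain_ok = store.is_intact()          # verify the hash links, like a blockchain
```

A real blockchain database adds consensus across nodes and signed transactions on top of this; the sketch only shows why a document model and a hash chain fit together naturally.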
The figures below show how blockchain changes the game and how decentralization materializes.
The figure above shows that as we move from left to right, the components of the blockchain architecture become more prevalent and thus open up the system, while on the left side we still have heavy application silos. We see that the processing layer has changed from directly hardware-dependent to much more abstract and distributed (e.g. Ethereum). Similarly, the file system evolves from local namespaces to a global-namespace file system such as IPFS (InterPlanetary File System).
Below is how blockchain complements and compares with a scalable database.
Addressing First Challenge: Shared Infrastructure Control
Being a blockchain database means the control of the database infrastructure is shared across the entities, whether within enterprises, consortiums or even across the planet. Cool, isn’t it?
How? A blockchain-based database architecture is decentralized, which means that its control can be shared.
This sharing can happen in one of several ways:
- Across offices within an enterprise, which solves our geographically spread locations issue.
- Across companies within an ecosystem, i.e. between companies (even competitors).
- On a planetary level: shared control of an open, public big data database, which realizes the dream of Data as a Utility. There is already an application of this concept, called IPDB (Interplanetary Database).
Since this is a big data database, unlike traditional blockchains it can hold the data itself. As a database fills up, we can keep adding more databases, connected using an open protocol called Interledger.
Benefits to this approach:
Problem. The multinational entity problem
Solution. Each regional office, with its own sysadmin, controls one node of the overall database, so they control the database collectively. The decentralized nature means that if a sysadmin or two goes rogue, or a regional office is hacked, the data is still protected (assuming encryption is in place).
Problem. The Industry consortium problem
Solution. Similar to the above: each company controls one node in the chain.
Problem. Single shared truth of data problem
Solution. A universally distributed interledger of databases, where essentially everyone can be part of the universal data marketplace. IPDB is one example.
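How exactly the nodes agree is not covered in this blog, but one common way to get the "a rogue sysadmin or two cannot hurt you" property is a quorum rule: a value counts only if enough nodes hold the same copy. The sketch below assumes that rule; the function `quorum_read`, the office names and the record strings are hypothetical.

```python
import hashlib
from collections import Counter


def fingerprint(record: str) -> str:
    """SHA-256 fingerprint used to compare copies held by different nodes."""
    return hashlib.sha256(record.encode("utf-8")).hexdigest()


def quorum_read(copies: dict, quorum: int):
    """Return the record held by at least `quorum` nodes, or None if no copy
    reaches the quorum. `copies` maps a node name to that node's record."""
    if not copies:
        return None
    counts = Counter(fingerprint(record) for record in copies.values())
    digest, votes = counts.most_common(1)[0]
    if votes < quorum:
        return None
    # Hand back one of the copies that matches the winning fingerprint.
    return next(r for r in copies.values() if fingerprint(r) == digest)


# Five regional offices; two of them hold tampered copies.
copies = {
    "london": "order-42:shipped",
    "mumbai": "order-42:shipped",
    "austin": "order-42:shipped",
    "tokyo": "order-42:cancelled",
    "sydney": "order-42:refunded",
}
agreed = quorum_read(copies, quorum=3)
```

With a quorum of three out of five, the two tampered copies are simply outvoted. Production systems use far more careful consensus protocols, so treat this as an illustration of the principle only.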
Addressing Second Challenge: Audit Trails on Data
Blockchain allows us to have detailed and definitive audit trails on data, to improve the trustworthiness of the connected nodes. The same principle applies to your data residing in a blockchain database.
How? Let’s consider a simple data pipeline: IoT sensors -> Kinesis/Event hub + Stream Analytics -> Isilon Storage (HDFS) -> Spark Data prep -> Spark Modelling -> MongoDB Storage -> Tableau. This is shown in figure 3 below.
So here is what happens:
- Input data is always timestamped.
- A transaction is created as a JSON doc, which includes:
  - Hash of the data
  - Hashes of each row and column (depending on the data)
  - Any metadata present, or that one wants to include
- Every transaction is cryptographically signed with your own private key.
- The transaction is written to the database (a Mongo node in our case), which automatically timestamps it. This gives us immutable evidence that you had access to that data at that point in time, which others can cryptographically verify using your public key. Howzat!
So the output of each pipeline step is timestamped in the three steps mentioned above.
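The three steps above (build the JSON transaction, sign it, write it) can be sketched as follows. This is a minimal illustration, not the actual pipeline code: the function names and field names are hypothetical, the "database" is a plain Python list, and the signer is an HMAC stand-in so the example runs on the standard library alone. An HMAC is NOT a public-key signature; a real deployment would use an asymmetric scheme (for example Ed25519) so that others can verify with your public key.

```python
import hashlib
import hmac
import json
import time


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def encode(value) -> bytes:
    """Canonical JSON bytes, so the same value always hashes the same way."""
    return json.dumps(value, sort_keys=True, separators=(",", ":")).encode("utf-8")


def build_transaction(rows, metadata, sign):
    """Steps 1 and 2: hash the data, its rows and its columns, then sign."""
    columns = [list(col) for col in zip(*rows)]
    body = {
        "data_hash": sha256_hex(encode(rows)),
        "row_hashes": [sha256_hex(encode(r)) for r in rows],
        "column_hashes": [sha256_hex(encode(c)) for c in columns],
        "metadata": metadata,
    }
    return dict(body, signature=sign(encode(body)))


def write_transaction(db, tx, now=None):
    """Step 3: the database stamps the transaction with the write time."""
    stored = dict(tx, timestamp=time.time() if now is None else now)
    db.append(stored)
    return stored


# Stand-in signer (assumption): HMAC with a shared secret, NOT a private key.
SECRET = b"demo-key"


def demo_sign(message: bytes) -> str:
    return hmac.new(SECRET, message, hashlib.sha256).hexdigest()


rows = [
    ["s1", "2017-06-01T00:00:00Z", "21.5"],
    ["s2", "2017-06-01T00:00:05Z", "19.0"],
]
db = []
tx = build_transaction(rows, {"step": "ingest"}, demo_sign)
stored = write_transaction(db, tx, now=1000.0)
```

Running the same three steps on the output of every pipeline stage gives each stage its own signed, timestamped fingerprint.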
Benefits to this approach:
- How would you prove you were the originator?
  - People who have your public key can see that you cryptographically signed it.
- How do you know the true origin of data you received from others?
  - The same way as above: you have their public key.
- Crashes, malicious behavior, glitches, etc.
  - This is my favorite. You can run periodic processes to rehash the data stored in the pipeline. If the new hash doesn’t match the previous hash, something is wrong.
- Zombie IoT toasters: Garbage In, Garbage Out.
  - First, make sure IoT devices are properly secured. Each IoT device must have a way to sign the data (there is a straightforward technique to do that) and the private keys must be consistent. Then, just like before, one can verify the data (the rehashing method).
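The periodic rehashing check described above can be sketched in a few lines. The artifact names and the dictionaries standing in for the storage layer and the recorded transactions are hypothetical; the point is only the compare-recomputed-hash-to-recorded-hash loop.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def find_corrupted(stored_data: dict, recorded_hashes: dict) -> list:
    """Return the names of artifacts whose current hash no longer matches
    the hash recorded when they were written (or which have gone missing)."""
    bad = []
    for name, expected in recorded_hashes.items():
        current = stored_data.get(name)
        if current is None or sha256_hex(current) != expected:
            bad.append(name)
    return sorted(bad)


# Output of each pipeline step, as stored (hypothetical names and contents).
stored_data = {
    "raw": b"s1,21.5\ns2,19.0\n",
    "prepared": b"s1,21.5\ns2,19.0\n",
    "model_input": b"21.5,19.0\n",
}

# Hashes recorded at write time, e.g. taken from the signed transactions.
recorded_hashes = {name: sha256_hex(blob) for name, blob in stored_data.items()}

clean_run = find_corrupted(stored_data, recorded_hashes)

# Later, a crash or a bad actor silently changes one artifact.
stored_data["prepared"] = b"s1,21.5\ns2,99.9\n"
dirty_run = find_corrupted(stored_data, recorded_hashes)
```

The first run reports nothing; the second run flags only the artifact that changed, which tells you where in the pipeline to start looking.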
The figure above shows a generic, agile and modular architecture which increases the security, reliability and trustworthiness of an IoT architecture.
- The most important piece of this architecture is the blockchain database. It provides an open, decentralized protocol to access various underlying assets.
- Boomi provides platform-independent, multi-layered and dynamic integration, which enables ingesting data in multiple versions, velocities, etc.
- The storage layer is a decoupled Isilon storage, which can be easily expanded and distributed.
- All compute is virtualized, providing much higher utilization and efficiency.
Last but not least, my favorite topic…
Addressing Third Challenge: Universal Data Exchange
This novel method enables us to build a universal data marketplace, which helps break down the walls of data silos. A scalable, consistent blockchain database architecture speaking the protocol of IP rights transfer enables data to be bought and sold as an asset. The concept is new and amazingly exciting. Not only is it a universal marketplace, it is also collectively controlled by a public ecosystem. People and corporations can build data exchanges on top of this universal marketplace to suit their needs.
How? Here is how it works.
We need to build a global public blockchain database, which currently exists as an open non-profit initiative in the form of IPDB. Remember, this can be securely implemented by consortiums as well (separated from public infrastructure). There can even be multiple networks, with assets flowing between them using the Interledger protocol.
The asset is the data rights, backed by copyright law. The asset lives on the blockchain database. Remember, you own the private key for the data you own. You can transfer your rights to the data, or to slices of it, using an open blockchain IP protocol. Some open-source protocols are available, for example Coala IP.
Benefits to this approach:
- How do you transfer rights and buy rights?
  - Just create a transaction in which you transfer the rights to another person, say by speaking the language of Coala IP. Sign it and write it to the database. Clean and clear.
The figure below shows how a consortium, or similarly a multinational, works when it comes to enabling a universal data marketplace (The Dream).
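A rights transfer like the one described above can be sketched as a chain of transactions, where each transfer points at the transaction it spends. The field names here are hypothetical and are not the Coala IP vocabulary, signing is omitted to keep the example on the standard library, and the "database" is a plain list.

```python
import hashlib
import json


def tx_id(body: dict) -> str:
    """Transaction ID: SHA-256 of the canonical JSON body."""
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


def create_asset(owner: str, data_hash: str) -> dict:
    """Register the rights to a dataset (identified by its hash) to an owner."""
    body = {"type": "CREATE", "asset": data_hash, "owner": owner, "prev": None}
    return dict(body, id=tx_id(body))


def transfer(prev_tx: dict, new_owner: str) -> dict:
    """Transfer the rights; the new transaction links back to the previous one."""
    body = {"type": "TRANSFER", "asset": prev_tx["asset"],
            "owner": new_owner, "prev": prev_tx["id"]}
    return dict(body, id=tx_id(body))


def current_owner(ledger: list, asset: str):
    """Walk the asset's transactions in order and return the latest owner.
    Raises ValueError if a transaction does not link to its predecessor."""
    owner, last_id = None, None
    for tx in ledger:
        if tx["asset"] != asset:
            continue
        if tx["prev"] != last_id:
            raise ValueError("broken transfer chain")
        owner, last_id = tx["owner"], tx["id"]
    return owner


data_hash = hashlib.sha256(b"sensor readings, June 2017").hexdigest()
ledger = []
created = create_asset("alice", data_hash)
ledger.append(created)
sold = transfer(created, "bob")
ledger.append(sold)

owner_now = current_owner(ledger, data_hash)
```

In a real system each transaction would also carry the current owner's signature, so that only the holder of the private key can create a valid transfer.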
Definitely, big data means big bucks. Blockchainified big data helps resolve three of its outstanding challenges: how to control the data, how to trust the data, and how to build universal exchanges.
We at Dell EMC are firmly dedicated to making data much more accessible, making it open, and enabling large enterprises to realize the real potential of data as an economic asset. So, Chains of Big Data is turning out to be the best approach towards creating an open, connected, trustworthy and universal data marketplace, which in turn enables better collaboration and more value out of the humongous amounts of data being produced every second.