By Bob Marcus and James Kobielus
Big data is at the heart of many cloud services deployments. Still, it's far too soon to proclaim cloud the be-all and end-all platform for big data analytics. For enterprise IT professionals, big-data cloud approaches must prove their value in a competitive landscape where packaged software, appliances, expert integrated systems, and other deployment models have clear advantages in various circumstances.
But, however you scope "cloud," the approach is gaining traction in more industries, applications, and deployment roles. If you're a user organization considering deploying big data in a cloud environment, you need a roadmap for getting there and a reference model for understanding the roles of disparate technologies, tools, and applications in bringing this vision to fruition.
For end-user organizations, it will be critical to have a clear understanding of big-data application requirements, tool capabilities, and best practices for implementation as private and public cloud deployments become more prevalent. Key tool capabilities are in areas such as security, privacy, performance, usability, interoperability, and portability.
The key steps in an enterprise roadmap for big data in the cloud are as follows:
Step One: Identify priority applications
You must first identify big-data applications where cloud approaches have an advantage over alternatives such as software on commodity hardware or pre-integrated appliances. Here are just a few scenarios where cloud-based approaches might be suitable for your big-data analytics requirements:
- Enterprise applications that are already hosted in the cloud
- High-volume external data sources that require considerable pre-processing
- Elastic provisioning of very large but short-lived analytic sandboxes
- Queryable off-premises archive
Step Two: Align approaches
Big data and cloud, each in their separate spheres, are sprawling paradigms. Getting your arms around their intersection--big data in the cloud--involves both understanding the core architectural principles of each approach and identifying the synergies among them.
Their respective core principles are:
- Big data: If we define big data as any approach for maximizing the linear scalability, deployment and execution flexibility, and cost-effectiveness of analytic data platforms, then the following should be in our short list: massively parallel processing, in-database execution, storage optimization, data virtualization, and mixed-workload management.
- Cloud computing: Here's where we'll cite the US National Institute of Standards and Technology (NIST), which defines cloud computing as "a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources that can be rapidly provisioned and released with minimal management effort or service provider interaction."
When you stack those principles next to each other, you quickly realize that many big data analytics platform providers have long been focused on bringing cloud-ready architectures into the heart of their offerings. Likewise, cloud platform providers have integrated ever larger data sets and more advanced analytics into their various offerings. So the synergies and overlaps among the distinct approaches are already baked into their DNA, as it were, and are supported, to varying degrees, in the respective platforms.
But that's not necessarily the same as explicitly doing "big data in the cloud" as a coherent architectural approach. A more unified framework would need to functionally align the service layers of a big-data-analytics reference framework (i.e., data, metadata, models, rules, etc.) with the various layers (i.e., application as a service, infrastructure as a service, platform as a service, etc.) of a full-blown cloud computing framework.
Step Three: Integrate disparate service layers into a coherent reference model
Big data in the cloud has so many potential functional service layers sprawling across so many nodes, clusters, and tiers that it's easy to feel overwhelmed. Where do you start and how do you architect it all according to a coherent reference model?
Big data is increasingly living inside comprehensive cloud architectures. Big-data clouds increasingly span federated private and public deployments, encompass at-rest and in-motion data, incorporate a growing footprint of in-memory and flash storage, and are available on demand to all applications.
Smarter big-data consolidation will require that you preserve and even expand, where necessary, the distributed, multi-tier, heterogeneous, and agile nature of your big-data environment by implementing a virtualization capability in middleware, in the access layer, and in the management infrastructure. Virtualization provides a unified interface to disparate resources, so that you can change, scale, and evolve the back end without breaking interoperability with tools and applications. One of the key enablers of big-data virtualization is the semantic abstraction layer, which enables simplified access to the disparate schemas of the RDBMS, Hadoop, NoSQL, columnar, and other data management platforms that constitute a logically unified data/analytic resource.
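To make the idea of a semantic abstraction layer a bit more concrete, here is a minimal sketch in Python (all class and method names are hypothetical, not drawn from any particular virtualization product) of how a thin layer might route one logical query to whichever back end hosts the data, letting adapters handle each platform's native query syntax:

```python
from abc import ABC, abstractmethod

# Hypothetical sketch of a semantic abstraction layer: each adapter translates a
# backend-neutral query description into its platform's native query form.

class DataSourceAdapter(ABC):
    @abstractmethod
    def translate(self, entity: str, fields: list, filters: dict) -> str:
        """Render a backend-neutral query as a platform-native query string."""

class RelationalAdapter(DataSourceAdapter):
    def translate(self, entity, fields, filters):
        where = " AND ".join(f"{k} = '{v}'" for k, v in filters.items())
        return f"SELECT {', '.join(fields)} FROM {entity} WHERE {where}"

class HiveAdapter(DataSourceAdapter):
    def translate(self, entity, fields, filters):
        # HiveQL looks much like SQL; a real adapter would also handle
        # partitions, file formats, and other Hadoop-specific details.
        where = " AND ".join(f"{k} = '{v}'" for k, v in filters.items())
        return f"SELECT {', '.join(fields)} FROM {entity} WHERE {where}"

class DocumentStoreAdapter(DataSourceAdapter):
    def translate(self, entity, fields, filters):
        projection = {f: 1 for f in fields}
        return f"db.{entity}.find({filters}, {projection})"

class VirtualDataLayer:
    """Routes a single logical query to whichever back end hosts the entity."""

    def __init__(self):
        self.catalog = {}  # logical entity name -> adapter for the platform that stores it

    def register(self, entity, adapter):
        self.catalog[entity] = adapter

    def query(self, entity, fields, filters):
        return self.catalog[entity].translate(entity, fields, filters)

# Usage: tools issue the same logical query regardless of where the data lives.
layer = VirtualDataLayer()
layer.register("customers", RelationalAdapter())
layer.register("clickstream", HiveAdapter())
print(layer.query("customers", ["name", "region"], {"segment": "enterprise"}))
print(layer.query("clickstream", ["url", "ts"], {"session_id": "abc123"}))
```

The point of the sketch is the indirection: tools and applications bind to the logical catalog, so a data set can move from a relational warehouse to Hadoop without changing the consuming code.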
Comprehensive cloud analytics will be the big-data paradigm for the new era. Key to this emerging approach will be the evolution of the big-data fabric, through integration of analytic and transactional workloads, into a new type of distributed application server. MPP EDW, Hadoop, NoSQL, in-memory, stream computing, and other big data platforms are stepping stones that will fit into this increasingly heterogeneous business platform.
Generally, we can organize these technologies into distinct layers within a big-data cloud reference model:
- Cloud computing platforms
- Data storage platforms
- Database, document, and file stores
- Transaction processing platforms
- Stream computing and complex event processing platforms
- Analytics development and modeling tools
- Data discovery, integration, and governance tools
- Deployment, management, and optimization tools
- Virtualization, abstraction, and federation tools
- BI, search, query, reporting, and visualization tools
Step Four: Govern a sprawling business resource
Big data in the cloud is a complex, tricky thing to manage as a unified business resource. It demands unified governance and security. The more complex and heterogeneous your big-data cloud, the more difficult it is to maintain tight control.
You can govern petabytes of data in a coherent manner. There is no inherent trade-off between the volume of a data set and the quality of the data maintained within it. In most organizations, data quality problems originate in the source transactional systems--whether those are your customer relationship management (CRM) system, general ledger application, or whatever. These systems usually hold data in the terabyte range.
Just as important is governance of MapReduce and other big-data analytic models that execute in your cloud. Big data applications ride on a never-ending stream of new statistical, predictive, segmentation, behavioral, and other advanced analytic models. As you ramp up your data scientist teams and give them more powerful modeling tools, you will soon be swamped with models. Big data analytics demands governance--let's face it, some level of repeatable bureaucracy--if it is to produce artifacts that are deployed into production applications.
To avoid fostering an unmanageable glut of statistical models, your big data sandboxing environment should support strong life-cycle governance of models and other artifacts developed by your data scientists, regardless of what tools they use. Key governance features include check-in/check-out, change tracking, version control, and collaborative development and validation. Your sandboxing platforms and modeling tools should ensure consistent governance, automation, and managed collaboration across the multidisciplinary teams working on your most challenging big data analytics initiatives.
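As a rough illustration of those life-cycle features, the sketch below (hypothetical names; not the API of any specific modeling tool or registry) shows the kind of bookkeeping involved: every check-in creates an immutable, hashed version with a change note, and a model is eligible for production only after an explicit approval:

```python
import hashlib
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

# Hypothetical sketch of model life-cycle governance: versioned check-in,
# change tracking, and an approval gate before a model can reach production.

@dataclass
class ModelVersion:
    version: int
    author: str
    change_note: str
    artifact_hash: str
    checked_in_at: str
    approved_by: Optional[str] = None  # set once a reviewer signs off

@dataclass
class ModelRecord:
    name: str
    versions: list = field(default_factory=list)

class ModelRegistry:
    def __init__(self):
        self.models = {}  # model name -> ModelRecord

    def check_in(self, name, artifact_bytes, author, change_note):
        # Every check-in creates an immutable, hashed version with a change note.
        record = self.models.setdefault(name, ModelRecord(name))
        version = ModelVersion(
            version=len(record.versions) + 1,
            author=author,
            change_note=change_note,
            artifact_hash=hashlib.sha256(artifact_bytes).hexdigest(),
            checked_in_at=datetime.now(timezone.utc).isoformat(),
        )
        record.versions.append(version)
        return version

    def approve(self, name, version, reviewer):
        # A second pair of eyes validates the model before deployment.
        self.models[name].versions[version - 1].approved_by = reviewer

    def latest_approved(self, name):
        # Production applications should bind only to approved versions.
        return next((v for v in reversed(self.models[name].versions) if v.approved_by), None)

# Usage: a data scientist checks in a revised churn model; it is deployable only once approved.
registry = ModelRegistry()
v = registry.check_in("churn_model", b"<serialized model bytes>", "analyst_a", "retrained on Q1 data")
registry.approve("churn_model", v.version, reviewer="lead_b")
print(registry.latest_approved("churn_model"))
```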
On Monday, March 18, the newly formed Big Data Working Group of the Cloud Standards Customer Council (CSCC) will host a forum focused on defining a practical roadmap for the future of big data in the cloud. The forum, which will take place in Reston, Virginia, will draw on real-world case-study experiences and bring together representatives from user organizations in diverse industries who will share best practices for harmonizing and managing investments in big data and cloud technologies. The CSCC Big Data Working Group's goal is to help end-user organizations align their cloud-based big-data initiatives with key strategic imperatives in a fast-changing competitive and operational environment.
The March 18 forum will include ample time for interactive discussion among participants. The event will also represent the kickoff of the CSCC Big Data Working Group. Attendees and other interested parties are encouraged to join the WG in order to contribute to the ongoing work of this body in influencing industry development of a vision and roadmap for best practices and standards for deploying and managing big data in the cloud.
To take part in this discussion, please join us at the Cloud Standards Customer Council meeting on "Big Data in the Cloud: Preparing for the Future," March 18 in Reston, Virginia. The event registration link can be found online here.