DIY vs. fully integrated Hadoop – What’s best for your organization?


By Adam Lorant, VP Product Management & Solutions, PHEMI

The trade-offs of building it yourself vs. going with a pre-integrated, out-of-the-box platform

You don’t have to look far to see the amazing things that organizations are doing with big data technology: pulling information from past transactions, social media and other sources to develop 360-degree views of their customers. Analyzing thousands of processes to identify causes of breakdowns and inefficiencies. Bringing together disparate data sources to uncover connections that were never recognized before.

All of these innovations, and many more, are possible when you can collect information from across your organization and apply data science to it. But if you’re ready to make the jump to big data, you face a stark choice: should you use a pre-integrated “out-of-the-box” platform? Or should you download open-source Hadoop software and build your own?

Which path is right for your organization? Let’s take a closer look.


Assembling puzzle pieces

First, know that if you go DIY, there are many different components you’ll need to integrate with stock Hadoop: Hive, YARN, MapReduce, and many more. (One of the leading Hadoop distributions includes 23 different software packages.) You’ll need to figure out which components—and which software versions—make sense for your deployment, and how to make them work together and with your environment.

That’s not a one-time job; all of those tools are constantly updated, so you’ll need to figure out how to support and maintain your solution on an ongoing basis. For these reasons, most organizations building their own platforms use third-party professional services to handle much of the heavy lifting.

So why choose the DIY path? You do end up with a solution that’s precisely tuned for what you want to do with it. Your IT department retains total control over the platform’s processes and capabilities. If you’re looking at a relatively small project (designed for a specific purpose, with specific data choices and interfaces) this can be a great choice. However, there can also be a downside to extensive customization: if you want to expand your platform in the future, it may be less flexible than a ready-made solution designed for multiple use cases.


Weighing costs

It can be tempting to assume that building your own platform, using off-the-shelf hardware and open-source software, is inherently less expensive than a pre-integrated solution. The numbers, however, don’t necessarily bear that out.

The sticker price of an integrated platform may be higher, but total cost of ownership over the life of the solution is likely to be comparable to, or even lower than, that of a DIY cluster. Consider: any big data platform will require the same compute power, storage, and infrastructure, so hardware costs are likely comparable. But if you’re going DIY, you should expect to spend several hundred thousand dollars on software, plus installation and ongoing support from third-party professional services, all of which comes included in a pre-integrated solution.

Cost differences can, however, become significant if you’re considering the cloud. A variety of pre-integrated solutions are now available as cloud-based services (or even hybrid services, where some data remains on premises). This model allows organizations to start adopting big data at a much lower upfront cost and much faster than building their own solution, or even than deploying a full-scale pre-integrated solution on premises.


Collecting and using data is not the same thing

It’s important to remember that data science takes more than just aggregating data in one place. There are many steps between collecting data and being able to use it.

Take a common example of extracting structured information from unstructured data, such as email. Here’s one way that could work: First, thousands of emails arrive in basic HTML. To extract meaningful insights, you now need to parse the documents, cleanse them, extract terms, define a meaningful vocabulary, etc.
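Here is a minimal sketch of what those steps might look like in plain Python. The sample email text, the crude tag stripping, and the tiny stop-word list are all invented for illustration; a real pipeline would use far richer parsing, language processing, and a curated vocabulary.

```python
# Sketch: strip HTML from an email body, cleanse the text, and tally candidate terms.
import re
from collections import Counter
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the visible text from a basic HTML email body."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


STOP_WORDS = {"the", "a", "an", "is", "to", "and", "of", "we", "you"}  # illustrative only


def extract_terms(html_body: str) -> Counter:
    parser = TextExtractor()
    parser.feed(html_body)
    text = " ".join(parser.chunks)

    # Cleanse: lowercase and keep only alphabetic tokens.
    words = re.findall(r"[a-z]+", text.lower())

    # Keep candidate terms; a real system would map these to a
    # controlled vocabulary rather than a simple stop-word filter.
    return Counter(w for w in words if w not in STOP_WORDS)


if __name__ == "__main__":
    sample = "<html><body><p>The shipment for order 1042 is delayed again.</p></body></html>"
    print(extract_terms(sample).most_common(5))
```

Even this toy version hints at how much work sits between “we have the emails” and “we can analyze them”—and every one of those steps has to be built, scheduled, and maintained.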

Out-of-the-box solutions often provide prebuilt tools that manage job scheduling and workflows alongside data collection, so your data arrives analytics-ready. A more general-purpose pre-built platform is also likely to be flexible—allowing your developers to write programs in the language of their choice and be confident that they’ll work on any data in the system. So it should be easy to create and continually update workflows around the data you’re collecting.
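As a rough sketch of what such a workflow amounts to, the snippet below chains a few made-up ingest, cleanse, and index steps with a plain Python loop; in practice a scheduler (Oozie, cron, or similar) would run and re-run these steps as new data arrives.

```python
# Sketch: a data-collection workflow expressed as an ordered list of steps.
def ingest(batch):
    """Pull a batch of raw documents from the source system (stubbed)."""
    return [doc.strip() for doc in batch]


def cleanse(docs):
    """Normalize the raw text before analysis."""
    return [doc.lower() for doc in docs]


def index(docs):
    """Make the cleansed documents queryable (stubbed as a print)."""
    for doc in docs:
        print("indexed:", doc)
    return docs


PIPELINE = [ingest, cleanse, index]


def run(batch):
    data = batch
    for step in PIPELINE:
        data = step(data)
    return data


if __name__ == "__main__":
    run(["  Order 1042 delayed  ", "  Invoice 7 paid  "])
```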

If you go DIY, make sure your infrastructure can handle all of the workflow processes around data collection, or that IT is willing to support them. And be sure to design your custom solution to be as open as possible so you’re not limiting your options in the future.


Going from lab to production

One of the bigger risks in DIY projects comes when it is time to move from the lab to production. Here’s what can happen: You set up a demo Hadoop environment to show what you can do with it. Everyone is impressed, and you get the green light to move forward. But then, when it’s time to put it into production, you face some uncomfortable questions from IT: How will this fit into our operational workflow? How will you secure access? Is the data encrypted at rest? How will this tie into our identity infrastructure?

Enterprise IT takes many things for granted—that any database platform will have encrypted storage, integration with Active Directory, rigorous audit logging, and a means to define fine-grained access control policies. If your solution hasn’t checked all those boxes—none of which were necessary in the lab—it’s not going anywhere near your production network.

Unfortunately, stock Hadoop doesn’t offer great answers to those questions. Even basic encryption and AD integration are complicated, and the default access control mechanisms are coarse-grained. There’s no mechanism to give different users different levels of access to the same data—for example, if your platform is serving customer service agents who need access to full records and analysts who are only authorized to see de-identified information.
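To make that last point concrete, here is a small sketch of what role-based views of a single record could look like. The roles, field names, and masking rules are invented for illustration; a production platform would enforce this in its policy layer, not in application code.

```python
# Sketch: serve a full record to one role and a de-identified view to another.
import hashlib

RECORD = {
    "customer_id": "C-10492",
    "name": "Jane Example",
    "email": "jane@example.com",
    "last_order_total": 182.40,
}

# Fields each role may see in the clear (hypothetical policy).
ROLE_CLEAR_FIELDS = {
    "customer_service": {"customer_id", "name", "email", "last_order_total"},
    "analyst": {"last_order_total"},
}


def pseudonymize(value: str) -> str:
    """Replace an identifier with a stable, non-reversible token."""
    return hashlib.sha256(value.encode()).hexdigest()[:10]


def view_for(role: str, record: dict) -> dict:
    clear = ROLE_CLEAR_FIELDS[role]
    view = {}
    for field, value in record.items():
        if field in clear:
            view[field] = value
        elif isinstance(value, str):
            view[field] = pseudonymize(value)  # de-identified stand-in
        # non-string fields outside the policy are simply omitted
    return view


if __name__ == "__main__":
    print(view_for("customer_service", RECORD))  # full record
    print(view_for("analyst", RECORD))           # de-identified view
```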

Any production-ready big data platform needs all of these capabilities. So again, it’s a matter of weighing customization against flexibility. If you go DIY, you should expect to need a significant integration effort. But you will end up with a solution built specifically for your existing security, authentication and policy infrastructure.

If you go with an out-of-the-box solution, you’re getting a platform that’s built from the ground up to meet enterprise security and privacy requirements, including policy-based access control, encryption and auditing out of the box. Some can even dynamically generate different views of data for different users, such as presenting full views of records to some users and de-identified versions to others, on the fly. Just know that you may have to adapt some of your internal processes around a prebuilt platform.

Ultimately, the big data path you choose comes down to understanding your organization. Maybe you have unique requirements that demand a custom solution. Maybe you’re addressing a limited set of questions, or have existing data collection processes and infrastructure that you don’t want to change. If so, a custom-built big data platform tailored precisely to your needs may be the best fit. But if that’s not the case—if big data is just one tool to support your core business strategy—a pre-integrated, enterprise-ready solution can offer a relatively fast, easy way to start unlocking the value of your data.

View the original article here.


Adam Lorant is VP of Product and Solutions and co-founder of PHEMI Systems, responsible for driving the product vision and strategy. He works closely with leading healthcare research organizations, healthcare providers, and payer organizations to help them define and implement their big data strategies.
