Maybe Don’t Try This at Home: The Pains of DIY Hadoop

So you want to get started with big data. Excellent! It’s easy to see why. From 360-degree customer views built from past transactions and social media feeds, to the ability to analyze thousands of processes and identify inefficiencies, to uncovering brand new connections by bringing together all your formerly disparate data sources, big data means big rewards for organizations across industries.

For these reasons, the why is obvious. It’s the how that often proves more elusive. How do you get started? How do you unlock all these big data benefits? How do you get the best data warehouse for you and your company? Should you purchase a pre-integrated “out of the box” solution, packaged and ready to go? Or should you build it on your own and opt for Do-It-Yourself, open-source Hadoop software?

The urge to DIY can be tempting. For one, you’ll end up with a tailored solution, made precisely for your needs. Plus, the steady advance of open-source big data software means that the tools you need to develop a custom data warehouse won’t cost you an arm and a leg.

That all sounds great, but DIY solutions aren’t so simple. You may need to ask a few more questions.

1. How are you going to put it together? What other components will you be integrating with Hadoop? YARN? Hive? MapReduce? Which software versions? How will they all work together? (Will they all work together?) And then, once you’ve picked out all the components, who will keep your solution running? Each tool is constantly being updated. You’ll need ongoing expertise to maintain your custom solution – expertise that doesn’t often come cheap.
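
To give a feel for that maintenance burden, here’s a minimal sketch in plain Python – the version numbers and the “verified” list are placeholders, not a real compatibility matrix – of the bookkeeping a DIY stack forces on you before you’ve processed a single byte of data:

```python
# Hypothetical bookkeeping: which component versions you've chosen, and which
# combinations someone on your team has actually verified end to end.
# The version numbers and the "verified" set are placeholders, not a real
# compatibility matrix -- the point is that you have to maintain one yourself.

chosen_versions = {
    "hadoop": "3.3.x",   # includes HDFS, YARN, MapReduce
    "hive": "3.1.x",
    "spark": "3.5.x",
}

# Pairs your team has tested together; everything else is a question mark.
verified = {
    (("hadoop", "3.3.x"), ("hive", "3.1.x")),
}

def unverified_pairs(versions, verified_set):
    """List component pairs nobody has confirmed work together."""
    names = sorted(versions)
    gaps = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            pair = ((a, versions[a]), (b, versions[b]))
            if pair not in verified_set and tuple(reversed(pair)) not in verified_set:
                gaps.append(pair)
    return gaps

for (a, va), (b, vb) in unverified_pairs(chosen_versions, verified):
    print(f"Unverified combination: {a} {va} + {b} {vb}")
```

Multiply that by every upgrade cycle for every component, and the ongoing cost of keeping a custom stack current starts to come into focus.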

2. Who is going to put it together? (And how much will they cost?) For exactly the reasons specified in #1, organizations often turn to third-party professional services to do the heavy lifting. As we’ve noted, expertise doesn’t often come cheap. These third-party engagements can become costly entanglements, reducing the cost savings you might expect from an open-source solution. You can try to hire the specialized skill sets you need yourself, but that’s not necessarily any less expensive – and, at that point, are you building a data warehouse or are you building your own big data start-up inside your organization?

3. How will you manage the workflow and actually use your data? Let’s say you’ve assembled all your parts. You’ve got a functional warehouse to aggregate your data. Now, you can collect your organization’s data – but can you use it?

Take the common example of extracting structured information from unstructured data, such as email. Thousands of messages arrive as raw HTML. To extract meaningful insights, you need to parse each document, cleanse it, extract terms, and so on.

A DIY solution means managing all of those steps on your own and scheduling the job workflow yourself, outside of your traditional data warehouse. That’s another ongoing, custom development engagement and one that will monopolize your organization’s time and resources.
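
As a rough illustration of those steps, here’s a minimal sketch in plain Python, using only the standard library. The sample message and the toy term extraction are illustrative assumptions – and in a DIY stack you’d still have to package logic like this as MapReduce or Spark jobs and build the scheduling around it yourself:

```python
# Illustrative sketch of the parse -> cleanse -> extract-terms steps for
# HTML email. Standard library only; not a production pipeline.
import re
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Strip tags and collect the visible text of an HTML document."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def parse(html_doc: str) -> str:
    extractor = TextExtractor()
    extractor.feed(html_doc)
    return " ".join(extractor.chunks)

def cleanse(text: str) -> str:
    # Collapse runs of whitespace into single spaces.
    return re.sub(r"\s+", " ", text).strip()

def extract_terms(text: str, min_length: int = 4) -> list[str]:
    # Toy "term extraction": lowercase words above a length threshold.
    return [w.lower() for w in re.findall(r"[A-Za-z]+", text) if len(w) >= min_length]

if __name__ == "__main__":
    sample = "<html><body><p>Order #1042 delayed; customer requests refund.</p></body></html>"
    terms = extract_terms(cleanse(parse(sample)))
    print(terms)  # e.g. ['order', 'delayed', 'customer', 'requests', 'refund']
```

Writing the extraction logic is the easy part; running it reliably over millions of messages, on a schedule, with retries and monitoring, is where the ongoing engineering effort really goes.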

4. Does it meet IT’s standards? You’ve got a great, working proof of concept running in the lab and everyone’s impressed. However, bringing a demo into production can be challenging. Does your DIY warehouse meet your IT department’s standards? How will the warehouse fit into their operational workflow? How will you secure access? Is data encrypted at rest? How will this tie into your identity infrastructure?

If you’re only now starting to think about the robust privacy, security, and governance requirements your big data solution has to meet – and whether it can adapt as those requirements evolve – it’s probably too late. It’s hard to rework a DIY solution to meet those requirements retroactively. You need to start with governance and security in mind from the very beginning to avoid problems later.

Enterprise IT departments, after all, take many of these features for granted. They naturally assume that any database platform will have encrypted storage, Active Directory integration, rigorous audit logging, and the ability to define fine-grained access control policies. If your solution doesn’t check all of these boxes, it’s not going anywhere near the production network.

5. Can you bring your big data warehouse into production? After digesting all the questions an IT department will put to you, you may be wondering whether it’s even possible to meet those security standards on your own. Unfortunately, building the necessary security features into a DIY solution isn’t easy, and stock Hadoop doesn’t offer any quick fixes.

Even basic encryption and Active Directory integration are complicated to get right. The default access control mechanisms are coarse-grained; you won’t be able to give different users different levels of access to the same data. For example, if your platform serves customer service agents who need full records and analysts who are only authorized to see de-identified information, those default mechanisms aren’t going to cut it.
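
To make that gap concrete, here’s a minimal sketch – plain Python, with hypothetical roles, field names, and masking rules – of the kind of role-aware filtering a production platform is expected to enforce. Nothing like this comes out of the box with a bare Hadoop install; you would have to design, build, and maintain it yourself, or layer on additional access-control tooling such as Apache Ranger:

```python
# Hypothetical sketch of role-aware record filtering: customer service
# agents get the full record, analysts get a de-identified view.
# Roles, field names, and masking rules are illustrative only -- a real
# platform would enforce this centrally, not in application code.
import hashlib

FULL_ACCESS_ROLES = {"customer_service"}
PII_FIELDS = {"name", "email", "phone"}

def pseudonymize(value: str) -> str:
    """Replace a PII value with a stable, non-reversible token."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:12]

def view_for(record: dict, role: str) -> dict:
    """Return the record as a given role is allowed to see it."""
    if role in FULL_ACCESS_ROLES:
        return dict(record)
    return {
        field: pseudonymize(value) if field in PII_FIELDS else value
        for field, value in record.items()
    }

record = {"name": "Jane Doe", "email": "jane@example.com",
          "phone": "555-0100", "churn_risk": 0.82}
print(view_for(record, "customer_service"))  # full record
print(view_for(record, "analyst"))           # PII fields pseudonymized
```

And that’s just one policy for one dataset. A production platform has to enforce rules like this consistently across every table, audit who accessed what, and keep it all in sync with your identity infrastructure.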

So maybe it’s time to explore some other options before you commit yourself to a complicated DIY warehouse. What about a turnkey solution?
