It’s been a bit of a journey since I first set up this blog and started this post. I got stuck following a number of rabbit holes, and I’ve come to the conclusion that the best place for most people to start their journey of learning Big Data/Hadoop/Hive is with the Hortonworks Sandbox, which you can download from here.
Let me start with some obvious reasons to use this pre-built virtual machine, a couple of reasons I decided to build my own cluster instead, and finally why I came back to this spot.
What could be easier than an already pre-built, pre-configured virtual machine? I’m a big fan of pre-built virtual machines, and Hortonworks has also added test data and tutorials. It’s a great place to start.
Problems. First, it takes 8 GB of RAM to run. Not to run really well; just barely enough to run at all. Meanwhile, my laptop only has 8 GB of RAM total. You will need a machine with 16 GB of RAM to run this virtual machine; more is better, as always. And you can’t get around this by building a cluster of smaller VMs. You simply can’t get Hortonworks Hadoop and Hive up and running on such a limited machine.
So I went out and bought a pair of old enterprise data center servers. Yes, I have a very forbearing wife, and now my living room sounds like a data center. At least winter is upon us in Chicago, so the heat they generate is an added bonus. I now have two servers, each with 8 cores and 32 GB of RAM. One of these is actually enough.
The next “why not” was my initial displeasure trying to connect to the Hortonworks VM from “outside” using tools like MicroStrategy, Tableau, and other SQL tools. I had trouble SSHing into the VM. What are the passwords? How is the VM set up? I don’t like black boxes. Mind you, there were answers to all of these questions, but then we have the next enticement.
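For reference, here are the connection settings that eventually worked for me with a recent HDP Sandbox. The hostname is a placeholder, and the ports and default credentials can differ between releases, so treat this as a sketch and check the Hortonworks documentation for your version:

```shell
# SSH into the sandbox guest OS: the appliance forwards guest SSH to
# host port 2222 (plain port 22 lands you on the host layer instead).
# The default root password is "hadoop"; you're forced to change it
# on first login. "sandbox-host" is a placeholder for your VM's address.
ssh root@sandbox-host -p 2222

# The Ambari web UI (cluster management) is typically on port 8080:
#   http://sandbox-host:8080

# BI tools such as Tableau or MicroStrategy talk to HiveServer2 over
# JDBC, usually on port 10000:
#   jdbc:hive2://sandbox-host:10000/default
```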
Hadoop is all about distributed computing. How can you learn about distributed computing on a single-node “cluster”? I had these new (to me) powerful servers, so off I went to build my own cluster.
I took a detour looking into “containers” and Docker. I was under the impression that containers would let me overcommit the server’s memory, and in theory you can. A virtual machine sets aside whatever memory you give it and keeps it all for itself, whether it’s using it or not. Containers, by contrast, share the host’s memory, so a set of containers can collectively be promised more memory than the server has; as long as they aren’t all trying to use it at the same time, you are good to go. And how much memory could my development cluster really need at once? Well, it was a rabbit hole. A fun rabbit hole, but nonetheless, I never got a working set of Docker containers for a Hadoop cluster. Not yet, anyway; I set it aside and went back to VMs.
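The overcommit idea I was chasing can be sketched with plain Docker flags. The image, container names, and sizes here are illustrative, not from a working Hadoop setup:

```shell
# Three containers on a hypothetical 8 GB host, each capped at 4 GB.
# The hard limits (-m) sum to 12 GB, more than physical RAM, which
# Docker happily allows; --memory-reservation sets a softer target
# that the kernel only enforces under memory pressure.
docker run -d --name node1 -m 4g --memory-reservation 1g ubuntu:20.04 sleep infinity
docker run -d --name node2 -m 4g --memory-reservation 1g ubuntu:20.04 sleep infinity
docker run -d --name node3 -m 4g --memory-reservation 1g ubuntu:20.04 sleep infinity

# A VM, by contrast, pins its full allocation up front: three 4 GB
# VMs simply won't all start on the same 8 GB host.
```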
My Vagrant script approach, which worked fine on my Windows 10 laptop, didn’t work on my Windows Server 2016 system. Man, was THAT a challenge to debug. Take all the normal difficulties of debugging an install of any server software that doesn’t “just work the first time” and multiply them by the number of nodes in your cluster. Did I need a DNS server? Did I have the wrong version of Python? Etc., etc. The answer turned out to be the latest version of Ambari (2.6 at the time), and I got my cluster up. But keeping the d@@#$$ thing up was a whole different story. I finally grew tired of whack-a-mole restarts of my services and looked back at the Hortonworks Sandbox VM.
With a little Google sleuthing, I found the answers to “how do I connect to this VM,” what the logins were, and how to SSH in. With the plentiful RAM on my new (very old) server, I upped the VM’s RAM to 16 GB. If you have more RAM to spare, more is better.
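If you’re running the sandbox under VirtualBox, the RAM bump can be done from the command line. The VM name below is a placeholder; use whatever name your imported appliance is registered under:

```shell
# Find the exact VM name as registered in VirtualBox.
VBoxManage list vms

# With the VM powered off, raise its memory to 16 GB (value is in MB).
# "Hortonworks Sandbox" is a placeholder for your actual VM name.
VBoxManage modifyvm "Hortonworks Sandbox" --memory 16384

# Start it headless; you reach the services over the forwarded ports.
VBoxManage startvm "Hortonworks Sandbox" --type headless
```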
Now, on to the next lessons: working with Hadoop/Hive using the HDP Sandbox.