rhive — Victor Clark

Taming RHive

When your tech stack doesn't allow for Python, for whatever reason. Alternate title: First World Problems.

R logo used under GPL-2. Hive logo copyright of parent company.

So you want to connect to your big data warehouse using R. Why R? Maybe you're consulting for a team that has never heard of Python. Maybe you're working with a bunch of academics. Or maybe you're into developing with one arm tied behind your back. Whatever your reason, I get it; we've all been there.

In talking with an expert in the field -- expert in that they worked for a company that provides a solution for managing the ecosystem that includes Hive -- the recommendation was to go with Python. Unfortunately, my particular project didn't have the bandwidth to add another language to the stack. One of the great things about working with a bunch of different technologies is picking up tricks in one that you can apply to the others. This is the solution I found.

For context, this project was using a certain big data company's 'quickstart' VirtualMachine image. Usually I prefer to build things from scratch, but for demo-ing the technology, the quickstart was the better fit. I needed to access the data that lived in Hive through R. After taking a look at solutions other people offered, I came up dry -- there's a well-supported R library for Hadoop, but none of the projects at the time supported Hive. R isn't the best for creating pipelines, but it does offer a couple of database connectors, ODBC and JDBC; Hive is written in Java, so I decided JDBC would be the way to go. Some tinkering, and voilà! R was talking directly to Hive.

Below is a gist file to get you started. After loading the libraries into R (specifically RServer, if you're using the quickstart VirtualMachine image based on CentOS), you will need to load the JARs so R can talk to Hive. Happy RHiving!
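
For readers who want something to start from, here is a minimal sketch of the JDBC approach described above. It is not the author's gist. It assumes the RJDBC package (which pulls in rJava and DBI), a HiveServer2 endpoint on its default port 10000, and the standard Hive JDBC driver class `org.apache.hive.jdbc.HiveDriver`. The helper names (`hive_jdbc_url`, `find_jars`, `connect_hive`), the host, and the JAR directories in the usage comment are hypothetical; adjust them to wherever your image keeps the Hive and Hadoop JARs.

```r
# Sketch only: helper names, host, port, and JAR locations are assumptions.

# Build a HiveServer2 JDBC URL from its parts.
hive_jdbc_url <- function(host = "localhost", port = 10000, database = "default") {
  sprintf("jdbc:hive2://%s:%d/%s", host, as.integer(port), database)
}

# Collect every .jar file directly under the given directories.
find_jars <- function(dirs) {
  list.files(dirs, pattern = "\\.jar$", full.names = TRUE)
}

# Load the JARs onto the Java class path and open a connection to Hive.
connect_hive <- function(jar_dirs, url = hive_jdbc_url(), user = "", password = "") {
  if (!requireNamespace("RJDBC", quietly = TRUE)) {
    stop("The RJDBC package is required: install.packages('RJDBC')")
  }
  jars <- find_jars(jar_dirs)
  if (length(jars) == 0) {
    stop("No JARs found in: ", paste(jar_dirs, collapse = ", "))
  }
  drv <- RJDBC::JDBC(
    driverClass = "org.apache.hive.jdbc.HiveDriver",
    classPath = jars,
    identifier.quote = "`"
  )
  DBI::dbConnect(drv, url, user, password)
}

# Example usage (directories are guesses for a quickstart-style image):
# conn <- connect_hive(c("/usr/lib/hive/lib", "/usr/lib/hadoop"))
# DBI::dbGetQuery(conn, "SHOW TABLES")
# DBI::dbDisconnect(conn)
```

Keeping the URL and JAR discovery in small functions makes it easy to check the class path before any Java code runs, which is where most of the tinkering tends to happen.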
