Monday, April 1, 2013

BI/Analytics on Big Data/Cassandra: Vertica, Acunu and Intravert(!?)


As part of our presentation up at NYC* Big Data Tech Day, we noted that Hadoop didn't really work for us.  It was great for ingesting flat files from HDFS into Cassandra, but the map/reduce jobs that used Cassandra as input didn't cut it.  We found ourselves contorting our solutions to fit within the map/reduce framework, which required developer-level capabilities.  We had to add complexity into the system to do batch management/composition, and in the end the map/reduce jobs took too long to complete.

Eventually, we swapped out Hadoop for Storm.  That allowed us to do real-time cumulative analytics.  And most recently, we converted our topologies to Trident.  Handling all CRUD operations through Storm allowed us to perform roll-up metrics by different dimensions using Trident State.  (Additionally, we can write to wide-rows for indexing, etc.)

This is working really well, but we are seeing increasing demand from our data scientists and customers to support "ad hoc" dimensional analysis, dashboards, and reporting.  Elastic Search keeps us covered on many of the ad hoc queries, but aside from facets, it has little support for real-time dimensional aggregations, and no support for dashboards and reports.

We turned to the industry to find the best of breed.  With some help from others that have traveled this road, (shout out to @elubow), we settled on Vertica, Infobright and Acunu as contenders.  I quickly grabbed VM's from each of them and went to work.

WARNING: What I'm about to say is based on a few days experimentation, and largely consists of initial first impressions.  It has no basis on real production experience. (yet =)

First up was Acunu.  Although each of the VMs functioned as an appliance, when logging into the VM and playing around with things, we were most at home with Acunu.  Acunu is backed by Cassandra.  Having C* installed and running as the persistence layer was like having an old friend playing wingman on an initial first date.  (they can bail you out if things start going south =)

Acunu had a nice REST API and a simple enough web-based UI to manage schemas and dimensions.  Within minutes, I was inserting data from a ruby script and playing around with dashboards.... until something went wrong and the server starting throwing OoM's.  After a restart, things cleared up, but it left me questioning the stability a bit.  (once again, this was a *single* vm running on my laptop, so it wasn't the most robust environment)

Next, I moved on to Vertica.  From a features and functions point of view, Vertica looked to be leaps and bounds ahead.  It had sophisticated support for R, which would make our data scientists happy.  It also has compression capabilities, which will make our IT/Ops guys happy.  And it looked to have some sophisticated integration with Hadoop, just in case we ever wanted/needed to support deep analytics that could leverage M/R.

That said, it was far more cumbersome to get up and running, and felt a bit like I went backwards in time.  I couldn't find a REST API. (please let me know if someone has one for Vertica)  So, I was left to go through the hoop-drill of getting a JDBC client driver, which was not available in public repos, etc.  When using the admin tool provided on the appliance, I felt like I was back in middle school (early 90's) installing linux via an ANSI interface on an Intel 8080.  In the end however, I grew accustomedto their client (vsql) and was happily hacking away over the JDBC driver and it felt fairly solid.

Although we are still interested in pursuing both Acunu and Vertica, both experiences left me wanting.   What we really want is a fully open-source solution (preferably apache license) that we are free to enhance, supplement, etc.... with optional commercial support.

That got me thinking about Edward Capriolo's presentation on Intravert.   If I boil down our needs into "must-haves" and "nice-to-haves", what we really *need* is just an implementation of Rainbird.  (http://www.slideshare.net/kevinweil/rainbird-realtime-analytics-at-twitter-strata-2011)

AS AN ASIDE:
Does anyone know what happened to Rainbird?  I've been trying to get the answer, to no avail.
http://www.youtube.com/watch?v=84k7o4GdkQg

Now, time for crazy talk...
Intravert provides a slick REST API for CRUD operations on Cassandra.  As I said before, I'm a *huge* REST fan.  It provides the loose-coupling for everything in our polyglot persistence architecture.    Intravert also provides a loosely coupled eventing framework to which I can attached handlers.   What if I implemented a handler, that took the CRUD events, and updated additional column families with the dimensional counts/aggregations???    If I then combine that with a javascript framework for charting, how far would that get me?  (60-70% solution?)

To be clear, I'm not bashing Vertica or Acunu.  Both have solid value propositions and they are both contenders in our options analysis.  I'm just mourning the fact that there seems to be no good open-source solution in this space like there are in others.  (Neo4j/TitanDB for graphs, Elastic Search/SOLR for search, Kafka/Kestrel for queueing, Cassandra for Storage, etc.)

We are also considering Druid and Infobright, but I haven't gotten to them yet:
https://github.com/metamx/druid

Please don't bash me for early judgments.
I'm definitely interested in hearing people's thoughts.



No comments: