Thursday, January 30, 2014

Book Review : Web Crawling and Data Mining with Apache Nutch

In our space, some of the most current healthcare-related information is found on the internet.  We harvest that information as input to our healthcare masterfile.  Our crawlers run against hundreds of websites.  We have a fairly large web harvester, which is what drove me to explore Nutch with Cassandra (see Crawling the web with Cassandra).

When Web Crawling and Data Mining with Apache Nutch came out, I was eager to have a read.   The first quarter of the book is largely introductory.  It walks you through the basics of operating Nutch and the layers in the design: Injecting, Generating, Fetching, Parsing, Scoring and Indexing (with SOLR).

For me, the book got a bit more interesting when it covered the Nutch Plugin architecture.  HINT: Take a look at the overall architecture diagram on Page 34 before you start reading!

The book then covers deployment and scaling.   A fair amount of time is spent on SOLR deployment and scaling (via sharding), which in and of itself may be valuable if you are a SOLR shop.   That is less useful if you are Elastic Search (ES) fans -- in fact, it was one of the reasons why we moved to ES ;)

About midway through the book, the real fun starts when the author covers how to run Nutch with/on Hadoop.  This includes detailed instructions on Hadoop installation and configuration.  This is followed by a chapter on persistence mechanisms, which uses Gora to abstract away the actual storage.

Overall, this is a solid book, especially if you are new to the space and need detailed, line-by-line instructions to get up and running.  To kick it up a notch, it would have been nice to have a smattering of use cases and real-world examples, but given the book is only about a hundred pages, it does a good job of balancing utility with color commentary.

The book is available from PACKT here:

Wednesday, January 29, 2014

Looking for your aaS? (IaaS vs. PaaS vs. SaaS vs. BaaS)

Our API is getting a lot of traction these days.  We enable our customers to perform lookups against our masterfile via a REST API.  Recently, we've also started exposing our Master Data Management (MDM) capabilities via our REST API.  This includes matching/linking, analysis, and consolidation functionality.  A customer can send us their data, we will run a sophisticated set of fuzzy matching logic attempting to locate the healthcare entity in our universe (i.e. "match"). We can then compare the attributes supplied by our customers with those on the entity in our universe, and decide which are the most accurate attributes. (i.e. "consolidate")  Once we have the consolidated record, we run analysis against that record to look for attributes that might trigger an audit.
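
To make the match / consolidate / analyze flow a bit more concrete, here is a minimal sketch in JavaScript.  Everything in it is hypothetical: the function names, the token-overlap similarity, the threshold, and the record fields are stand-ins for illustration, not our actual matching logic or API.

```javascript
// Hypothetical sketch of match -> consolidate -> analyze.
// None of these names or rules come from the real API.

// Split a name into a set of lowercase alphanumeric tokens.
function tokens(text) {
  return new Set(String(text).toLowerCase().split(/[^a-z0-9]+/).filter(Boolean));
}

// Jaccard similarity: shared tokens divided by total distinct tokens (0 to 1).
function similarity(a, b) {
  const ta = tokens(a);
  const tb = tokens(b);
  if (ta.size === 0 && tb.size === 0) return 0;
  let shared = 0;
  ta.forEach(function (t) { if (tb.has(t)) shared += 1; });
  return shared / (ta.size + tb.size - shared);
}

// "Match": return the best-scoring master entity, or null if below the threshold.
function match(record, masterfile, threshold) {
  let best = null;
  let bestScore = 0;
  masterfile.forEach(function (entity) {
    const score = similarity(record.name, entity.name);
    if (score > bestScore) { best = entity; bestScore = score; }
  });
  return bestScore >= threshold ? best : null;
}

// "Consolidate": keep master values, fill empty attributes from the customer record.
function consolidate(record, entity) {
  const merged = Object.assign({}, entity);
  Object.keys(record).forEach(function (key) {
    const value = merged[key];
    if (value === undefined || value === null || value === '') merged[key] = record[key];
  });
  return merged;
}

// "Analyze": list required attributes that are still empty (a stand-in audit trigger).
function analyze(consolidated, requiredFields) {
  return requiredFields.filter(function (field) { return !consolidated[field]; });
}
```

A real implementation would use far richer matching than token overlap, and the consolidation rule (master wins, customer fills gaps) is just one of many possible policies.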

I've always described this as a Software as a Service (SaaS) offering, but as we release more and more of our MDM capabilities via the REST API, it is beginning to feel more like Platform as a Service (PaaS).  I say that because we allow our tenants/customers/clients to deploy logic (code) for consolidation and analytics.  That code runs on our "platform".

That got me thinking about the differences between Infrastructure as a Service (IaaS), Platform as a Service (PaaS), Back-end-as-a-Service (BaaS), and Software as a Service (SaaS).  Let's first start with some descriptions.  (all IMHO)

IaaS: This service is the alternative to racks, blades and metal.  IaaS allows you to spin-up new virtual machines, provisioned with an operating system and potentially a framework.  From there you are on your own.  You need to deploy your own apps, etc.  (BYOAC == Bring your own Application Container)

PaaS: This service provides an application container.   You provide the application, potentially built with a provider's tools/libraries, then the service provisions everything below that. PaaS adds the application layer on top of IaaS. (BYOA == Bring your own Application)

SaaS: These services expose specific business functionality via an interface.  Consumers typically consume the services off-premise over the web.  In most cases, SaaS refers to some form of web services and/or user interface.   (Either no BYO, or BYOC == Bring your own Configuration)

BaaS:  For me, there is a blurred line between BaaS and SaaS.  From the examples I've seen, BaaS often refers to services consumed by mobile devices.  Often, the backend composes a set of other services and allows the mobile application to offload much of the hard work. (user management, statistics, tracking, notifications, etc)  But honestly, I'm not sure if it is the composition of services, the fact that they are consumed from mobile devices, or the type of services that distinguishes BaaS from SaaS.  (ideas anyone?)

Of course, each one of these has pros/cons, and which one you select as the foundation for your development will depend highly on what you are building.  I see it as a continuum:

The more flexibility you need, the more overhead you have to take on to build out the necessary infrastructure on top of the lower level services.  In the end, you will likely have to blend all of these.

We consume SaaS, build on PaaS (Salesforce), leverage IaaS (AWS), and expose interfaces for both PaaS and SaaS!

Any which way you look at it, that's a lot of aaS!

Tuesday, January 28, 2014

Mesos on Mac OS X Mavericks (SOLVED: "Could not link test program to Python")

Continuing on my expedition with Scala and Spark, I wanted to get Mesos working (underneath Spark).  I ran into a couple of hiccups along the way...

First, download Mesos:

Unpack the tar ball, and run "./configure".
If you are running Mavericks, and you've installed Python using brew, you may end up with:

configure: error:
  Could not link test program to Python. Maybe the main Python library has been
  installed in some non-standard library path. If so, pass it to configure,
  via the LDFLAGS environment variable.
  Example: ./configure LDFLAGS="-L/usr/non-standard-path/python/lib"
   You probably have to install the development version of the Python package
   for your distribution.  The exact name of this package varies among them.

It turns out there is a bug in python that prevents Mesos from properly linking. Here is the JIRA issue on the mesos project:

To get around this bug, I needed to downgrade python.

With brew, you can use the following commands:
bone@zen:~/tools/mesos-0.15.0-> brew install homebrew/versions/python24
bone@zen:~/tools/mesos-0.15.0-> brew unlink python
bone@zen:~/tools/mesos-0.15.0-> brew link python24
After that, the configure will complete, BUT -- the compilation will fail with:
In file included from src/
./src/glog/stl_logging.h:56:11: fatal error: 'ext/slist' file not found
# include <ext/slist>
For this one, you are going to want to get things compiling with gcc (instead of clang). Use the following:
brew install gcc47
rm -rf build
mkdir build
cd build
CC=gcc-4.7 CXX=g++-4.7 ../configure
After that, you should be able to "sudo make install" and be all set. Happy Meso'ing.

Monday, January 27, 2014

Scala IDE in Eclipse (with 2.9.x and Juno... or not)

I'm taking the plunge into Scala to determine if it has any benefits over Java.   To motivate that, I decided to play around with Spark/Shark against Cassandra.  To get my feet wet, I set out to run Spark's example Cassandra test (and perhaps enhance it to use CQL).

First, I needed to get my IDE set up to handle Scala.  I'm an Eclipse fan, so I just added in the Scala IDE for Eclipse. (but make sure you get the right Scala version! see below!)

Go to Help->Install New Software->Add, and use this url:

Race through the dialog boxes to install the plugin, which will require you to restart.

For me, I was working with a Java project to which I wanted to add the CassandraTest scala class from Spark.  If you are in the same situation, and you have an existing Java project, you will need to add the Scala nature in Eclipse.  Do this by right-clicking on the project, then Configure->Add Scala Nature.

At this point, you can start the Scala interpreter by right-clicking on the project, then Scala->Create Scala Interpreter.

I was happy -- for a moment.  I was all set up, but Eclipse started complaining that certain jar files were "cross-compiled" using a different version of Scala: an older version, 2.9.x.  Unfortunately, I had Storm in my project, which appeared to be pulling in files compiled with 2.9.x.

So, I uninstalled the Scala IDE plugin because it appeared to work only with 2.10.x.  I needed to downgrade to an older version of the Scala IDE to get 2.9.x support.  That forced me on to an experimental version of Scala IDE  because I needed 2.9.x support in Juno.  Unfortunately, after re-installing the old version, I lost the ability to add the Scala nature. =(


I decided to go hack it at the command-line.  I followed this getting started guide to add Scala to my maven pom file.  That worked like a champ.  And I could run the CassandraTest.

So, at this point, I'm editing the files in Eclipse, but running via command-line.  I'm not sure Scala IDE will bother supporting 2.9.x in Juno or Kepler, because they seem to have moved on.  But if anyone has any idea how to get Scala IDE with 2.9.x support in Juno, I'm all ears. (@jamie_allen, any ideas?)

Thursday, January 16, 2014

Jumping on the CQL Bandwagon (a tipping point to migrate off Astyanax/Hector?)

It's been over a year since we started looking at CQL. (see my blog post from last October)

At first we didn't know what to make of CQL.   We were heavily invested in the thrift-based APIs (Astyanax + Hector).  We had even written a REST API called Virgil directly on top of Thrift (which enabled the server to run an embedded Cassandra).  

But there was a fair amount of controversy around CQL, and whether it was putting "SQL" back into "NoSQL".  We took a wait-and-see approach to see how much CQL and the thrift-based API diverged.  The Cassandra community pledged to maintain the thrift layer, but it was clear that Datastax was throwing its weight behind the new CQL java driver.  It was also clear that newcomers to Cassandra might start with CQL (and the CQL java-driver), especially if they were coming from a SQL background.

Here we are a year later, and with the latest releases of Cassandra, (IMHO) we've hit a tipping point that has driven this C* old-timer to begin the migration to CQL.   Specifically, there are three things that CQL has better support for:

Lightweight Transactions: These are conditional inserts and updates.  In CQL, you can add an IF clause to the end of a statement (e.g. IF NOT EXISTS on an insert, or IF <column> = <value> on an update), and that condition is verified before the upsert occurs. This is hugely powerful in a distributed system, because it helps accommodate distributed reads-before-writes.  A client can add a condition which will prevent the update if it was working with stale information. (e.g. by checking a timestamp or checksum and only updating if that timestamp or checksum hasn't changed)

Batching:  This allows the client to group statements.  The batch construct can guarantee that either all the statements will succeed, or all will fail.  Even though it doesn't provide isolation, meaning other clients will see partially committed batches, this is still a very important construct when creating consistent systems that scale because you end up batching in the client to reduce the database traffic.

Collections: When you do enough data modeling on top of Cassandra, you end up building on top of the row key / sorted column key structure using composite columns.  And although it is amazing what you can accomplish with that simple structure, a lot of effort is spent marshaling in and out of those primitive structures.  Collections offer a convenient translation layer on top of those primitives, which simplifies things.  You can always drop down into the primitives, when need be, but sometimes it's nice to have a simple list, map, or set at hand.
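
The stale-write protection described under Lightweight Transactions can be sketched with a small in-memory model.  This is not the Cassandra driver: it is plain JavaScript that only mimics the compare-and-set semantics, and the table, column, and function names are hypothetical.  The `applied` flag loosely mirrors the `[applied]` column that Cassandra returns for conditional statements.

```javascript
// In-memory model of conditional inserts and updates (compare-and-set).
// It mimics the semantics of CQL statements such as (names are hypothetical):
//   INSERT INTO entities (id, name, version) VALUES (42, 'a', 1) IF NOT EXISTS;
//   UPDATE entities SET name = 'c', version = 2 WHERE id = 42 IF version = 1;

function createStore() {
  const rows = new Map();
  return {
    // Mimics INSERT ... IF NOT EXISTS: only applied when the key is absent.
    insertIfNotExists: function (id, row) {
      if (rows.has(id)) return { applied: false, current: rows.get(id) };
      rows.set(id, Object.assign({}, row));
      return { applied: true };
    },
    // Mimics UPDATE ... IF version = expectedVersion: rejected when the
    // caller read a version that has since changed (i.e. stale information).
    updateIfVersion: function (id, expectedVersion, changes) {
      const current = rows.get(id);
      if (!current || current.version !== expectedVersion) {
        return { applied: false, current: current };
      }
      rows.set(id, Object.assign({}, current, changes, { version: expectedVersion + 1 }));
      return { applied: true };
    },
    // Plain read of the current row.
    get: function (id) { return rows.get(id); }
  };
}
```

If two clients both read version 1 and then both call updateIfVersion(42, 1, ...), the first call is applied and the second is rejected, which is exactly the situation the condition is meant to catch.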

Now -- don't get me wrong.  I'm still a *huge* Astyanax fan, and it still provides some convenience capabilities that AFAIK are not yet available in CQL.  (e.g. the Chunked Object Store)  But as we guessed a while back, it looks like CQL will offer better support for newer C* features.

SOOO ----
I've started on a rewrite of Virgil that offers up CQL capabilities via REST.  I'm calling the project memnon.  You can follow along on github as I build it out.

Additionally, I started rewriting the Storm-Cassandra bolt/state mechanisms to ride on top of CQL.  You can see that action on github as well.

More to come on both of those.

Tuesday, January 14, 2014

ElasticSearch from AngularJS (fun w/ elasticsearch-js!)

We've recently switched over to AngularJS (from ExtJS).  And if you've been following along at home, you know that we are *HUGE* ElasticSearch fans.  So today, I set out to answer the question, "How easy would it be to hit Elastic Search directly from javascript?"  The answer lies below. =)

First off, I should say that we recently hired a rock-star UI architect (@ddubya) that has brought with him Javascript Voodoo the likes of which few have seen.  We have wormsign... and we now have grunt paired with bower for package management.  Wicked cool.

So, when I set out to connect our Angular App to Elastic Search, I was pleased to see that Elastic Search Inc recently announced a javascript client that they will officially support.  I quickly raced over to bower and found a repo listed in the registry...

I then spent the next two hours banging my head against a wall.

Do NOT pull the git repo via bower even though it is listed in the bower registry!  There is an open issue on the project to add support for bower.  Until that closes, the use of Node's require() within the angular wrapper prevents it from running inside a browser.  Using browserify, the ES guys kindly generate a browser compatible version for us.   So *use* the browser download (zip or tar ball) instead!

Once you have the download, just unzip it in your app.  Load the library with a script tag:
<script src="third_party_components/elasticsearch-js/elasticsearch.angular.js"></script>
That code registers the elastic search module and creates a factory that you can use:
angular.module('elasticsearch', [])
  .factory('esFactory', ['$http', '$q', function ($http, $q) {

    var factory = function (config) {
      config = config || {};
      config.connectionClass = AngularConnector;
      config.$http = $http;
      config.defer = function () {
        return $q.defer();
      };
      return new Client(config);
    };

    return factory;
  }]);
You can then create a service that uses that factory, creating an instance of a client that you can use from your controller, etc:
  .service('es', function (esFactory) {
    return esFactory({
      host: 'search01:9200'
    });
  })
Then, assuming you have a controller defined similar to:
  .controller('searchController', ['$window', '$scope', 'es', function ($window, $scope, es) {
The following code, translated from their example, works like a champ!
es.ping({
  requestTimeout: 1000,
  hello: "elasticsearch!"
}, function (error) {
  if (error) {
    console.error('elasticsearch cluster is down!');
  } else {
    console.log('All is well');
  }
});
From there, you can extend things to do much more... like search! =)
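
As a rough sketch of that next step, here is what a small search helper might look like.  The index name ('providers'), the field ('name'), and both function names are hypothetical; the only thing assumed of the client is a promise-returning search() that accepts index, size, and body, which is how the `es` service created above behaves when no callback is passed.

```javascript
// Sketch of a search helper; index and field names are hypothetical.

// Build the request for a simple full-text match query.
function buildSearchRequest(term, size) {
  return {
    index: 'providers',
    size: size || 10,
    body: {
      query: {
        match: { name: term }
      }
    }
  };
}

// Run the search and reduce the response to the matching documents.
// `client` is anything with a promise-returning search(), e.g. the `es` service.
function searchByName(client, term) {
  return client.search(buildSearchRequest(term)).then(function (resp) {
    return resp.hits.hits.map(function (hit) { return hit._source; });
  });
}
```

Inside the controller, something like searchByName(es, 'smith').then(function (docs) { $scope.results = docs; }) would then push the hits onto the scope.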

Kudos to the Elastic Search Inc crew for making this available.

Thursday, January 9, 2014

WTF is an architect anyway?

In full disclosure, I'm writing this as a "Chief" Architect (I can't help but picture a big headdress), and I've spent the majority of my career as an "architect" (note the air quotes).  And honestly, I've always sought out opportunities that came with this title.  I think my fixation came largely from the deification of the term in the Matrix movies.

But in reality, titles can cause a lot of headaches, and when you need to scale an organization to accommodate double digit growth year over year, "architects" and "architecture" can help... or hurt that growth process.  Especially when architecture is removed/isolated from the implementation/development process, we know that ivory-tower architecture kills.

This day and age however, a company is dead if it doesn't have a platform.  And once you have a critical number of teams, especially agile teams that are hyper-focused only on their committed deliverables, how do you cultivate a platform without introducing some form of architecture (and "architects")?

I've seen this done a number of ways.  I've been part of an "Innovative Architecture Roadmap Team", an "Enterprise Architecture Forum", and even a "Shared Core Services Team".  All of these sought to establish and promote a platform of common reusable services.  Looking back, the success of each of these was directly proportional to the extent to which the actual functional development teams were involved.

In some instances, architects sat outside the teams, hawking the development and injecting themselves when things did not conform to their vision.  (Read as: minimal team involvement).  In other cases, certain individuals on each team were anointed members of the architecture team.  This increased involvement, but still restricted architectural influence (and consequently buy-in) to the chosen few.   Not only is this less than ideal, but it also breeds resentment.  Why are some people anointed and not others?

Consider the rock-star hotshot developer that is right out of college.  He or she may have disruptive, brilliant architectural insights because dogma hasn't found them yet.  Unfortunately, this likely also means that they don't have the clout to navigate political waters into the architectural inner circle.  Should the architecture suffer for this?  Hell no.

So, what do we do?  I suggest we change the flow of architecture.  In the scenarios I've described thus far, architecture was defined by and emanated from the architectural inner circle.  We need to invert this.  IMHO, an architectural approach that breeds innovation is one that seeks to collect and disseminate ideas from the weeds of development. 

Pave the road for people that want to influence and contribute to the architecture and make it easy for them to do so.  In this approach, everyone is an architect.  Or rather, an architect is a kind of person: a person that wants to lift their head up, look around, and contribute to the greater good.

That sounds a bit too utopian.  And it is.  In reality, architectural beauty is in the eye of the beholder, and people often disagree on approach and design.  In most cases, it is possible to come to consensus, or at least settle on a path forward that provides for course correction if the need should arise.  

But there are cases when that doesn't happen.  In these cases, I've found it beneficial to bring a smaller crew together, to set aside the noise, leave personal passions aside, and make a final call. Following that gathering, no matter what happened in the room, it is the job of those people to champion the approach.

In this capacity, the role of "architects" is to collect, cultivate and champion a common architectural approach.  (pretty picture below)

To distinguish this construct from pre-conceived notions of "architecture teams" and "architects" (again, emphasis on the air quotes), I suggest we emphasize that this is a custodial function, and we start calling ourselves "custodians".

Then, we can set the expectation that everyone is an architect (no air quotes), and contributes to architecture.  A few custodians then resolve stalemates and care for, nurture, and promote the architecture to create a unified approach/platform.

I'm considering changing my title to Chief Custodian.  I think the janitorial imagery that it conjures up is a closer likeness anyway.   Maybe we can get Hollywood to come out with a Matrix prequel that deifies a Custodian. =)