Tuesday, September 11, 2012

Compound/Composite Keys: Connecting the dots between CQL3, Astyanax and Hector


I still don't know where I stand on the terminology debate, but I know things can get confusing if you are accessing your database from CQL as well as Java.  They are very different views of the world.  Hopefully this article can bridge that gap and explain the different views with respect to composite keys.

Before we get started, we should get our terminology straight.  Cassandra has a CompositeType data type.  This is *not* what we'll be discussing.  Instead, we're discussing compound column names.  CQL3 calls these compound keys, while Astyanax and Hector call them Composites.

Underneath the hood, Cassandra's storage model hasn't changed to accomodate composite keys.  Composite keys are really just a fancy bit-wise concatenation of the key components prior to storage and/or retrieval.  The storage model remains, rowkey -> columnname -> value.

Thus, you have to bend your brain a bit when seeing Cassandra through CQL.  Since multiple "columns" as viewed by CQL, will become a fewer number of columns in the physical storage.  And since the Java APIs provide direct access to the storage model, it will appear as one column through the Java APIs (at least initially).

Let's consider the following CQL as our example:

CREATE TABLE fishblogs (
  userid varchar,
  when timestamp,
  fishtype varchar,
  blog varchar,
  image blob,
  PRIMARY KEY (userid, when, fishtype)
);

Although its not apparent from the CQL, the PRIMARY KEY statement uses the ordering of the parameters to map the schema to Cassandra's storage model.  Specifically, the five columns declared above will result in two columns when an insert statement is run.  Here's how Cassandra maps the schema to storage.
row_key => userid  (since it is the first key declared)
column_name_1 => when + fishtype + blog  (since when and fishtype are keys, and blog is a value it needs to store)
column_name_1 => when + fishtype + image (since when and fishtype are keys and image is a value it needs to store)
If I can paraphrase the mapping... the first primary key becomes the row key.  Subsequent primary keys comprise the prefix components of the composite column name.  One column is generated for each non-primary key column.   The composite key combines the common prefix (comprising the primary keys) plus the column name for the value.

If we perform the following insert:

cqlsh:hms_data> insert into fishblogs (userid, when, fishtype, blog, image) values ('boneill42', 1343925155443, 'CATFISH', 'Caught the big one.', '632a726f636b73');
cqlsh:hms_data> select * from fishblogs;
 userid    | when                     | fishtype | blog                | image
-----------+--------------------------+----------+---------------------+----------------
 boneill42 | 2012-08-02 12:32:35-0400 |  CATFISH | Caught the big one. | 632a726f636b73

CQL presents this as five columns, but under the hood there are really only two (column name -> value):
[8?.vsCATFISHblog]->[Caught the big one.]
[8?.vsCATFISHimage]->[c*rocks]

In the above, you can see the concatenated keys in the column name.  This is the raw output you receive from a query using Astyanax, when you don't have composite keys in place.  This the abbreviated code:
ColumnFamily columnFamily = new ColumnFamily(columnFamilyName, StringSerializer.get(), StringSerializer.get());
OperationResult> result = this.keyspace.prepareQuery(columnFamily).getKey(rowKey).execute();
result.getResult(); 

In order to have the same view as CQL, we need to map the composite key to a class and map the fields of that class.  For example,
public class FishBlog {
    @Component(ordinal = 0)
    public long when;
    @Component(ordinal = 1)
    public String type;
    @Component(ordinal = 2)
    public String field;
    
    public FishBlog() {
    }
}
What seems a little unnatural about this mapping is the "field" member variable.  Ideally, we'd prefer two member variables: "image" and "blog".  But since the data is spread across multiple columns, each containing the value for a separate field, the class has to be flexible enough to carry either (but not both).  

We can then use the following Astyanax code to retrieve data:
AnnotatedCompositeSerializer entitySerializer = new AnnotatedCompositeSerializer(FishBlog.class);
ColumnFamily columnFamily = new ColumnFamily(columnFamilyName, StringSerializer.get(), entitySerializer); OperationResult> result = this.keyspace.prepareQuery(columnFamily).getKey(rowKey).execute();
return result.getResult();


This results in the following ouput, when accessing the result set, which contains FishBlog's.
fishBlog.when=>[Thu Aug 02 12:32:35 EDT 2012]
fishBlog.type=>[CATFISH]
fishBlog.field=>[image]
fishBlog.value=>[c*rocks]
Clear as mud?  Now, to complete the picture I'll show the same code for Hector.  Here is the fetch:
ColumnFamilyTemplate template = new ThriftColumnFamilyTemplate(this.keyspace, columnFamilyName, new StringSerializer(), new CompositeSerializer());
return template.queryColumns(rowKey);
Then, what may feel a little weird is that the Composite object, which contains the component keys for your object, comes from the column name:
for (Composite columnName : columns.getColumnNames()){
   FishBlog fishBlog = new FishBlog(columnName);
   LOG.debug("fishBlog.when=>[" + new Date(fishBlog.getWhen()) + "]");
   LOG.debug("fishBlog.type=>[" + fishBlog.getType() + "]");
   LOG.debug("fishBlog.field=>[" + fishBlog.getField() + "]");
   LOG.debug("fishBlog.value=>[" + columns.getString(columnName) + "]");

}
As you can see from above, I recommend wrapping the composite in an object that provides easy accessors. (which makes it feel like Astyanax)  Here is my FishBlog object for Hector:
public class FishBlog {
    private Composite composite;
    public FishBlog(Composite composite) {
        this.composite = composite;
    }
    public long getWhen() {
        return composite.get(0, new LongSerializer());
    }
    public String getType() {
        return composite.get(1, new StringSerializer());
    }
    public String getField() {
        return composite.get(2, new StringSerializer());
    }
}
OK -- hopefully that demonstrates how the five columns from the CQL perspective translates into two physical columns that can be accessed via Hector or Astyanax, resulting in two entries in the result set mapped to java objects that you can use to get at the values.  I'll try to get the code for these examples up on github.