All the code examined here below is available to you on this GitHub repository: https://github.com/codesandnotes/cassandra-cql3-pagination
Checkout the Maven-based project in your favorite IDE and run the tests! The project uses the excellent Cassandra Unit, so don’t even need to pollute your local Cassandra DB with unwanted keyspaces and tables.
With that said… Let’s tackle Cassandra pagination!
On any SQL database, paginating a large set of results might be done with something like this:
SELECT * FROM MY_TABLE OFFSET 10 LIMIT 10;
With NoSQL databases though, it can be a little trickier.
With Cassandra 2.0, a very useful automatic paging feature has been added in the Java Driver. It allows you to gradually fetch your query results through Statement‘s setFetchSize():
Statement query = new SimpleStatement("select * from cql3_pagination.simple_uuid_pk"); query.setFetchSize(5); ResultSet resultSet = embeddedCassandra.session().execute(query); List<Row> page1 = new ArrayList<>(); Iterator<Row> iterator1 = resultSet.iterator(); while (resultSet.getAvailableWithoutFetching() > 0) { page1.add(iterator1.next()); }
You can find a full example of automatic paging here.
But what happens if you are implementing a stateless web application (a REST API for example)? Should you really try persisting your ResultSet in a logged user’s session object? What if the user is not even logged?
A token of appreciation
What we want is to fetch a defined results from the database starting from a precise row.
We could memorize the primary key of the last row from the last batch of results, then we query Cassandra to fetch us the next batch of rows starting after that last key. Tough luck, we’re all probably using the Murmur3Partitioner. That means that the rows will be ordered by the hash of their values, not by the values themselves.
Thankfully, CQL brings us a token() function so we can page through our results in a sequential way:
select * from cql3_pagination.simple_string_pk where token(id) > token('lastIdFromPreviousPage') limit 5
The code below, taken from this test class, shows a typical exchange between a client and a server application:
List<Row> page1 = embeddedCassandra.session().execute( QueryBuilder.select().from(CASSANDRA_KEYSPACE, TABLE).limit(5) ).all(); String lastIdInPage1 = page1.get(4).getString("id"); List<Row> page2 = embeddedCassandra.session().execute( "select * from cql3_pagination.simple_string_pk where token(id) > token('" + lastIdInPage1 + "') limit 5" ).all();
Although the “greater than” sign might let you think otherwise, the pages of results will not be sorted by their primary key, even if the primary key is a string or an integer. The token() function allows CQL to follow the sequence of primary keys, not to sort them!
Here’s another example from this test code:
List<Row> page1 = embeddedCassandra.session().execute( QueryBuilder.select().from(CASSANDRA_KEYSPACE, TABLE).limit(5) ).all(); String lastBookInPage1 = page1.get(4).getString("book"); int lastLineInPage1 = page1.get(4).getInt("line"); List<Row> page2 = embeddedCassandra.session().execute( "select * from cql3_pagination.compound_pk where token(book) >= token('" + lastBookInPage1 + "') and line > " + lastLineInPage1 + " limit 5 allow filtering" ).all();
In this example we display stored lines of two famous books. We use two conditions to page our results: the one on the “book” primary key requires the usage of token(), but the one on the “line” column is a cluster key, so it can be used “as-is”.
Furthermore, cluster keys are sorted! In our example we rely on the natural ascending ordering of the “line” column, which allows us to read the book’s lines in the right order. Had we needed to sort in descending order (because we like reading books backwards!) we could have specified the order directly in the table definition using WITH CLUSTERING ORDER BY (line DESC) or by using the ORDER BY clause on our SELECT statement.
What about QueryBuilder?
Yeah, what about it.
So far QueryBuilder proposes a static method that allows you to get the token of a column name, the equivalent of writing “token(book)” for example.
And then you can get the token of the primary key value by… No, wait, you can’t. I mean, if there’s a simple way to obtain a primary key’s token from the Java API, I haven’t found it yet (but if you know of one, please put an end to my misery and post it in the comments below. Or even better, add a test to that GitHub repo!).
Hopefully we are developers! We don’t fear problems, we solve them!
When one uses QueryBuilder‘s eq() or gt() to specify a clause, the API actually creates a matching implementation of a new Clause object which allows the builder to… build its query in CQL.
So I merely implemented my own version of the abstract Clause class, which I called the TokenClause. By overriding the appendTo() method, this Clause implementation simply wraps the CQL token function around both column name and value:
@Override void appendTo(StringBuilder sb, List<Object> variables) { Utils.appendName(QueryBuilder.token(name), sb).append(operation); appendValue(value, sb, variables); } static StringBuilder appendValue(Object value, StringBuilder sb, List<Object> variables) { if (variables == null || !Utils.isSerializable(value)) return Utils.appendValue(value, sb); sb.append("token(?)"); variables.add(value); return sb; }
Now it becomes possible to apply the token() functionality with the QueryBuilder to achieve pagination. Here’s how it can be done for previous our “books” test:
QueryBuilder.select().from(CASSANDRA_KEYSPACE, TABLE).allowFiltering() .where(new TokenClause("book", ">=", lastBookInPage1)) .and(QueryBuilder.gt("line", lastLineInPage1)).limit(5)
This is QueryBuilder‘s equivalent of CQL’s:
select * from cql3_pagination.compound_pk where token(book) >= token('lastBookInPage1') and line > lastLineInPage1 limit 5 allow filtering
And then what?
And then from here you should be able to paginate your results at the clients’ request: as long as they can tell you what the last key or keys were and how many results they want back, Cassandra should be able to make it happen.
Hope you enjoyed this small blog post! Until next time,
Cheers!
3 comments
Emad Heydari Beni
19/02/2016 at 10:24First of all, thanks for your brilliant article about pagination in Cassandra. In my opinion, it’s one of the only ones about this subject.
In your example the compound key is => (book, line)
So book is the partition key and line is the cluster key.
Although your approach with QueryBuilder was nice, it’s also possible to use QueryBuilder entirely. Something like this:
QueryBuilder.select().from(CASSANDRA_KEYSPACE, TABLE).allowFiltering()
.where(QueryBuilder.gte(QueryBuilder.token(book), QueryBuilder.fcall(“token”, lastBookInPage1)))
.and(QueryBuilder.gt(“line”, lastLineInPage1)).limit(5);
Actually the point is in “QueryBuilder.fcall()”. There can be another situation as follows:
Compound key as follows => ((book, chapter), line)
So “book” and “chapter” are our partition keys and line is our cluster key. This is how we do this scenario:
QueryBuilder.select().from(CASSANDRA_KEYSPACE, TABLE).allowFiltering()
.where(QueryBuilder.gte(QueryBuilder.token(book, chapter), QueryBuilder.fcall(“token”, lastBookInPage1, lastChapterInPage1)))
.and(QueryBuilder.gt(“line”, lastLineInPage1)).limit(5);
codesandnotes
19/02/2016 at 10:51Double-kudos Emad,
I totally missed that fcall() method you are using, which forced me to implement my own Clause class. Obviously this is to be avoided now, given that your solution is way simpler and cleaner.
Cheers to you mate!
Dodong Juan
07/03/2016 at 19:57Guys,
I think this query will not work. What will happen is that only the “line” which i greater than lastLineInPage1 will be returned. Any lines < than lastLineInPage1 but in the chapter greater than lastChapterInPage1 will not be return.
Right ?