Azure Table Storage, a revisit

It’s been a while since I used Azure Table Storage (ATS) in anger. Well, kind of. I use it most days for various projects but it’s been about 18 months since I tried performing any bulk loads/querying/benchmarking etc.

The reason for my renewed interest is that a colleague mentioned that as they added more non-indexed parameters to their ATS query, it slowed down in a major fashion. This didn't tally with my previous experience, so I wondered. Yes, if you don't query ATS via the Partition Key and Row Key (i.e. the indexed fields) then it gets a lot slower, but everything is relative, right? It might be a lot slower, but it could still be "quick enough" for your purpose. So, I've coded some very, very basic tests and tinkered.

My dummy data consisted of 2 million rows, split equally over 5 partitions. Each entity consisted of 10 fields (including partition key and row key). The fields (excluding row and partition) were 2 x strings, 2 x ints, 2 x doubles and 2 x dates. Currently my tests only focus on the strings and 1 of the ints (but I’ll be performing more tests in the near future).

The string fields were populated with a random word selected from a pool of 104 words. The ints and doubles were random values between 0 and 1 million. The datetimes were random dates between 1 Jan 1970 and 1 Jan 2016. I repeat, the doubles and dates have not yet been included in any tests.
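
For the curious, the entities looked roughly like the following; a minimal sketch using the classic WindowsAzure.Storage table SDK (the class and field names are illustrative placeholders of mine, not the exact test code):

using System;
using Microsoft.WindowsAzure.Storage.Table;

// Illustrative shape of the benchmark entity: 8 data fields plus PartitionKey/RowKey
// (inherited from TableEntity) gives the 10 fields mentioned above.
public class DummyEntity : TableEntity
{
    public string Field1 { get; set; }     // random word from the word pool
    public string Field2 { get; set; }     // random word from the word pool
    public int Field3 { get; set; }        // random int, 0 to 1,000,000
    public int Field4 { get; set; }
    public double Field5 { get; set; }     // random double, 0 to 1,000,000
    public double Field6 { get; set; }
    public DateTime Field7 { get; set; }   // random date, 1 Jan 1970 to 1 Jan 2016
    public DateTime Field8 { get; set; }
}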

I tested a few different types of queries, starting simple and getting slightly more complex with each change. Firstly there is the query that takes the partition key and a value for one of the string fields. Interestingly, the results were:

Partition key and field1: 4706ms (avg)
Partition key, field1 and field2: 5368ms (avg)
Partition key, field1, field2 and field3: 7232ms (avg)

So, despite only one field (partition key) being indexed, adding the other fields into the search didn't make Azure Table Storage unusable and slow. Yes, going from one extra field to three increased the query time by roughly 50%, but it wasn't the huge difference that my colleague had experienced.

One thing to remember, although I created 2 million rows, these were split over 5 partition keys, so in effect the above queries were *really* only going over 400k rows.
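
For reference, the queries themselves were nothing exotic; something along the lines of the following sketch using TableQuery filters (again, DummyEntity, the field values and the connection string are placeholders of mine):

using System.Linq;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Table;

var account = CloudStorageAccount.Parse(connectionString);   // connection string assumed to exist
var table = account.CreateCloudTableClient().GetTableReference("dummytable");

// Only PartitionKey is indexed; the Field1/Field2 filters force a scan within the partition.
string filter = TableQuery.CombineFilters(
    TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, "partition1"),
    TableOperators.And,
    TableQuery.CombineFilters(
        TableQuery.GenerateFilterCondition("Field1", QueryComparisons.Equal, "someword"),
        TableOperators.And,
        TableQuery.GenerateFilterCondition("Field2", QueryComparisons.Equal, "otherword")));

// ExecuteQuery follows continuation tokens for us, so timing this call covers the full scan.
var results = table.ExecuteQuery(new TableQuery<DummyEntity>().Where(filter)).ToList();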

More tests to follow…..

AzureCopy UI version

Seeing the popularity of AzureCopy increasing (very satisfying I have to admit) I’ve been thinking about writing a few desktop application versions to assist people. These would be “native” applications (not native as in bare metal, but I mean purpose built per OS) and would require a fair bit of design and effort.

The question I keep asking myself is, "Is it worth it?". Migration from one cloud storage provider to another is a rare task. Even then it would usually (I hope) be performed by a fairly technical person, so a command line tool (such as AzureCopy) isn't daunting. So is a UI based application remotely needed by anyone? Chances are I'll write them regardless (I've already started on the Windows version) but I'm honestly wondering if I'm writing them without an audience.

Thoughts?

AzureCopy and Virtual Directories

AzureCopy has finally had some love and has been updated as per some requests that have come in. Firstly, virtual directories in S3 and Azure Blob Storage are now handled in a consistent manner.

Remember, neither S3 nor ABS really has directories. They just use blob names with the '/' character in them, and various tools out on the interweb use that to simulate directories.
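
To illustrate what "virtual" means here, a rough sketch on the Azure side using the WindowsAzure.Storage SDK (container, blob name and connection string are made up; the same idea applies to S3): the "directory" only exists because the blob name contains slashes and the listing call groups on them.

using System;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Blob;

var account = CloudStorageAccount.Parse(connectionString);   // connection string assumed to exist
var container = account.CreateCloudBlobClient().GetContainerReference("temp");

// The "directory" is nothing more than a '/' inside the blob name.
container.GetBlockBlobReference("dir1/subdir1/file1").UploadText("hello");

// A flat listing returns the single blob, slashes and all...
foreach (var item in container.ListBlobs(null, useFlatBlobListing: true))
    Console.WriteLine(((CloudBlockBlob)item).Name);          // dir1/subdir1/file1

// ...while a hierarchical listing fakes a directory level from the prefix.
foreach (var item in container.ListBlobs())
    Console.WriteLine(item.GetType().Name);                  // CloudBlobDirectory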

Now, copying files between S3 and ABS has always been easy, but what if you want to utilise these virtual directories?

e.g. I have a blob on S3 called "dir1/subdir1/file1" and I want to copy that to Azure (or elsewhere for that matter). But I want the destination on Azure to be my temp container and the resulting blob to just be called "subdir1/file1".

In this example we're pretending to copy a subdirectory and its file from S3 to Azure. Remember, there is no spoon directory.

Now we can perform the command:

azurecopy.exe -i https://s3.amazonaws.com/mybucket/dir1/ -o https://myacct.blob.core.windows.net/temp/

The result is that in my Azure container (temp) I'll have a blob called "subdir1/file1".

In addition, you can now copy these blobs with virtual directories to and from Dropbox but in this case it will make/read REAL directories.

AzureCopy is available via NuGet, as a command line executable, and as source.

Dropbox and direct links

During some refactoring of AzureCopy I've decided to finally add Azure CopyBlob support for Dropbox. This means you can run a command locally to copy from Dropbox to Azure Blob Storage and none of the traffic actually goes through the machine where AzureCopy is running: huge bandwidth/speed savings!

The catch is that it appears (I've NOT fully confirmed this yet) that Azure CopyBlob doesn't like redirection URLs, which is what I was receiving from Dropbox. I was generating a "shared" URL for a particular Dropbox file, which in turn generates an HTTP 302 redirection and then gives me the real URL. Azure CopyBlob doesn't play friendly with this. The trick is to NOT generate a "shared" URL but to generate a "media" URL. Quoting from the Dropbox API documentation: "Similar to /shares. The difference is that this bypasses the Dropbox webserver, used to provide a preview of the file, so that you can effectively stream the contents of your media."

Once I made that change, hey presto, no more redirects and Azure CopyBlob is now a happy little ummm “thing”.
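
On the Azure side the copy itself is just the blob service pulling from a publicly reachable URL. A rough sketch of what that looks like with the WindowsAzure.Storage SDK (mediaUrl stands in for the direct link already obtained from Dropbox's /media endpoint, the container/blob names are mine, and depending on your SDK version the call is StartCopy or the older StartCopyFromBlob):

using System;
using System.Threading;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Blob;

var account = CloudStorageAccount.Parse(connectionString);   // connection string assumed to exist
var container = account.CreateCloudBlobClient().GetContainerReference("temp");
var target = container.GetBlockBlobReference("myfile");

// Kick off a server side copy: the bytes flow Dropbox -> Azure, never through this machine.
target.StartCopy(new Uri(mediaUrl));                         // mediaUrl assumed from Dropbox /media

// The copy runs asynchronously on the service, so poll until it finishes.
target.FetchAttributes();
while (target.CopyState.Status == CopyStatus.Pending)
{
    Thread.Sleep(500);
    target.FetchAttributes();
}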

The upshot is that I can now migrate a tonne of data from Dropbox to Azure without using up any of my own bandwidth.

woohoo :)

DocumentDB, Node.js, CoffeeScript and Hubot

For anyone that doesn't already know, Hubot is GitHub's ever-present "bot" that can be customized to respond to all sorts of commands on a number of different messaging platforms. From what I understand (I don't work at GitHub, so I'm just going by what I've read) it is used for build/deploy to production (and all other environments), determining employee locations (distributed teams) and a million other things. Fortunately GitHub has made Hubot open source and anyone can download and integrate it into Skype, HipChat, Campfire, Slack etc. I've decided to have a crack at integrating it into my workplace, specifically against the Slack messaging system.

I utterly love it.
During a 24 hour "hackday", I integrated it into Slack (see details) and grabbed a number of pre-existing scripts to start me off. Some obvious ones (for a dev team) are TeamCity integration, statistics and statuses of various 3rd party services that we use, and information retrieval from our own production system. This last one will be particularly useful for support, giving us an easy way to retrieve information about a customer without having to write up new UIs for every change we make. *very* dev friendly :)

One thing I've been tinkering with is having Hubot communicate directly with the Azure DocumentDB service. Although I've only put the proverbial toe in the water, I see LOTS of potential here. Hubot is able to run anywhere (behind a corporate firewall, out on an Azure Website or anywhere in between). Having it access DocumentDB (which can be reached from anywhere with a net connection) means that we do not need to modify production services/firewalls etc for Hubot to work. Hubot can then perform these queries and get the statistics/details with ease. This (to me) is a big win: I can provide a useful information retrieval system without having to modify our existing production platform.

Fortunately the DocumentDB team have provided a nice Node.js npm package to install (see here for some examples). This made things trivially easy to do. The only recommendation I'd make is that for tools/services/hubots that are read-only, just use the read-only DocumentDB key which is available on the Azure Portal. I honestly didn't realise that read-only keys were available until I recently did some snooping about, and although I'm always confident in my code, having a read-only key just gives me a safety net against production data.

Oh yes, CoffeeScript. I'm not a JavaScript person (I'm backend as much as possible, C# these days) and Hubot's default language is CoffeeScript. So first I had to deal with JS and THEN deal with CoffeeScript. Yes, this last part is just my personal failing (kicking and screaming into the JS era).

An example of performing a query against DocumentDB from Node.js (in CoffeeScript) follows. First you need to get a database reference, then a collection reference (from the DB), then perform the real query you want.

DocumentClient = require("documentdb").DocumentClient

# The read-only key is enough here since we only ever query.
client = new DocumentClient(process.env.HUBOT_DOCUMENTDB_ENDPOINT, { masterKey: process.env.HUBOT_DOCUMENTDB_READONLY_KEY })

# Look the database up by id and hand it to the callback.
GetDatabase = (client, callback) ->
  dbQuery = { query: 'SELECT * FROM root r WHERE r.id="mydatabase"' }
  client.queryDatabases(dbQuery).toArray (err, results) ->
    if !err and results.length > 0
      callback(results[0])

# Look the collection up within the database and hand it to the callback.
GetCollection = (client, databaseLink, callback) ->
  collectionQuery = { query: 'SELECT * FROM root r WHERE r.id="mycollection"' }
  client.queryCollections(databaseLink, collectionQuery).toArray (err, results) ->
    if !err and results.length > 0
      callback(results[0])

# Database reference -> collection reference -> the actual document query.
GetDatabase client, (database) ->
  GetCollection client, database._self, (collection) ->
    client.queryDocuments(collection._self, "select * from docs d where d.id = 'testid'").toArray (err, res) ->
      if res and res.length > 0
        console.log(res[0].SomeData)

Given CoffeeScript is white space sensitive and my blog editor doesn’t appear to allow me to format the code *exactly* how I need to, I’m hoping readers will be able to deduce where the white space is off.

End result is Hubot, Node.js and DocumentDB are really easy to integrate together. Thanks for a great service/library Microsoft!

DocumentDB Stored Procedure vs User Defined Function follow-up

It was pointed out to me that in my previous post my SProcs and UDFs were performing calls to toLowerCase() and that it could be slowing down my results. After a bit of modifying the documents in my DocumentDB collection I now have two properties: one called "Message", which is the original text, and one called "MessageLower" (can you guess?). Now no toLowerCase() needs to be called when doing comparisons. The change in execution time is minuscule. For the SProc (which was fine performance-wise) the average execution time went down about 3-4ms (still around 130-140ms). Now maybe for some systems 3-4ms is critical, but for the system I'm working on, doubling the size of the documents to get 3-4ms off isn't a worthy trade-off.
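
For context, the change was just writing the lowercased text alongside the original when the document is created; something like the following sketch (the exact document shape is my assumption, with MessageEntity borrowed from the earlier query code and originalText standing in for the source string):

// The lowercased copy is written once at insert time so the SProc/UDF can compare
// without calling toLowerCase() per document.
public class MessageEntity
{
    public string Message { get; set; }
    public string MessageLower { get; set; }
}

var doc = new MessageEntity
{
    Message = originalText,                         // originalText assumed to exist
    MessageLower = originalText.ToLowerInvariant()
};
var t = DocClient.CreateDocumentAsync(collection.SelfLink, doc);
t.Wait();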

I’m definitely glad it was suggested (was on my todo list regardless) and that I tested it. At least now I know that in the big scheme of things, toLowerCase() doesn’t have *that* much of an impact.

DocumentDB Stored Procedures and User Defined Functions

Ok, time for confession. I'm not a fan of stored procedures (SProcs) within databases. I try not to be a hammer coder (one tool/language for all problems) but it's more the fact that I personally find SQL SProcs to be harder to read/develop and debug. Give me a nice C# source base any day. Now, I DO realise the proposed benefits of SProcs: do all data calculations on the server BEFORE they get returned to the application. Fewer bytes on the wire, quicker to transmit etc. BUT… when the DB and the application are co-located (both within the same Azure location) do we REALLY need to worry about data transfer between DB and app? Well, maybe. Depending on whether we're talking about HUGE volumes of traffic or not. For now, I'm assuming not.

I’m not the NSA…

(or am I?) :P

Once I learned that DocumentDB was also introducing SProcs, I was VERY concerned that again I would get involved with a source base that has a huge volume of SProcs that would be hard to debug/deploy.

Remember, I'm highly biased AGAINST SProcs, but all the measuring/testing I'll be doing for this blog post will be as unbiased as possible.

The simple scenario I'm looking at is searching a collection of documents for a particular term within a particular property (just to keep it easy). Each of these properties consists of 100 randomly selected words.

All of these tests are based on the compute and docdb being co-located in the same geo region.

So firstly, what does the SProc look like?

function ( searchTerm) {
    var context = getContext();
    var collection = context.getCollection();
    var response = context.getResponse();

    var returnMessages = [];

    var lowerSearchTerm;
    if (searchTerm) {
        lowerSearchTerm = searchTerm.toLowerCase();
    }
    
    GetDocuments(callback);

    function GetDocuments(callback) {
        var qry = 'SELECT c.Message FROM root c'
        var done = collection.queryDocuments(collection.getSelfLink(), qry, { pageSize: 1000 }, callback);
    }

    function callback(err, documents, responseOptions) {
        var messages = documents;

        for (var i = 0; i < messages.length; i++) {
            var message = messages[i];
            if ( message.Message && message.Message.toLowerCase().indexOf(lowerSearchTerm) > -1) {
                returnMessages.push( message);
            }
        }
          
        response.setBody(JSON.stringify(returnMessages));
    }
}

This is fairly straightforward: get all the documents, throw them to the callback, search all the messages/documents by converting to lowercase, then perform an indexOf. Simple and straightforward.

Now, my initial test data consisted of 1000 documents, 10 of which had my magical search term. The results were:

394ms
127ms
148ms
143ms
117ms

The initial query was ALWAYS far longer… I assume something is warming up/compiling/caching etc, but I thought I'd include it in the results anyway.

Ok, so 1000 docs, searching for my term, about 117-148ms for the most part. Cool, I can live with that.

Now, what about User Defined Functions? Firstly, in case you don't know what UDFs are, they're basically a snippet of JavaScript which performs some functionality on a single record (to my knowledge). This JavaScript can be called using the SQL syntax when querying DocumentDB. In my case I needed to write a small UDF to search for substrings within the Message property, so the JavaScript was:

function(input, searchTerm) {
    return input.Message.toLowerCase().indexOf( searchTerm ) > -1;
}

There are 2 ways to add UDFs and SProcs; just as an example, the way I initially added the above UDF was through code (as opposed to using a tool such as the very useful DocumentDB Studio).

private void SetupUDF(DocumentCollection collection)
{

    UserDefinedFunction function = new UserDefinedFunction()
    {
        Id = "SubStringSearch",
        Body = @"function(input, searchTerm) 
            {
                return input.Message.toLowerCase().indexOf( searchTerm ) > -1;
            }"
    };

    var t = DocClient.CreateUserDefinedFunctionAsync(collection.SelfLink, function);
    t.Wait();
}

Once SetupUDF is called, we're able to use the function "SubStringSearch" via the SQL syntax.

var queryString = "SELECT r.Message FROM root r WHERE SubStringSearch( r, 'mysearchterm')";      
var results = DocClient.CreateDocumentQuery<MessageEntity>(collection.SelfLink, queryString);

Hey presto… we now have substring searching available via the SQL syntax (of course, when the DocumentDB team adds a "like" type of operator, this will not be needed). So, how did it perform?

I really had high hopes for this in comparison to the Stored Procedure. My understanding is that the SProc and UDF are “compiled” in some fashion behind the scenes and aren’t interpreted at query time. I also thought that since the UDF is called within a SQL statement which is completely run on the DocumentDB storage servers then the performance would be comparable to the SProc. I was wrong. Really wrong.

The results for the same set of documents were:

527ms
476ms
485ms
464ms
425ms

That’s 3-4 times worse than the SProc. Not what I hoped nor wanted. I’ve double checked the code, but alas the code is so small that I doubt even *I* could mess it up. I think.

So what about larger data sets? Instead of searching 1000 documents for a term that only appears in 10, what about 7500? (Or more precisely 7505, since that's when I got bored waiting for the random doc generator to finish.)

It’s worse.

The SProc got:

420ms
101ms
139ms
106ms
137ms

Which is comparable to the results it previously got. But the UDF seems to scale linearly…  it got:

3132ms
3163ms
3259ms
15369ms
17832ms

Ignoring those last 2 entries for a moment, it looks like if I increase the document collection by 7.5 times (1000 to 7505) then my times also appear to increase by a similar factor. This was definitely unexpected.

Now, those last 2 entries are interesting. I've only shown one of my test runs here, but with virtually every test run performed I'd end up with one super large query time. This was due to a RequestRateTooLargeException being thrown and the LINQ provider retrying the request. Why would the UDF method be getting this when the SProc apparently does not, even though the SProc executes the query "select c.Message from root c" (i.e. gets EVERY document)?
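
As an aside, if you want to see or handle those throttles yourself rather than relying on the LINQ provider's retries, the exception carries a suggested back-off. A rough sketch (the helper name is mine, and status code 429 is the "request rate too large" response):

using System;
using System.Net;
using System.Threading;
using Microsoft.Azure.Documents;

// Retries an operation whenever DocumentDB responds with 429 (RequestRateTooLarge),
// sleeping for the back-off the service suggests via RetryAfter.
private T ExecuteWithRetry<T>(Func<T> operation)
{
    while (true)
    {
        try
        {
            return operation();
        }
        catch (DocumentClientException ex)
        {
            if (ex.StatusCode != (HttpStatusCode)429)
                throw;
            Thread.Sleep(ex.RetryAfter);
        }
    }
}

// e.g. ExecuteWithRetry(() => DocClient.CreateDocumentQuery<MessageEntity>(collection.SelfLink, queryString).ToList());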

So it looks like UDFs are slower and do not scale. Also one fact I discovered is that you can only call a single UDF per SQL query, but I’m guessing this is just an artificial limitation the DocumentDB team has enforced until the technology becomes more mature.

It is a disappointment that UDFs are not as quick as (or even comparable to) the SProcs, but I'm not giving up hope yet. If SProcs can be that quick, then (to my simplistic mind) I can't see why UDFs couldn't be nearly as quick in the future.

As a closing note, while trawling through the Fiddler traces from these tests I discovered some scary facts relating to UDFs and the linear performance. When I executed the SProcs for testing I was getting Request Charges of:

628.82
627.18
629.06
617.53
629.18

But for the UDF approach the Request Charges were:

6911.24
6926.76
6913
7047.82
6903.06

I have not investigated the charges any further, but it is certainly on my to-do list.
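
If you'd rather not trawl Fiddler for the x-ms-request-charge header, the .NET SDK surfaces the charge on its response types. A small sketch (DocClient, collection and queryString are from the earlier snippets; sprocLink is an assumption of mine, standing in for the self-link you get back when creating or reading the stored procedure):

using System;
using Microsoft.Azure.Documents.Client;
using Microsoft.Azure.Documents.Linq;

// Request charge for a stored procedure execution.
StoredProcedureResponse<string> sprocResponse =
    DocClient.ExecuteStoredProcedureAsync<string>(sprocLink, "mysearchterm").Result;
Console.WriteLine("SProc RU charge: " + sprocResponse.RequestCharge);

// Request charge for the UDF query, summed over each page of results.
var docQuery = DocClient.CreateDocumentQuery<MessageEntity>(collection.SelfLink, queryString)
                        .AsDocumentQuery();
double totalCharge = 0;
while (docQuery.HasMoreResults)
{
    FeedResponse<MessageEntity> page = docQuery.ExecuteNextAsync<MessageEntity>().Result;
    totalCharge += page.RequestCharge;
}
Console.WriteLine("UDF query RU charge: " + totalCharge);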

Conclusions to all of this? As much as I dislike SProcs in general (and business logic being able to creep into the datastore layer) I think I’ll have to continue using them.

DocumentDB is still my favourite storage option for my various projects (more features than Azure Table Storage but not as huge/crazy as Azure SQL Database). It has its limitations, but the service/platform is young.

I’m definitely going to be re-running my tests to keep an eye on the UDF performance. One day, maybe UDF will be good enough that I can say goodbye to SProcs forever (cue party music…   fade out)