DocumentDB Stored Procedure vs User Defined Function follow up.

It was pointed out to me that in my previous post my SProcs and UDFs were performing calls to toLowerCase() and that this could be slowing down my results. After a bit of modifying the documents in my DocumentDB collection I now have two properties: one called “Message”, which is the original text, and one called “MessageLower” (can you guess?). Now no toLowerCase() needs to be called when doing comparisons. The change in execution time is minuscule. For the SProc (which was fine performance-wise) the average execution time went down about 3-4ms (still around 130-140ms). Now, maybe for some systems 3-4ms is critical, but for the system I’m working on, doubling the size of the documents to shave off 3-4ms isn’t a worthy trade-off.
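For reference, the write-time normalisation is a one-liner; a minimal sketch in plain JavaScript (the prepareForStorage helper name is mine, not anything from the SDK):

```javascript
// Precompute the lowercase copy once, when the document is written,
// so queries can compare without calling toLowerCase() per document.
function prepareForStorage(doc) {
    if (doc.Message) {
        doc.MessageLower = doc.Message.toLowerCase();
    }
    return doc;
}

var stored = prepareForStorage({ Message: "Hello DocumentDB World" });

// The search term now only needs lowercasing once, not once per
// document scanned.
var term = "DocumentDB".toLowerCase();
var found = stored.MessageLower.indexOf(term) > -1;
```

The trade-off is exactly the one above: every document carries both copies of the text, in exchange for skipping one function call per comparison.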

I’m definitely glad it was suggested (was on my todo list regardless) and that I tested it. At least now I know that in the big scheme of things, toLowerCase() doesn’t have *that* much of an impact.

DocumentDB Stored Procedures and User Defined Functions

Ok, time for a confession. I’m not a fan of stored procedures (SProcs) within databases. I try not to be a hammer coder (one tool/language for all problems) but it’s more that I personally find SQL SProcs to be harder to read, develop and debug. Give me a nice C# code base any day. Now, I DO realise the proposed benefits of SProcs: do all data calculations on the server BEFORE the results get returned to the application. Fewer bytes on the wire, quicker to transmit, etc. BUT… when the DB and the application are co-located (both within the same Azure location) do we REALLY need to worry about data transfer between DB and app? Well, maybe, depending on whether we’re talking about HUGE volumes of traffic or not. For now, I’m assuming not.

I’m not the NSA…

(or am I?)

Once I learned that DocumentDB was also introducing SProcs, I was VERY concerned that I would again end up involved with a code base that has a huge volume of SProcs which are hard to debug and deploy.

Remember, I’m highly biased AGAINST SProcs, but all the measuring/testing I do for this blog post will be as unbiased as possible.

The simple scenario I’m looking at is searching a collection of documents for a particular term within a particular property (just to keep it easy). Each of these properties consists of 100 randomly selected words.

All of these tests are based on the compute and DocumentDB being co-located in the same geo region.

So firstly, what does the SProc look like?

function (searchTerm) {
    var context = getContext();
    var collection = context.getCollection();
    var response = context.getResponse();

    var returnMessages = [];

    var lowerSearchTerm;
    if (searchTerm) {
        lowerSearchTerm = searchTerm.toLowerCase();
    }

    function GetDocuments(callback) {
        var qry = 'SELECT c.Message FROM root c';
        collection.queryDocuments(collection.getSelfLink(), qry, { pageSize: 1000 }, callback);
    }

    function callback(err, documents, responseOptions) {
        var messages = documents;

        for (var i = 0; i < messages.length; i++) {
            var message = messages[i];
            if (message.Message && message.Message.toLowerCase().indexOf(lowerSearchTerm) > -1) {
                returnMessages.push(message);
            }
        }

        response.setBody(returnMessages);
    }

    GetDocuments(callback);
}
This is fairly straightforward: get all the documents, pass them to the callback, search all the messages/documents by converting each to lowercase, then perform an indexOf.
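One caveat worth noting: queryDocuments returns results a page at a time, so with a pageSize of 1000 the callback above only ever sees the first page. To scan a bigger collection you’d chase the continuation token from responseOptions. A rough sketch of that loop, with a stubbed pagedQuery standing in for collection.queryDocuments (which only exists server-side):

```javascript
// Drain every page of a paged query by chasing the continuation token.
// pagedQuery(continuation, callback) stands in for collection.queryDocuments;
// it calls back with (err, documents, responseOptions) like the real API.
function drainPages(pagedQuery, onDone) {
    var all = [];
    function next(continuation) {
        pagedQuery(continuation, function (err, documents, responseOptions) {
            if (err) { onDone(err); return; }
            all = all.concat(documents);
            if (responseOptions && responseOptions.continuation) {
                next(responseOptions.continuation); // more pages to fetch
            } else {
                onDone(null, all); // no token means we've seen everything
            }
        });
    }
    next(undefined);
}

// Stub: three "pages" of documents; the continuation token here is
// just the next page index (a real token is an opaque string).
function fakeQuery(continuation, callback) {
    var pages = [[1, 2], [3, 4], [5]];
    var idx = continuation || 0;
    var more = idx + 1 < pages.length ? idx + 1 : null;
    callback(null, pages[idx], { continuation: more });
}

drainPages(fakeQuery, function (err, docs) {
    // docs is [1, 2, 3, 4, 5]
});
```

With only 10 matches in the first 1000 documents my test happened to work either way, but at 7505 documents the distinction starts to matter.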

Now, my initial test data consisted of 1000 documents, 10 of which had my magical search term. The results were:

(results screenshot missing)

The initial query was ALWAYS far longer… I’m assuming something is warming up/compiling/caching etc., but I thought I’d include it in the results anyway.

Ok, so 1000 docs, searching for my term, about 117-148ms for the most part. Cool, I can live with that.

Now, what about User Defined Functions? Firstly, in case you don’t know what UDFs are: they’re basically a snippet of Javascript which performs some functionality over a single record (to my knowledge). This Javascript can be called using the SQL syntax when querying DocumentDB. In my case I needed to write a small UDF to search for substrings within the Message property, so the Javascript was:

function(input, searchTerm) {
    return input.Message.toLowerCase().indexOf( searchTerm ) > -1;
}

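Nothing DocumentDB-specific is needed to sanity-check that predicate; applied to a few in-memory documents it behaves the way the WHERE clause will:

```javascript
// The UDF body, as a standalone predicate.
function subStringSearch(input, searchTerm) {
    return input.Message.toLowerCase().indexOf(searchTerm) > -1;
}

var docs = [
    { Message: "The quick brown fox" },
    { Message: "Lorem ipsum dolor" },
    { Message: "A QUICK test" }
];

// Mimics what the SQL WHERE clause does: run the UDF once per document.
// Note the search term is passed in already lowercased.
var matches = docs.filter(function (d) { return subStringSearch(d, "quick"); });
// matches contains the first and third documents
```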
There are two ways to add UDFs and SProcs. Just as an example, the way I initially added the above UDF was through code (as opposed to using a tool such as the very useful DocumentDB Studio).

private void SetupUDF(DocumentCollection collection)
{
    UserDefinedFunction function = new UserDefinedFunction()
    {
        Id = "SubStringSearch",
        Body = @"function(input, searchTerm) {
                    return input.Message.toLowerCase().indexOf( searchTerm ) > -1;
                }"
    };

    var t = DocClient.CreateUserDefinedFunctionAsync(collection.SelfLink, function);
    t.Wait(); // make sure the UDF exists before we query with it
}

Once SetupUDF is called, then we’re able to use the function “SubStringSearch” via the SQL syntax.

var queryString = "SELECT r.Message FROM root r WHERE SubStringSearch( r, 'mysearchterm')";      
var results = DocClient.CreateDocumentQuery<MessageEntity>(collection.SelfLink, queryString);

Hey presto… we now have substring searching available via the SQL syntax (of course, when the DocumentDB team adds a “like” type of operator, this will no longer be needed). So, how did it perform?

I really had high hopes for this in comparison to the Stored Procedure. My understanding is that the SProc and UDF are “compiled” in some fashion behind the scenes and aren’t interpreted at query time. I also thought that since the UDF is called within a SQL statement which is completely run on the DocumentDB storage servers then the performance would be comparable to the SProc. I was wrong. Really wrong.

The results for the same set of documents were:

(results screenshot missing)

That’s 3-4 times worse than the SProc. Not what I hoped nor wanted. I’ve double checked the code, but alas the code is so small that I doubt even *I* could mess it up. I think.

So what about larger data sets? Instead of searching 1000 documents for a term that only appears in 10, what about 7500? (Or more precisely 7505, since that’s when I got bored waiting for the random doc generator to finish.)

It’s worse.

The SProc got:

(results screenshot missing)

Which is comparable to the results it previously got. But the UDF seems to scale linearly…  it got:

(results screenshot missing)

Ignoring those last 2 entries for a moment, it looks like if I increase the document collection by 7.5 times (1000 to 7505) then my times also appear to increase by a similar factor. This was definitely unexpected.

Now, those last 2 entries are interesting. I’ve only shown one of my test runs here, but with virtually every test run performed I’d end up with one super large query time. This was due to a RequestRateTooLargeException being thrown and the LINQ provider retrying the request. Why would the UDF method be getting this when the SProc apparently does not, even though the SProc executes the query “SELECT c.Message FROM root c” (ie gets EVERY document)?
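For what it’s worth, the retry the LINQ provider performs can be sketched in plain terms; everything below (flakyQuery, the 429 code property, the attempt count) is simulated for illustration, not the SDK’s actual surface:

```javascript
// Sketch of retry-on-throttle: a 429 (RequestRateTooLarge) response is
// retried until it succeeds or we run out of attempts. A real client
// would also wait for the server's retry-after hint between attempts.
function withRetries(operation, maxAttempts) {
    var attempt = 0;
    function tryOnce() {
        attempt++;
        try {
            return operation();
        } catch (e) {
            if (e.code === 429 && attempt < maxAttempts) {
                return tryOnce(); // throttled: try again
            }
            throw e; // non-throttle error, or out of attempts
        }
    }
    return tryOnce();
}

// Simulated operation: throttled twice, then succeeds.
var calls = 0;
function flakyQuery() {
    calls++;
    if (calls < 3) {
        var err = new Error("RequestRateTooLarge");
        err.code = 429;
        throw err;
    }
    return "ok";
}

var result = withRetries(flakyQuery, 5); // "ok" after 3 attempts
```

The retries are invisible to the caller, which is exactly why they show up only as one suspiciously long query time in the results.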

So it looks like UDFs are slower and do not scale. Also one fact I discovered is that you can only call a single UDF per SQL query, but I’m guessing this is just an artificial limitation the DocumentDB team has enforced until the technology becomes more mature.

It is a disappointment that UDFs are not as quick (or even comparable) to the SProcs, but I’m not giving up hope yet. If SProcs can be that quick, then (to my simplistic mind) I can’t see why UDFs couldn’t be nearly as quick in the future.

As a closing note: while trawling through the Fiddler traces from these tests, I discovered some scary facts relating to UDFs and the linear performance. When I executed the SProcs for testing I was getting Request Charges of:

(request charge screenshot missing)

But for the UDF approach the Request Charges were:

(request charge screenshot missing)

I have not investigated the charges further, but it is certainly on my to-do list.

Conclusions to all of this? As much as I dislike SProcs in general (and business logic being able to creep into the datastore layer) I think I’ll have to continue using them.

DocumentDB is still my favourite storage option for my various projects (more features than Azure Table Storage but not as huge/crazy as Azure Database). It has its limitations, but the service/platform is young.

I’m definitely going to be re-running my tests to keep an eye on the UDF performance. One day, maybe UDF will be good enough that I can say goodbye to SProcs forever (cue party music…   fade out)

Azure DocumentDB performance thoughts

Updated: Typos and clarifying collections.

I’ve been developing against Azure DocumentDB storage for over 6 months now and have to say, overall I’m impressed. It gives me more than Azure Table storage (great key/value lookup, but no searching via other properties) but isn’t the 800-pound gorilla that Azure Database is. For me it sits nicely between the two, giving me easy development/deployment but also letting me index which fields I like (admittedly I’m sticking with the default of “all”) and query against them.

Now, my development hasn’t just been idle curiosity with a bit of tinkering here and there; it’s a commercial application that is out in the wild (albeit in beta). It is critical that language support, tooling, performance and documentation quality are all up to scratch. For the most part they have been; I’m personally very happy with it and will push for us to continue using it where appropriate.

Initially DocumentDB was NOT available in the region where my Azure Web Roles/VMs were running (during development we had Web Roles running out of Singapore but DocumentDB out of west-us). This was fine for development purposes, but there was a niggling concern of *when* DocumentDB would appear in Singapore. Well, finally it did, and the performance “felt” like it improved.

Felt…  tricky word. I swear sometimes when I tinker with my machine it “feels” faster…  but it’s probably just mind over matter. (Personally I’d love to be involved in some medical trial where I end up with a placebo. I swear it would cure me of virtually anything… or at least I feel it would.)

Ahem, I digress. So it “felt” faster once DocumentDB appeared in Singapore, but I know others didn’t really notice any difference. Admittedly there are LOTS of moving parts in the application and DocumentDB is just one small cog in a big machine. Maybe I was biased, maybe I was the only one paying attention, maybe I was fooling myself? Time to crank out Visual Studio and see what lies, statistics and benchmarks would tell me.

One of our development accounts had enough data to make it mostly realistic (ie not just a tiny sample of data, which wouldn’t prove anything). But that was sitting in west-us…   so the benchmarks I took were slightly the reverse of what production looks like.

In production we have the VM/WebRole and DocumentDB in Singapore, whereas previously we had the VM/WebRole in Singapore and DocumentDB in west-us. For the purposes of my benchmarking I’ve kept the DocumentDB in west-us (test data) and have 2 VMs set up to do the testing: one in west-us and one in Singapore.

First, some notes about the setup. Originally we had 4 collections set up within a given DocumentDB account (for an explanation of a collection, see here). The query was through the LINQ provider (using SQL syntax) with a couple of simple where conditions (company = x and userid = y type of thing). Very simple, very straightforward. The query was also only executed against one of the collections; the other collections had data but were not relevant for this query.

So, what did I find?

When the test was run on a VM in Singapore against DocumentDB in west-us, the runtime results were:

(individual run timings screenshot missing)

Giving an average of 3915ms

Whereas running the same test in west-us resulted in:

(individual run timings screenshot missing)

With an average of 485ms.

That’s an improvement of 88%. This really shouldn’t be a surprise; the Pacific Ocean is a tad large. I bet all those packets got very soggy and slowed down.

Another change I’ve been working on is merging our 4 collections into a single collection. The DocumentDB team has stressed that collections are not tables. Regardless, when we set up our collections originally we did treat them as if they were tables, ie a single type of entity would be stored in a single collection. Although I’ll eventually end up with just the single combined collection, during these tests all 5 collections co-existed within the same DocumentDB account.

I’ve been modifying/copying the data from the 4 collections into a single “uber collection”, which really is the way it should have been done in the first place. My only real source of confusion was: when querying this combined collection, how do we know what to deserialize the response objects as?

ie if I perform a query and get a mix of results (class A and class B), how do I deal with it? This really was an artificial problem. The reality is that my queries didn’t change (much). If I was originally querying collection 1, I’d always get back results deserialized as a list of class A objects; doing the same query against the combined collection should still give the same results. The only change I made to the objects (and the query) was that to each Document stored in this combined collection I added a “DocType” property, assigned some number (really an enum). This way I could modify my query to be something like:   “….. original query…..  AND e.DocType=1”   etc.

This just gave me a little peace of mind that my queries would only return a single Document Type and that I wouldn’t have to “worry my pretty little head” over some deserialization trickery later on.
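Mimicking the DocType discriminator in memory makes the idea concrete; the enum values and document shapes here are made up for illustration:

```javascript
// Hypothetical doc-type enum; the real values are whatever the app assigns.
var DocType = { MESSAGE: 1, USER: 2 };

// A combined collection holds a mix of entity types side by side.
var combined = [
    { DocType: DocType.MESSAGE, Message: "hello" },
    { DocType: DocType.USER, Name: "fred" },
    { DocType: DocType.MESSAGE, Message: "world" }
];

// Equivalent of appending "AND e.DocType=1" to the query: only one
// entity type comes back, so deserialization stays unambiguous.
var messages = combined.filter(function (e) {
    return e.DocType === DocType.MESSAGE;
});
// messages holds only the two Message documents
```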

So… what happened? Is a combined collection better or worse performance wise? A resounding BETTER is the answer. For the *exact* same data (just copying the documents from the 4 collections into the combined collection) and adding the DocType property I got the following results:

WebRole in Singapore with DocumentDB in west-us:

(individual run timings screenshot missing)

Giving an average of 3609ms. This is an 8% improvement.

For everything in west-us I then got:

(individual run timings screenshot missing)

With the average being 152ms. This is an improvement of 69%!!!!  HOW??? WHY???? (Not that I’m complaining, mind you.) What appears to have happened is that, regardless of compute vs storage location, approximately 300ms has been shaved off the query time. ie the average for compute/storage in different locations went from 3915ms to 3609ms, a difference of 306ms; with compute and storage in the same location the averages went from 485ms to 152ms, a difference of 333ms.
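Those percentages fall straight out of the averages; as a quick sanity check:

```javascript
// Percentage improvement between two average timings (in ms).
function improvement(before, after) {
    return Math.round(((before - after) / before) * 100);
}

var colocating = improvement(3915, 485); // co-locating compute and storage: 88%
var crossRegion = improvement(3915, 3609); // combining collections, Singapore -> west-us: 8%
var sameRegion = improvement(485, 152);  // combining collections, all west-us: 69%

// The absolute saving from combining collections is similar either way:
var diffCross = 3915 - 3609; // 306ms
var diffSame = 485 - 152;    // 333ms
```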

I’ll be asking the DocumentDB production team for any advice/reasoning around this merely to satisfy my own curiosity but hey, not going to look a gift horse in the mouth.

When I get some time I’ll do some more tests to see if the DocType property I added somehow improved the performance. If I added it to the scenario with 4 collections, would it speed things up? I can’t see how, since I’m just using it to filter document entity types, and in the multi-collection test I’m really only querying one of them (which has a single entity type in it). More investigations to follow…..