DocumentDB Stored Procedures and User Defined Functions

Ok, time for confession. I’m not a fan of stored procedures (SProcs) within databases. I try not to be a hammer coder (one tool/language for all problems), but I personally find SQL SProcs harder to read, develop and debug. Give me a nice C# source base any day. Now, I DO realise the proposed benefits of SProcs: do all the data calculations on the server BEFORE they get returned to the application. Fewer bytes on the wire, quicker to transmit etc. BUT… when the DB and the application are co-located (both within the same Azure location) do we REALLY need to worry about data transfer between DB and app? Well, maybe. It depends on whether we’re talking about HUGE volumes of traffic or not. For now, I’m assuming not.

I’m not the NSA…

(or am I?)

Once I learned that DocumentDB was also introducing SProcs, I was VERY concerned that again I would get involved with a source base that has a huge volume of SProcs that would be hard to debug/deploy.

Remember, I’m highly biased AGAINST SProcs, but all the measuring/testing I’ll be doing for this blog post will be as unbiased as possible.

The simple scenario I’m looking at is searching a collection of documents for a particular term within a particular property (just to keep it easy). Each of these properties consists of 100 randomly selected words.

All of these tests are based on the compute and docdb being co-located in the same geo region.

So firstly, what does the SProc look like?

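function (searchTerm) {
    var context = getContext();
    var collection = context.getCollection();
    var response = context.getResponse();

    var returnMessages = [];

    var lowerSearchTerm;
    if (searchTerm) {
        lowerSearchTerm = searchTerm.toLowerCase();
    }

    function GetDocuments(callback) {
        var qry = 'SELECT c.Message FROM root c';
        // queryDocuments returns false if the request was not accepted.
        var done = collection.queryDocuments(collection.getSelfLink(), qry, { pageSize: 1000 }, callback);
        if (!done) {
            throw new Error('queryDocuments was not accepted');
        }
    }

    function callback(err, documents, responseOptions) {
        if (err) {
            throw err;
        }
        var messages = documents;

        for (var i = 0; i < messages.length; i++) {
            var message = messages[i];
            // Case-insensitive substring match on the Message property.
            if (message.Message && message.Message.toLowerCase().indexOf(lowerSearchTerm) > -1) {
                returnMessages.push(message);
            }
        }

        // Assumed ending (not shown in the original listing): return the matches.
        response.setBody(returnMessages);
    }

    // Assumed ending (not shown in the original listing): start the query.
    GetDocuments(callback);
}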

This is fairly straightforward. Get all the documents, hand them to the callback, then search each message by converting it to lowercase and performing an indexOf.
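To make the matching step easy to poke at outside DocumentDB, here is a minimal local sketch of the same filter. The function name `filterMessages` and the sample documents are my own illustrations, not part of the stored procedure or the test data.

```javascript
// Local stand-in for the stored procedure's callback logic.
// `documents` mimics the result of 'SELECT c.Message FROM root c'.
function filterMessages(documents, searchTerm) {
    var lowerSearchTerm = searchTerm ? searchTerm.toLowerCase() : '';
    var returnMessages = [];
    for (var i = 0; i < documents.length; i++) {
        var message = documents[i];
        // Skip documents without a Message property, as the SProc does.
        if (message.Message && message.Message.toLowerCase().indexOf(lowerSearchTerm) > -1) {
            returnMessages.push(message);
        }
    }
    return returnMessages;
}

// Hypothetical sample data, not the documents used in the tests.
var sample = [
    { Message: 'The quick brown Fox' },
    { Message: 'lazy dog' },
    { Other: 'no message here' }
];
console.log(filterMessages(sample, 'FOX')); // matches only the first document
```

Lowercasing both sides is what makes the search case-insensitive, at the cost of touching every character of every message.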

Now, my initial test data consisted of 1000 documents, 10 of which had my magical search term. The results were:


The initial query ALWAYS was far longer… assuming something is warming up/compiling/caching etc, but I thought I’d include it in the results anyway.

Ok, so 1000 docs, searching for my term, about 117-148ms for the most part. Cool, I can live with that.

Now, what about User Defined Functions? Firstly, in case you don’t know what UDFs are, they’re basically a snippet of JavaScript which performs some functionality on a single record (to my knowledge). This JavaScript can be called using the SQL syntax when querying DocumentDB. In my case I needed to write a small UDF to search for substrings within the Message property, so the JavaScript was:

function(input, searchTerm) {
    return input.Message.toLowerCase().indexOf( searchTerm ) > -1;
}

There are 2 ways to add UDFs and SProcs. Just as an example, the way I initially added the above UDF was through code (as opposed to using a tool such as the very useful DocumentDB Studio).

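private void SetupUDF(DocumentCollection collection)
{
    UserDefinedFunction function = new UserDefinedFunction()
    {
        Id = "SubStringSearch",
        // input is the whole document, matching the SubStringSearch( r, ... ) call.
        Body = @"function(input, searchTerm) {
                    return input.Message.toLowerCase().indexOf( searchTerm ) > -1;
                }"
    };

    // Wait for the creation to finish so the UDF exists before it is queried.
    var t = DocClient.CreateUserDefinedFunctionAsync(collection.SelfLink, function);
    t.Wait();
}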

Once SetupUDF is called, then we’re able to use the function “SubStringSearch” via the SQL syntax.

var queryString = "SELECT r.Message FROM root r WHERE SubStringSearch( r, 'mysearchterm')";      
var results = DocClient.CreateDocumentQuery<MessageEntity>(collection.SelfLink, queryString);

Hey presto… we now have substring searching available via the SQL Syntax (of course when the DocumentDB team add a “like” type of operator, then this will not be needed). So, how did it perform?

I really had high hopes for this in comparison to the Stored Procedure. My understanding is that the SProc and UDF are “compiled” in some fashion behind the scenes and aren’t interpreted at query time. I also thought that since the UDF is called within a SQL statement which is completely run on the DocumentDB storage servers then the performance would be comparable to the SProc. I was wrong. Really wrong.

The results for the same set of documents were:


That’s 3-4 times worse than the SProc. Not what I hoped nor wanted. I’ve double checked the code, but alas the code is so small that I doubt even *I* could mess it up. I think.

So what about larger data sets? Instead of searching 1000 documents for a term that only appears in 10, what about 7500? (or more precisely 7505, since that’s when I got bored waiting for the random doc generator to finish)

It’s worse.

The SProc got:


Which is comparable to the results it previously got. But the UDF seems to scale linearly…  it got:


Ignoring those last 2 entries for a moment, it looks like if I increase the document collection by 7.5 times (1000 to 7505) then my times also appear to increase by a similar factor. This was definitely unexpected.

Now, those last 2 entries are interesting. I’ve only shown one of my test runs here, but with virtually every test run performed I’d end up with one super large query time. This was due to a RequestRateTooLargeException being thrown and the LINQ provider retrying the request. Why would the UDF method hit this when the SProc apparently does not, even though the SProc executes the query “select c.Message from root c” (ie gets EVERY document)?
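For anyone wondering what that kind of retry looks like, here is a rough sketch of a throttle-aware retry wrapper. It is NOT the LINQ provider's code: the `code` and `retryAfterMs` fields on the error are hypothetical stand-ins for the HTTP 429 status and the retry-after hint the service sends when the request rate is too large.

```javascript
// Retry an async operation when the service reports throttling (HTTP 429).
// `operation` is any function returning a Promise. Errors are assumed to
// carry `code` (HTTP status) and optionally `retryAfterMs` (hypothetical names).
function withThrottleRetry(operation, maxRetries) {
    var attempt = 0;
    function run() {
        return operation().catch(function (err) {
            if (err.code !== 429 || attempt >= maxRetries) {
                throw err; // not throttling, or out of retries
            }
            attempt++;
            // Prefer the server's hint, otherwise back off a little more each time.
            var delayMs = err.retryAfterMs || 100 * attempt;
            return new Promise(function (resolve) {
                setTimeout(resolve, delayMs);
            }).then(run);
        });
    }
    return run();
}
```

The waiting is exactly why a throttled query shows up as one super large time in the results: the query itself isn't slow, the client is sitting out the back-off period.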

So it looks like UDFs are slower and do not scale. Also one fact I discovered is that you can only call a single UDF per SQL query, but I’m guessing this is just an artificial limitation the DocumentDB team has enforced until the technology becomes more mature.

It is a disappointment that UDFs are not as quick (or even comparable) to the SProcs but I’m not giving up hope yet. If SProcs can be that quick, then (to my simplistic mind) I can’t see why UDF’s couldn’t be nearly as quick in the future.

As a closing note, while trawling through the fiddler traces when performing the tests I discovered some scary facts that relate to UDFs and the linear performance. When I executed the SProcs for testing I was getting Request Charges of:



But for the UDF approach the Request Charges were:


I have not investigated the charges any further, but it is certainly on my to-do list.

Conclusions to all of this? As much as I dislike SProcs in general (and business logic being able to creep into the datastore layer) I think I’ll have to continue using them.

DocumentDB is still my favourite storage option for my various projects (more features than Azure Table Storage but not as huge/crazy as Azure Database). It has its limitations, but the service/platform is young.

I’m definitely going to be re-running my tests to keep an eye on the UDF performance. One day, maybe UDF will be good enough that I can say goodbye to SProcs forever (cue party music…   fade out)


Azure DocumentDB performance thoughts

Updated: Typos and clarifying collections.

I’ve been developing against Azure DocumentDB storage for over 6 months now and have to say, overall I’m impressed. It gives me more than Azure Table storage (great key/value lookup but no searching via other properties) but isn’t the 800-pound gorilla that Azure Database is. For me it sits nicely between the two, giving me easy development/deployment but also letting me choose which fields to index (admittedly I’m sticking with the default of “all”) and query against them.

Now, my development hasn’t just been idle curiosity with a bit of tinkering here and there, but is a commercial application that is out in the wild (although in beta) currently. It is critical that language support, tooling, performance and documentation quality is met. For the most part it has, I’m personally very happy with it and will push for us to continue using it where appropriate.

Initially DocumentDB was NOT available in the region where my Azure Web Roles/VMs were running (during development we had Web Roles running out of Singapore but DocumentDB out of west-us). This was fine for development purposes, but there was a niggling concern: *when* will DocumentDB appear in Singapore? Well, finally it did, and the performance “felt” like it improved.

Felt…  tricky word. I swear sometimes when I tinker with my machine it “feels” faster…  but it’s probably just mind over matter. (Personally I’d love to be involved in some medical trial where I end up with a placebo. I swear it would cure me of virtually anything… or at least I feel it would)

Ahem, I digress. So it “felt” faster once DocumentDB appeared in Singapore, but I know others didn’t really notice any difference. Admittedly there are LOTS of moving parts in the application and DocumentDB is just one small cog in a big machine. Maybe I was biased, maybe I was the only one paying attention, maybe I was fooling myself? Time to crank out Visual Studio and see what lies/statistics and benchmarks will tell me.

One of our development accounts had enough data to make it mostly realistic (ie not just a tiny tiny sample of data which wouldn’t prove anything). But that was sitting in west-us…   so the benchmarks I took were slightly the reverse of what production was.

In production we have the VM/WebRole and DocumentDB in Singapore, whereas previously we had the VM/WebRole in Singapore and DocumentDB in west-us. For the purposes of my benchmarking I’ve kept the DocumentDB in west-us (test data) and have 2 VMs set up to do the testing: one in west-us and one in Singapore.

First, some notes about the setup. Originally we had 4 collections setup with a given DocumentDB account (for explanation of a collection, see here). The query was through the LINQ provider (using SQL syntax) with a couple of simple where conditions (company = x and userid = y type of thing). Very simple, very straight forward. The query was also only executed against one of the collections. The other collections had data but were not relevant for this query.

So, what did I find?

When the test was run on a VM in Singapore against DocumentDB in west-us, the runtime results were:







Giving an average of 3915ms

Whereas running the same test in west-us resulted in:







With an average of 485ms.

That’s an improvement of 88%. This really shouldn’t be a surprise, the Pacific ocean is a tad large. I bet all those packets got very soggy and slowed down.

Another change that I’ve been working on is merging our 4 collections into a single collection. It has been stressed by the DocumentDB team that collections are not tables. Regardless of this, when we set up our collections originally we did make them as if they were tables, ie a single type of entity would be stored in a single collection. Although I’ll eventually end up with just the single combined collection, during these tests all 5 collections co-existed within the same DocumentDB account.

I’ve been modifying/copying the data from the 4 collections to a single “uber collection” which really is the way it should have been done in the first place. My only real source of confusion is when querying this combined collection how do we know what to serialize the response objects as?

ie if I perform a query and I get a mix of results (class A and class B), how do I deal with it? This really was an artificial problem. The reality is that my queries really didn’t change (that much). If I was originally querying collection 1 for results I’d always get back results serialized as a list of Class A objects. If I’m doing the same query against the combined collection I should still get the same results. The only change I did to the objects (and the query) was that in each Document stored in this combined collection I added a “DocType” property which was assigned some number (really enum). This way I could modify my query to be something like:   “….. original query…..  AND e.DocType=1”   etc.
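A minimal sketch of that DocType idea, assuming made-up type names and a made-up base query (only the `e.DocType` property and the “AND e.DocType=1” shape come from what I actually did):

```javascript
// Hypothetical numeric document types (an enum in the real code).
var DocType = { ClassA: 1, ClassB: 2 };

// Append the DocType filter to an existing query that already has a WHERE clause.
function withDocType(originalQuery, docType) {
    if (!Number.isInteger(docType)) {
        throw new Error('docType must be an integer');
    }
    return originalQuery + ' AND e.DocType=' + docType;
}

// Hypothetical base query; only the appended filter matters here.
var query = withDocType("SELECT * FROM root e WHERE e.company = 'x' AND e.userid = 'y'", DocType.ClassA);
console.log(query);
```

Because every query names exactly one DocType, every result set can be deserialized as exactly one class, which is the whole point.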

This just gave me a little peace of mind that my queries would only return a single Document Type and that I wouldn’t have to “worry my pretty little head” over some serialization trickery later on.

So… what happened? Is a combined collection better or worse performance wise? A resounding BETTER is the answer. For the *exact* same data (just copying the documents from the 4 collections into the combined collection) and adding the DocType property I got the following results:

WebRole in Singapore with DocumentDB in west-us:







Giving an average of 3609ms. This is an 8% improvement.

For everything in west-us I then got:







With the average being 152ms. This is an improvement of 69%!!!!  HOW??? WHY???? (not that I’m complaining mind you). What appears to have happened is that regardless of compute vs storage location approximately 300ms has been shaved off the query time. ie The average for compute/storage in different locations went from 3915ms to 3609ms with a difference of 306ms. When we have compute and storage in the same location the averages were 485ms to 152ms, having a difference of 333ms.
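For anyone wanting to check my maths, the percentages are just before/after ratios of the averages quoted in this post. A tiny sketch (the function name is mine):

```javascript
// Percentage improvement when going from `beforeMs` to `afterMs`.
function improvementPercent(beforeMs, afterMs) {
    return ((beforeMs - afterMs) / beforeMs) * 100;
}

// Averages quoted in this post.
console.log(Math.round(improvementPercent(3915, 485)));  // cross-region vs same region, 4 collections
console.log(Math.round(improvementPercent(3915, 3609))); // cross-region, 4 collections vs combined
console.log(Math.round(improvementPercent(485, 152)));   // same region, 4 collections vs combined
console.log(3915 - 3609, 485 - 152);                     // absolute savings in ms
```

The absolute savings are similar in both setups, which is why the percentage looks small across the Pacific and huge when everything is in west-us.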

I’ll be asking the DocumentDB production team for any advice/reasoning around this merely to satisfy my own curiosity but hey, not going to look a gift horse in the mouth.

When I get some time I’ll do some more tests to see if this DocType property I added somehow improves the performance. If I added that to the scenario where I had the 4 collections, would it speed things up? I can’t see how, since I’m just using it to filter document entity types and for the test when I have multiple collections I’m really only querying one of them (which has a single entity type in it). More investigations to follow…..

Iceberg Example

In my previous post I examined how to use BlobSync to create a tool that not only uploads/downloads deltas to Azure Blob Storage (and hence saving LOTS of bandwidth), but also how to keep multiple versions in the cloud easily.

As a sample file for uploading/downloading I’ve picked the entire Sherlock Holmes collection. Big enough that it can show the benefits of dealing with deltas for bandwidth savings, but small enough that it can be easily edited (text).

Firstly, I perform the original upload.


Here you can see that the original sherlock file is about 3.6M and that for the initial upload the entire file is uploaded (indicated by the “Uploaded 3868221 bytes” message).

Then I list the blobs and it shows I only have 1 version (called “sherlock” as expected).


Now, I edit the sherlock file and modify a few lines here and there, and reupload it.



We can instantly see that this time the upload only transferred 100003 bytes, which is about 2.6% of the original file size. A nice saving.

Then we list the blobs associated with “sherlock” again. This time we see 2 versions:

  • sherlock 8/01/2015 11:36:09 AM +00:00
  • sherlock.v1 8/01/2015 11:36:01 AM +00:00

Here we see sherlock and sherlock.v1.  The original sherlock blob that was uploaded was renamed to sherlock.v1. The new sherlock uploaded is now the vanilla “sherlock” blob.

Note: The timestamps still need a little work. The ones displayed are when blobs were copied/uploaded. This means that sherlock.v1 doesn’t have the original timestamp when sherlock was originally uploaded but when it was copied from sherlock to sherlock.v1. But I can live with that for the moment.

Now, say I realise that I really want to have a copy of the original sherlock. The problem is that my local version has been modified. No problem, I can update my local file with the contents of sherlock.v1 (remember, that’s the original one I uploaded).


The download was 99k (again, not the 3.6M of the full file). In my case the c:\temp\sherlock is now updated to be the same as the blob sherlock.v1 (ie the original file). How can I be sure?

Well, I happen to have a spare copy of the original sherlock file on my machine (c:\temp\sherlock-orig), and you can see from my file compare (fc.exe) that the original sherlock and my newly updated local copy are the same.

Now I can upload/download deltas AND have multiple versions available to me for future reference.

So, what happens with all the backups I don’t want? Well, you can always load up any Azure Storage Explorer program and delete the blobs you don’t want. Or you can use Iceberg to prune them for you.

Say I’ve created a few more versions of sherlock.


But I’ve decided that I only want to keep the latest 2 backups (ignoring the most current one). ie I want to keep sherlock, sherlock.v2 and sherlock.v3.

I can issue the prune command as such:


Here I tell it to prune all but the latest 2 backups of the sherlock blob. I list the blobs afterwards and you can indeed see that apart from the latest (sherlock) there are only the 2 latest backups.
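The pruning decision itself is simple. Here is a sketch of it, assuming higher version numbers are newer backups; the function name is mine and this is not Iceberg's actual code.

```javascript
// Given the backup version numbers that exist for a blob, return the
// versions to delete so that only the newest `keep` backups remain.
// Assumes a higher number means a newer backup.
function versionsToPrune(versions, keep) {
    var newestFirst = versions.slice().sort(function (a, b) { return b - a; });
    return newestFirst.slice(keep);
}

// Backups v1..v3 exist, keep the latest 2: only v1 is pruned.
console.log(versionsToPrune([1, 2, 3], 2)); // [ 1 ]
```

The current blob (sherlock) never appears in the list, so it can never be pruned by accident.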

I’m starting to look at using this for more of my own personal backups. Hopefully this may be of use to others.

Versioned backups using BlobSync

As previously described, the BlobSync library (Github, Nuget, Blog) can be used to update Azure block blobs without having to upload the entire file/blob. It performs an intelligent delta calculation and uploads the minimal data possible.

So, what’s next?

To show possible use cases for BlobSync, this post will outline how it is easily possible to create a backup application that not only uploads the minimal data required but also keeps a series of backups so you can always restore a previously saved blob.

The broad design of the program is as follows:

  • Allow uploading (updating) of blobs.
  • Allow downloading (updating of local files) of blobs
  • Allow multiple versions of blobs to exist and prune what we don’t want.

For this I’m using Visual Studio 2013, other versions may work fine but YMMV. The version of BlobSync I’m using is the latest available at time of writing (0.3.0) and can be installed through Nuget as per any other package (for those who are new to Nuget, please see the Nuget documentation).

Of the three requirements listed above only the last one really adds any new functionality above BlobSync. For the upload/download I really am just using a couple of equivalent methods in BlobSync. For the multiple versions we need to figure out which approach to use.

What I decided on (and has been working well) is that for updating of an existing blob, the following process is used:

  • Each blob will have a piece of metadata which has the latest version number of the blob
  • On upload the existing blob is copied to another blob with the name <original blob name>.v<latest version number>. (along with paired signature blob)
  • New delta is uploaded against existing blob.

For example, say we have a blob called “myfile”. This means we also have a “myfile.0.sig” which is the paired signature blob.

When we upload a new version of myfile the following happens:

  • copy myfile to myfile.v.1
  • copy myfile.0.sig to myfile.v.1.0.sig
  • upload delta against myfile

This means that myfile is now the latest version and myfile.v.1 is the version that previously existed. If we repeat this process then again myfile will be the latest and what used to be myfile will now be myfile.v.2 and so on. It should be noted that the copying of the blobs is performed by the brilliantly useful Azure CopyBlob API, which allows Azure to copy the blob itself and doesn’t require any traffic between the application and Azure Blob Storage. This is a BIG time saver!
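As a rough sketch of the bookkeeping (not the real implementation), the copy steps for one upload can be planned like this. I'm assuming the version counter in the metadata starts at 0 when no backups exist, and I'm using the “<original blob name>.v<latest version number>” naming from the list above; adjust the separator if your blobs are named differently.

```javascript
// Plan the server-side copies needed before uploading a new delta.
// `latestVersion` is the version number stored in the blob's metadata
// (assumed to be 0 when no backups exist yet).
function planVersionedUpload(blobName, latestVersion) {
    var next = latestVersion + 1;
    var versioned = blobName + '.v' + next;
    return {
        newVersion: next, // value to write back to the metadata
        copies: [
            { from: blobName, to: versioned },
            { from: blobName + '.0.sig', to: versioned + '.0.sig' } // paired signature blob
        ]
    };
}

console.log(planVersionedUpload('myfile', 0));
```

Only after both copies are done does the new delta get uploaded against the original blob name.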

Now that we’d have myfile, myfile.v.1 and myfile.v.2 we should also be able to use this new project to download any version of the file. More importantly be able to just download the deltas to reduce bandwidth usage (since that is the aim of the game).

So this is the high level design in mind…   you might want to look at the implementation.

BlobSync and Sigexplorer updates!

Both BlobSync (Nuget and binary release) as well as Sigexplorer have been updated with some nice improvements.


BlobSync now has parallel uploading of the binary deltas to Azure Blob Storage. Sounds like an obvious improvement (which I’ll continue to expand/improve) but wanted to make sure all the binary delta edge cases were working before adding tasks/threads into the mix. Currently the parallel factor is only 2 (this will be soon configurable) but it’s enough to prove it works. There have been some very tough bugs to squash since the 0.2.2 release, particularly around very small adjustments (byte or two) at the end of files being updated. These were being missed out previously, this is now fixed.

A small design change is how BlobSync uses small signatures when trying to determine how to match against new content. The problem is when we should and should NOT reuse small signatures.

For example (sorry for dodgy artwork), say we have a blob with some small signatures contained in it:



Then we extend the blob and during the update process we need to see if we have any existing signatures that can be reused in the new area:




The problem we have is that if these small signatures are a few bytes in size and they’re trying to find matches in the new area (yellow) there is a really good chance that they’ll get a match. After all, there are only 256 values to a byte! So what we’ll end up with is a new area that is potentially reusing a lot of small signatures instead of making a new block/signature and uploading the new data. Now strictly speaking we usually want to reuse as many signatures/blocks as we can but the problem with using so many tiny blocks is that we’ll soon fragment our blobs so much that we’ll end up not being able to update properly. Don’t forget a blob can only consist of 50000 blocks maximum.

So a rule BlobSync 0.3.0 has added is that we’ll attempt a match only if the byte range we’re looking at (yellow above) is greater than 1000 bytes and the block/signature we’re looking at is greater than 100 bytes, OR if the byte range and the signature are exactly the same size. This way we’ll hopefully reduce the level of fragmentation and only increase the volume of data being uploaded by a small percentage.
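Written as a predicate, the rule looks something like the sketch below. The function name is mine and this is an illustration of the rule as described, not a copy of the BlobSync source.

```javascript
// Decide whether an existing signature should be tried against a byte range.
function shouldAttemptMatch(rangeSize, signatureSize) {
    // An exact fit is always worth trying.
    if (rangeSize === signatureSize) {
        return true;
    }
    // Otherwise only bother with reasonably large ranges and signatures.
    return rangeSize > 1000 && signatureSize > 100;
}

console.log(shouldAttemptMatch(5000, 2000)); // true: big range, big signature
console.log(shouldAttemptMatch(5000, 11));   // false: tiny signature would fragment the blob
console.log(shouldAttemptMatch(11, 11));     // true: exact fit
```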


Sigexplorer has also been improved when you want to view the signatures being generated. Instead of rendering all signatures at once in the tree structure it will simply populate the “branches” as the user clicks on them. This reduces the load time significantly and makes the entire experience much quicker.

Exploring BlobSync in depth (aka bandwidth savings for Azure Blob Storage).

After receiving some more interest in the BlobSync project (Github and Nuget), I thought I’d go into some more depth of what the delta uploads look like and how you can really tell what BlobSync is really doing.

Firstly we’ll look at a simple example of uploading a text file, modifying it then uploading it again but this time only with the delta.

The text file I’m using is 1.4M in size, not really a situation where bandwidth savings is required but it demonstrates the point. Firstly, the original upload:




So for the first upload (without a previous version of the blob existing in Azure Blob Storage) the full 1.4M had to be uploaded.

Then I edited text.txt, specifically I added a few characters in the first part of the file then I remove some other characters in the lower third. The second update looks like:



So here we see the second time around we only needed to actually upload 3840 bytes. Definitely a good saving.

The question is, what *really* happened behind the scenes? To examine what happened we need an additional tool called SigExplorer (Github and binary) as well as the signature files associated with the blobs.

Details of what is contained in the signature files are covered in an earlier post, but a quick explanation is that a signature file contains hashes of “chunks” from our main files. If multiple signatures match then they contain the same data and therefore can be reused. To get the signature files for the above blobs I needed to upload the first blob, get the signature file, perform the second upload and then get the updated signature file. This way I have 2 versions of the signature file for comparison. To determine the signature file name we look into the metadata of the blobs. In this case I could see it was blob1.0.sig. Any decent Azure blob tool can find it for you, in my case I like Azure Storage Explorer:



In this case I retrieved the metadata for my “blob1” blob and could see the “sigurl” metadata had the value of “blob1.0.sig”. This means the signature blob for blob1 was in the same container and with the name “blob1.0.sig”.

So I downloaded that then performed the second upload as shown above, then downloaded the new sig file. This leaves me with 2 signature files, one for the original blob and one for the new one.

Now we have the data, let’s use SigExplorer to look at what’s actually happening. First load SigExplorer (sigexplorer.exe) and drag the original signature file into the left panel and the updated signature into the right one.

(in my case I’ve renamed them first-upload-blob1.0.sig and second-upload-blob1.0.sig)



Even before we start expanding the tree structures we get a number of useful facts from the initial screen. As indicated by the red arrows we know the original blob size (1.4M) and how much of the data was shared when the second upload happened. In this case 1404708 bytes were NOT required to be re-uploaded because they were already available in Azure Blob Storage!! The final red arrow clearly tells us that only 3840 bytes were required to be uploaded. So instead of uploading the full 1.4M we only uploaded a measly 3840 bytes (under 4k). Admittedly we already knew this because when we executed BlobSyncCmd.exe it had told us. But hey, nice to see the signature files confirming this.

The 2 tree structures (left for original, right for updated) represent the individual blocks that comprise the blob. In the left tree we see 2000 (704). This is telling us that starting at the beginning of the file we have 704 blocks of 2000 bytes in size. The expanded tree showing 0, 2000, 4000 etc are the individual blocks, shown by the offsets in the file at which they are located. Honestly the left tree is always fairly boring, the right hand tree is where the details really are. In this case again we start with blocks of 2000 bytes but we see we only have 159 of them. Then we have a single block of 11 bytes (this represents *some* of the text that I added). Then we follow with 364 blocks of 2000 in size followed by a single block of 1829 bytes. This block previously was a 2000 byte block but I had deleted 171 bytes when modifying the file. Then we continue with more blocks which are the same as the original file.

If we expand these changed blocks we can see more details:





In the above image we can see a bunch of green nodes. These represent blocks that have been reused, ie NOT uploaded multiple times. Green is good, green is your friend. The 2 numbers indicate where a block is located in the new blob and where it used to be in the old blob. So we can see everything up to offset 314000 basically hasn’t shifted. Offset 316000 is in black and represents a 2000 byte block that needed to be uploaded. This is because nowhere in the old blob did the exact same 2000 bytes exist in any other block. Following that we see the single block of 11 bytes. So far then we’ve had to upload 2000+11 bytes.

We can also see that we are able to reuse more blocks but now they’ve actually moved location, for example the 2000 byte block that is now located at 318011 used to exist at location 318000 (obviously because we’ve added in that 11 byte block before it). This is the real key to why BlobSync (and similar methods) is so effective. We can find and reuse blocks even when they’ve moved anywhere else in the blob.
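To illustrate the “blocks can move” idea, here is a deliberately naive sketch. It indexes the old content by fixed-size block and then looks for those blocks at ANY offset in the new content. It is a simplification under stated assumptions: real tools compare rolling checksums and hashes rather than the raw block text, strings stand in for bytes, and BlobSync uploads whole replacement blocks rather than counting individual bytes as this does.

```javascript
// Find old blocks that can be reused in the new data, wherever they moved to.
function planDelta(oldData, newData, blockSize) {
    // Index every full block of the old data by its content.
    var oldBlocks = new Map();
    for (var o = 0; o + blockSize <= oldData.length; o += blockSize) {
        oldBlocks.set(oldData.slice(o, o + blockSize), o);
    }

    var reused = [];       // { newOffset, oldOffset } pairs
    var bytesToUpload = 0; // bytes not covered by any old block
    var pos = 0;
    while (pos < newData.length) {
        var candidate = newData.slice(pos, pos + blockSize);
        if (candidate.length === blockSize && oldBlocks.has(candidate)) {
            reused.push({ newOffset: pos, oldOffset: oldBlocks.get(candidate) });
            pos += blockSize; // skip past the reused block
        } else {
            bytesToUpload++;  // slide forward one byte and try again
            pos++;
        }
    }
    return { reused: reused, bytesToUpload: bytesToUpload };
}

// Two characters inserted after the first block: later blocks shift by 2 but are still reused.
var plan = planDelta('ABCDEFGHIJKL', 'ABCDxxEFGHIJKL', 4);
console.log(plan.bytesToUpload, plan.reused);
```

In the toy example the blocks that used to sit at offsets 4 and 8 are found again at 6 and 10, the same shape as the 318000 to 318011 shift in the screenshot.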

Now we’ll have a quick look at where I removed data:




In this case we can see we added in a single block of 1829 bytes. This used to be a 2000 byte block but as previously mentioned I removed a few bytes. This means the old 2000 byte block couldn’t be used anymore so we had to discard that and introduce a new block.


In total from the signatures we can see we’ve had to upload 2000 + 11 + 1829 bytes (which  gives us the 3840 bytes we see at the top of the screen).


Now, it’s all well and good to show a working example with text files that aren’t really large, but what about something a little more practical. Say, a SQL Server backup file? Offsite backups are always useful and it would be great not to have to upload the entire thing every time.




Although not huge files we can see that we’re now getting a bit more practical. In this case I backed up a database, did a couple of modifications and backed it up again. After uploading, grabbing signature files etc we can see the results.

In this case the second upload only required 340k to go over the wire instead of the original 13.8M. That’s a pretty good saving. As we can see many of the 2000 byte blocks were broken down/replaced by different sized blocks. Eventually we’ll want to combine some of these smaller blocks into larger blocks but that’s for a future post.


Happy Uploading!

BlobSync updates.

The BlobSync library (at Github and Nuget) has been updated with some small improvements and fixes. What has also been updated is the example executable (called BlobSyncCmd) which can be compiled from the source or downloaded here.

This sample executable is really just a wrapper around the library but is itself a fairly useful tool. Just as a reminder, the theory behind BlobSync can be read in an earlier post. This post covers setting up the executable and some examples of what you can do with it.

Firstly, download the latest release (version 0.2.1 at time of writing) then uncompress it.

There are only 3 configuration values in the BlobSyncCmd.exe.config file:

– AzureAccountKey : This is the usual account key available from the Azure Portal

– AzureAccountName: again, from the portal.

– SignatureSize: This defines the granularity at which the file will be uploaded to Azure Blob Storage (ABS). Each of these “blocks” can be independently replaced, which is the key to being able to update existing blobs. For example, if you set SignatureSize to 10000 (10k) and you upload a file, modify a single byte in the local copy and upload it again then the approx 10k delta will be uploaded (instead of the entire file). The smaller this value, the smaller the bandwidth requirements, although this does limit the overall size of the Azure blobs. This is because Azure Block Blobs (which BlobSync uses) have a maximum of 50000 blocks. So if you define a block (signature) to be 1k in size then the maximum size of the overall blob is 50000 * 1k  == 50M.
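The trade-off is easy to compute. A small sketch using the 50000 block limit mentioned above (the function and constant names are mine):

```javascript
// Block blob limit quoted in this post.
var MAX_BLOCKS_PER_BLOB = 50000;

// Largest blob (in bytes) that can be stored for a given SignatureSize.
function maxBlobSizeBytes(signatureSize) {
    return MAX_BLOCKS_PER_BLOB * signatureSize;
}

console.log(maxBlobSizeBytes(1000));  // 50000000 bytes, the 50M example
console.log(maxBlobSizeBytes(10000)); // 500000000 bytes with a 10k SignatureSize
```

So pick a SignatureSize with the largest file you expect to upload in mind, not just the bandwidth saving.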


BlobSyncCmd Examples:


I have a text file which is about 1.4M in size. I upload using BlobSyncCmd:




Given this is a brand new file all 1410366 bytes had to be uploaded.


Now I edit the file and modify a few lines and need to upload the file again:




Now we can see that although we originally had to upload 1410366 bytes, for the update we only had to transfer 9959 bytes. A HUGE saving!


Suppose we have someone else who had the original file but now wants to get our latest updates, they can simply download the latest blob and update their existing OLD copy.



Here we can see their older version of test.txt (called test-version1.txt) is updated against what is available in Azure Blob Storage. Instead of having to download all 1.4M they only had to download 9999 bytes. Again, HUGE bandwidth savings!


During the upload not only is the blob itself being uploaded but another “signature” blob is being generated and associated with the main blob. This signature blob has the information which is used to determine which parts of a file can be reused and which need to be replaced for uploads and downloads. For experimentation purposes it is possible to generate these files and determine how much bandwidth *would* be used if transfer were to happen.


So, repeating what we did earlier (but not really transferring any file), suppose we have already uploaded a file, locally modified it and then want to see how many bytes would be transferred for the update:



In this case, we have testblob and an updated local test.txt file. According to the estimate, if we wanted to upload the new version of test.txt, 10039 bytes would be uploaded.


For the scenario where we haven’t uploaded ANYTHING to Azure Blob Storage but we want to see potential savings, we can do the following:


Here we generate a signature (the exact same signature that would have been uploaded to Azure Blob Storage), called c:\temp\test.txt.sig.

So now we modify test.txt and want to check how much would need to be uploaded to Azure Blob Storage IF we were to upload. So for this we can use the ‘estimatelocal’ command.


This command takes the original signature file (which would have been normally available from Azure Blob Storage) and checks it against the modified test.txt.

In the example above it tells us that we’d just need to upload 9290 bytes to perform the update.


The ‘createsig’ and ‘estimatelocal’ commands are really just for testing various file/file-types to see how well BlobSync would work for those scenarios.