Exploring BlobSync in depth (aka bandwidth savings for Azure Blob Storage).

After receiving some more interest in the BlobSync project (GitHub and NuGet), I thought I'd go into more depth on what the delta uploads look like and how you can tell what BlobSync is actually doing.

Firstly we'll look at a simple example: uploading a text file, modifying it, then uploading it again, this time only transferring the delta.

The text file I'm using is 1.4M in size; not really a situation where bandwidth savings are required, but it demonstrates the point. Firstly, the original upload:

 

[Image: blobupload1]

So for the first upload (without a previous version of the blob existing in Azure Blob Storage) the full 1.4M had to be uploaded.

Then I edited text.txt; specifically, I added a few characters in the first part of the file and removed some other characters in the lower third. The second upload looks like:

[Image: blobupload2]

So here we see that the second time around we only needed to upload 3840 bytes. Definitely a good saving.

The question is, what *really* happened behind the scenes? To examine this we need an additional tool called SigExplorer (GitHub and binary), as well as the signature files associated with the blobs.

Details of what is contained in the signature files are covered in an earlier post, but the quick explanation is that a signature file contains hashes of "chunks" of our main file. If two signatures match then the corresponding chunks contain the same data and can therefore be reused.

To get the signature files for the above blobs I needed to upload the first blob, get the signature file, perform the second upload and then get the updated signature file. This way I have two versions of the signature file for comparison. To determine the signature file name we look at the metadata of the blob; in this case I could see it was blob1.0.sig. Any decent Azure blob tool can show it to you, and in my case I like Azure Storage Explorer:

[Image: signature]

In this case I retrieved the metadata for my "blob1" blob and could see the "sigurl" metadata entry had the value "blob1.0.sig". This means the signature blob for blob1 was in the same container, with the name "blob1.0.sig".

So I downloaded that, performed the second upload as shown above, then downloaded the new sig file. This leaves me with two signature files, one for the original blob and one for the new one.
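If you'd rather script that retrieval than click through a GUI tool, something roughly like the following does the same job. This uses the classic WindowsAzure.Storage SDK; the connection string and container name are placeholders, and only the blob name and the "sigurl" metadata key come from the example above.

// Sketch: read the "sigurl" metadata from the main blob, then download the signature blob.
using System;
using System.IO;
using Microsoft.WindowsAzure.Storage;

class FetchSignature
{
    static void Main()
    {
        var account = CloudStorageAccount.Parse("<your storage connection string>");
        var container = account.CreateCloudBlobClient().GetContainerReference("<your container>");

        // The main blob's "sigurl" metadata names its signature blob.
        var blob = container.GetBlockBlobReference("blob1");
        blob.FetchAttributes();
        string sigName = blob.Metadata["sigurl"];   // e.g. "blob1.0.sig"
        Console.WriteLine("Signature blob: " + sigName);

        // The signature blob lives in the same container, so grab it too.
        container.GetBlockBlobReference(sigName).DownloadToFile(sigName, FileMode.Create);
    }
}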

Now that we have the data, let's use SigExplorer to look at what's actually happening. First load SigExplorer (sigexplorer.exe) and drag the original signature file into the left panel and the updated signature into the right one.

(In my case I've renamed them first-upload-blob1.0.sig and second-upload-blob1.0.sig.)

[Image: sigexplorer1]

Even before we start expanding the tree structures we get a number of useful facts from the initial screen. As indicated by the red arrows we know the original blob size (1.4M) and how much of the data was shared when the second upload happened. In this case 1404708 bytes did NOT need to be re-uploaded because they were already available in Azure Blob Storage!! The final red arrow clearly tells us that only 3840 bytes needed to be uploaded. So instead of uploading the full 1.4M we only uploaded a measly 3.8k. Admittedly we already knew this because BlobSyncCmd.exe told us when we executed it, but hey, nice to see the signature files confirming it.

The two tree structures (left for the original blob, right for the updated one) represent the individual blocks that comprise the blob. In the left tree we see "2000 (704)". This tells us that, starting at the beginning of the file, there are 704 blocks of 2000 bytes each. The expanded entries showing 0, 2000, 4000 etc. are the individual blocks, and the numbers are the offsets at which they are located in the file. Honestly, the left tree is always fairly boring; the right hand tree is where the details really are.

In this case we again start with blocks of 2000 bytes, but there are only 159 of them. Then we have a single block of 11 bytes (this represents *some* of the text that I added). Then we follow with 364 blocks of 2000 bytes, followed by a single block of 1829 bytes. This was previously a 2000 byte block, but I deleted 171 bytes from that part of the file when modifying it. Then we continue with more blocks which are the same as in the original file.

If we expand these changed blocks we can see more details:

 

[Image: sigexplorer2]

In the above image we can see a bunch of green nodes. These represent blocks that have been reused, i.e. NOT uploaded again. Green is good, green is your friend. The two numbers indicate where a block is located in the new blob and where it used to be in the old blob. So we can see that everything up to offset 314000 basically hasn't shifted. Offset 316000 is in black and represents a 2000 byte block that needed to be uploaded; nowhere in the old blob did the exact same 2000 bytes exist in any block. Following that we see the single block of 11 bytes. So far, then, we've had to upload 2000 + 11 bytes.

We can also see that we are able to reuse more blocks even though they've moved location; for example, the 2000 byte block now located at offset 318011 used to exist at offset 318000 (obviously because we added in that 11 byte block earlier). This is the real key to why BlobSync (and similar methods) is so effective: we can find and reuse blocks even when they've moved anywhere else in the blob.
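BlobSync's actual implementation is described in the earlier theory post, but the core matching idea can be sketched roughly as follows: hash every fixed-size block of the old file, then slide through the new file looking for windows whose hash matches a known block, wherever that block now sits. This is a deliberately simplified illustration (it re-hashes every window from scratch instead of using a cheap rolling checksum, and the file paths are placeholders); the 2000 byte block size is just the value from the example above.

// Simplified sketch of the block matching idea - not BlobSync's actual code.
using System;
using System.Collections.Generic;
using System.IO;
using System.Security.Cryptography;

class BlockMatchSketch
{
    static void Main()
    {
        const int blockSize = 2000;
        byte[] oldData = File.ReadAllBytes(@"c:\temp\old.txt");   // placeholder paths
        byte[] newData = File.ReadAllBytes(@"c:\temp\new.txt");

        using (var md5 = MD5.Create())
        {
            // "Signature" of the old file: a hash for every fixed-size block.
            var oldBlocks = new Dictionary<string, long>();
            for (long offset = 0; offset + blockSize <= oldData.Length; offset += blockSize)
            {
                string hash = BitConverter.ToString(md5.ComputeHash(oldData, (int)offset, blockSize));
                if (!oldBlocks.ContainsKey(hash))
                    oldBlocks[hash] = offset;
            }

            // Slide through the new file looking for blocks we already have,
            // regardless of where they now sit.
            long bytesToUpload = 0;
            long pos = 0;
            while (pos < newData.Length)
            {
                string hash = (pos + blockSize <= newData.Length)
                    ? BitConverter.ToString(md5.ComputeHash(newData, (int)pos, blockSize))
                    : null;

                long oldOffset;
                if (hash != null && oldBlocks.TryGetValue(hash, out oldOffset))
                {
                    Console.WriteLine("Reuse block: new offset {0}, old offset {1}", pos, oldOffset);
                    pos += blockSize;
                }
                else
                {
                    // No known block starts here; these bytes will need uploading.
                    bytesToUpload++;
                    pos++;
                }
            }
            Console.WriteLine("Bytes needing upload: {0}", bytesToUpload);
        }
    }
}

Run over the two versions of the text file, a sketch like this reports the reused blocks at their shifted offsets, which is essentially what SigExplorer is visualising above.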

Now we’ll have a quick look at where I removed data:

 

[Image: sigexplorer3]

In this case we can see we added in a single block of 1829 bytes. This used to be a 2000 byte block, but as previously mentioned I removed some characters from it. This means the old 2000 byte block couldn't be reused any more, so we had to discard it and introduce a new block.

 

In total, from the signatures we can see we've had to upload 2000 + 11 + 1829 bytes (which gives us the 3840 bytes we see at the top of the screen).

 

Now, it's all well and good to show a working example with text files that aren't really large, but what about something a little more practical? Say, a SQL Server backup file? Offsite backups are always useful and it would be great not to have to upload the entire thing every time.

 

[Image: sigexplorer-sql]

Although these aren't huge files, we can see that we're now getting a bit more practical. In this case I backed up a database, made a couple of modifications and backed it up again. After uploading, grabbing the signature files etc. we can see the results.

In this case the second upload only required 340k to go over the wire instead of the original 13.8M. That's a pretty good saving. As we can see, many of the 2000 byte blocks were broken down or replaced by differently sized blocks. Eventually we'll want to combine some of these smaller blocks into larger ones, but that's for a future post.

 

Happy Uploading!


BlobSync updates.

The BlobSync library (on GitHub and NuGet) has been updated with some small improvements and fixes. The example executable (called BlobSyncCmd) has also been updated; it can be compiled from the source or downloaded here.

This sample executable is really just a wrapper around the library but is itself a fairly useful tool. Just as a reminder, the theory behind BlobSync can be read in an earlier post. This post covers setting up the executable and some examples of what you can do with it.

Firstly, download the latest release (version 0.2.1 at time of writing) then uncompress it.

There are only 3 configuration values in the BlobSyncCmd.exe.config file:

– AzureAccountKey: This is the usual account key, available from the Azure Portal.

– AzureAccountName: again, from the portal.

– SignatureSize: This defines the granularity with which the file will be uploaded to Azure Blob Storage (ABS). Each of these "blocks" can be independently replaced, which is the key to being able to update existing blobs. For example, if you set SignatureSize to 10000 (10k), upload a file, modify a single byte in the local copy and upload it again, then only the approximately 10k delta will be uploaded (instead of the entire file). The smaller this value, the smaller the bandwidth requirements, although it also limits the overall size of the blob. This is because Azure block blobs (which BlobSync uses) have a maximum of 50000 blocks, so if you define a block (signature) to be 1k in size then the maximum size of the overall blob is 50000 * 1k == 50M.
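To make that block-count limit concrete, here's the same arithmetic as a trivial snippet (the 50000 figure is the Azure block blob limit mentioned above; the sample SignatureSize values are just illustrative):

// The chosen SignatureSize caps the maximum blob size, because Azure block
// blobs allow at most 50000 blocks per blob.
using System;

class MaxBlobSize
{
    static void Main()
    {
        const long maxBlocks = 50000;
        foreach (long signatureSize in new long[] { 1000, 10000, 100000 })
        {
            Console.WriteLine("SignatureSize {0,7} bytes -> max blob size {1:N0} bytes",
                signatureSize, maxBlocks * signatureSize);
        }
    }
}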

 

BlobSyncCmd Examples:

 

I have a text file which is about 1.4M in size. I upload it using BlobSyncCmd:

 

[Image: bs1]

Given this is a brand new file all 1410366 bytes had to be uploaded.

 

Now I edit the file, modifying a few lines, and need to upload it again:

 

[Image: bs2]

Now we can see that although we originally had to upload 1410366 bytes, for the update we only had to transfer 9959 bytes. A HUGE saving!

 

Suppose someone else has the original file but now wants to get our latest updates; they can simply download the latest blob and update their existing OLD copy.

[Image: bs3]

Here we can see their older version of test.txt (called test-version1.txt) is updated against what is available in Azure Blob Storage. Instead of having to download all 1.4M they only had to download 9999 bytes. Again, HUGE bandwidth savings!
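Under the hood the download side works from the same signature information: blocks that the old local file already contains are copied straight out of it, and only the remaining byte ranges are pulled down from Azure. Below is a very rough sketch of that assembly step, assuming we have already compared the remote signature against the old local file (as in the matching sketch earlier) to decide which blocks can come from disk. The BlockPlan type is purely illustrative and not a BlobSync type; DownloadRangeToStream is the classic WindowsAzure.Storage SDK call that makes the partial download possible.

// Sketch only: rebuild the new file by reusing local blocks where possible and
// doing ranged downloads for everything else.
using System.Collections.Generic;
using System.IO;
using Microsoft.WindowsAzure.Storage.Blob;

class DownloadDeltaSketch
{
    // One entry per block of the remote blob, produced by comparing the remote
    // signature against the old local file.
    class BlockPlan
    {
        public long RemoteOffset;   // where the block sits in the blob
        public int Size;
        public long? LocalOffset;   // set when the old local file already has these bytes
    }

    static void Rebuild(CloudBlockBlob blob, string oldLocalPath, string newLocalPath,
                        IEnumerable<BlockPlan> plan)
    {
        using (var oldFile = File.OpenRead(oldLocalPath))
        using (var newFile = File.Create(newLocalPath))
        {
            foreach (var block in plan)
            {
                if (block.LocalOffset.HasValue)
                {
                    // Reuse bytes we already have locally - no bandwidth needed.
                    // (Sketch: assumes Read fills the buffer in one call.)
                    var buffer = new byte[block.Size];
                    oldFile.Seek(block.LocalOffset.Value, SeekOrigin.Begin);
                    oldFile.Read(buffer, 0, block.Size);
                    newFile.Write(buffer, 0, block.Size);
                }
                else
                {
                    // Only this byte range comes over the wire.
                    blob.DownloadRangeToStream(newFile, block.RemoteOffset, block.Size);
                }
            }
        }
    }
}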

 

During the upload, not only is the blob itself uploaded but another "signature" blob is generated and associated with the main blob. This signature blob holds the information used to determine which parts of a file can be reused and which need to be replaced, for both uploads and downloads. For experimentation purposes it is possible to generate these files locally and determine how much bandwidth *would* be used if a transfer were to happen.

 

So, repeating what we did earlier (but without really transferring any file), suppose we have already uploaded a file, modified it locally and now want to see how many bytes would be transferred for the update:

[Image: bs4]

In this case, we have testblob and an updated local test.txt file. According to the estimate, if we wanted to upload the new version of test.txt, 10039 bytes would be uploaded.

 

For the scenario where we haven’t uploaded ANYTHING to Azure Blob Storage but we want to see potential savings, we can do the following:

[Image: bs5]

Here we generate a signature (the exact same signature that would have been uploaded to Azure Blob Storage), called c:\temp\test.txt.sig.

Now we modify test.txt and want to check how much would need to be uploaded to Azure Blob Storage IF we were to upload it. For this we can use the 'estimatelocal' command.

[Image: bs6]

This command takes the original signature file (which would have been normally available from Azure Blob Storage) and checks it against the modified test.txt.

In the example above it tells us that we’d just need to upload 9290 bytes to perform the update.
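For anyone following along without the screenshots, the two commands used above look roughly like this (the argument order is my assumption; check the exe's usage output for the exact syntax):

blobsynccmd createsig c:\temp\test.txt

blobsynccmd estimatelocal c:\temp\test.txt.sig c:\temp\test.txt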

 

The 'createsig' and 'estimatelocal' commands are really just for testing various files and file types to see how well BlobSync would work for those scenarios.

Optimising Azure Blob Updating Part 2

In my previous post I covered the high level theory for an efficient method of updating Azure block blobs. (This theory can be used with most cloud storage providers, as long as their blob APIs allow partial uploads/downloads.)

The implementation of what was described is now available on Github.

I’ll go through the implementation specifics in a later post, but for those who want to try out BlobSync, simply clone the github repo. Compile (I’m using Visual Studio 2013) and you’ll end up with a binary “BlobSyncCmd.exe”. Some examples of using the command are:

 

blobsynccmd upload c:\temp\myfile mycontainer myblob

 

This will upload the local “myfile” to Azure, in the appropriate container with the appropriate name.

Then, feel free to modify your local file and upload it again. If you use any network monitoring tools you should see a dramatic reduction in the uploaded bytes (assuming you don’t modify the entire file).

Equally you can run the command:

 

blobsynccmd download c:\temp\myfile mycontainer myblob

 

This will download the blob and reuse as much of "myfile" as it can. Currently, for testing purposes, it won't replace myfile but will create myfile.new.

 

More to come.