After receiving some more interest in the BlobSync project (Github and Nuget), I thought I’d go into some more depth of what the delta uploads look like and how you can really tell what BlobSync is really doing.
Firstly we’ll look at a simple example of uploading a text file, modifying it then uploading it again but this time only with the delta.
The text file I’m using is 1.4M in size, not really a situation where bandwidth savings is required but it demonstrates the point. Firstly, the original upload:
So for the first upload (without a previous version of the blob existing in Azure Blob Storage) the full 1.4M had to be uploaded.
Then I edited text.txt, specifically I added a few characters in the first part of the file then I remove some other characters in the lower third. The second update looks like:
So here we see the second time around we only needed to actually upload 3840 bytes. Definitely a good saving.
The question is, what *really* happened behind the scenes. To examine what happened we need an addition tool called SigExplorer (Github and binary) as well as downloading the signature files associated with the blobs.
Details of what is contained in the signature files is covered in an earlier post, but a quick explanation is that a signature file contains hashes of “chunks” from our main files. If multiple signatures match then they contain the same data and therefore can be reused. To get the signature files for the above blobs I needed to upload the first blob, get the signature file, perform the second upload and then get the updated signature file. This way I have 2 versions of the signature file for comparison. To determine the signature file name is we look into the metadata of the blobs. In this case I could see it was blob1.0.sig. Any decent Azure blob tool can find it for you, in my case I like Azure Storage Explorer:
In this case I retrieved the metadata for my “blob1” blob and could see the “sigurl” metadata had the value of “blob1.0.sig”. This means the signature blob for blob1 was in the same container and with the name “blob1.0.sig”.
So I downloaded that then performed the second upload as shown above, then downloaded the new sig file. This leaves me with 2 signature files, one for the original blob and one for the new one.
Now we have the data, lets use SigExplorer to look at what’s actually happening. First load SigExplorer (sigexplorer.exe) and drag the original signature file into the left panel and the updated signature into the right one.
(in my case I’ve renamed them first-upload-blob1.0.sig and second-upload-blob1.0.sig
Even before we start expanding the tree structures we get a number of useful facts from the initial screen. As indicated by the red arrows we know the original blob size (1.4M) and how much of the data was shared when the second upload happened. In this case 1404708 bytes were NOT required to be re-uploaded because they were already available in Azure Blob Storage!! The final red arrow clearly tells us that only 3840 bytes were required to be uploaded. So instead of uploading the full 1.4M we only uploaded a measly 3.5k. Admittedly we already knew this because when we executed the BlobSyncCmd.exe it had told us. But hey, nice to see the signature files confirming this.
The 2 tree structures (left for original, right for updated) represent the individual blocks that comprise the blob. In the left tree we see 2000 (704). This is telling us that starting at the beginning of the file we have 704 blocks of 2000 bytes in size. The expanded tree showing 0, 2000, 4000 etc are the individual blocks and are showing the offsets in the file that they are located. Honestly the left tree is always fairly boring, the right hand tree is where the details really are. In this case again we start with blocks of 2000 bytes but we see we only have 159 of them. Then we have a single block of 11 bytes (this represents *some* of the text that I added). Then we follow with 364 blocks of 2000 in size followed by a single block of 1829 bytes. This block previously was a 2000 byte block but I had deleted 71 bytes when modifying the file. Then we continue with more blocks which are the same as the original file.
If we expand these changed blocks we can see more details:
In the above image we can see a bunch of green nodes. This represents blocks that have been reused, ie NOT uploaded multiple times. Green is good, green is your friend. The 2 numbers indicate where a block is located in the new blob and where it used to be in the old blob. So we can see everything up to offset 314000 basically hasn’t shifted. Offset 316000 is in black and represents a 2000 byte block that needed to be uploaded. This is due to nowhere in the old blob were the exact same 2000 bytes existing in any other block. Following that we see the single block of 11 bytes. So far then we’ve had to upload 2000+11 bytes.
We can also see that we are able to reuse more blocks but now they’ve actually moved location, for example the 2000 byte block that is now located at 318011 used to exist at location 318000 (obviously due to we’ve added in that 11 byte block previously). This is the real key to why BlobSync (and similar methods) is so effective. We can find and reuse blocks even when they’ve moved anywhere else in the blob.
Now we’ll have a quick look at where I removed data:
In this case we can see we added in a single block of 1829 bytes. This used to be a 2000 byte block but as previously mentioned I removed a few bytes. This means the old 2000 byte block couldn’t be used anymore so we had to discard that and introduce a new block.
In total from the signatures we can see we’ve had to upload 2000 + 11 + 1829 bytes (which gives us the 3840 bytes we see at the top of the screen).
Now, it’s all well and good to show a working example with text files that aren’t really large, but what about something a little more practical. Say, a SQL Server backup file? Offsite backups are always useful and it would be great not to have to upload the entire thing every time.
Although not huge files we can see that we’re now getting a bit more practical. In this case I backed up a database, did a couple of modifications and backed it up again. After uploading, grabbing signature files etc we can see the results.
In this case the second upload only required 340k to go over the wire instead of the original 13.8M. That’s a pretty good saving. As we can see many of the 2000 byte blocks were broken down/replaced by different sized blocks. Eventually we’ll want to combine some of these smaller blocks into larger blocks but that’s for a future post.