BlobSync and Sigexplorer updates!

Both BlobSync (Nuget and binary release) and Sigexplorer have been updated with some nice improvements.

 

BlobSync now uploads the binary deltas to Azure Blob Storage in parallel. It sounds like an obvious improvement (and one I'll continue to expand on), but I wanted to make sure all the binary delta edge cases were working before adding tasks/threads into the mix. Currently the parallel factor is only 2 (this will soon be configurable), but it's enough to prove the approach works. There have also been some very tough bugs to squash since the 0.2.2 release, particularly around very small adjustments (a byte or two) at the end of files being updated. These were previously being missed; this is now fixed.
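For the curious, the throttling itself is nothing exotic. Here's a rough sketch of capping concurrent block uploads at a parallel factor of 2 (illustrative names only, not BlobSync's actual internals; uploadBlockAsync stands in for the real Azure block upload call):

using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Rough sketch only: upload the changed delta blocks with at most 2 in flight.
static async Task UploadChangedBlocksAsync(
    IEnumerable<byte[]> changedBlocks,
    Func<byte[], Task> uploadBlockAsync,
    int parallelFactor = 2)
{
    var throttle = new SemaphoreSlim(parallelFactor);
    var uploads = changedBlocks.Select(async block =>
    {
        await throttle.WaitAsync();
        try
        {
            await uploadBlockAsync(block);   // upload just this delta block
        }
        finally
        {
            throttle.Release();
        }
    }).ToList();   // materialise so all uploads are started

    await Task.WhenAll(uploads);
}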

A small design change concerns how BlobSync uses small signatures when trying to match against new content. The problem is deciding when we should and should NOT reuse small signatures.

For example (sorry for dodgy artwork), say we have a blob with some small signatures contained in it:

 

uupdate2

Then we extend the blob and during the update process we need to see if we have any existing signatures that can be reused in the new area:

 

uupdate3

uupdate4

The problem is that if these small signatures are only a few bytes in size and they're searching for matches in the new area (yellow), there is a really good chance they'll find one. After all, there are only 256 possible values for a byte! So we could end up with a new area that reuses a lot of tiny signatures instead of creating a new block/signature and uploading the new data. Now strictly speaking we usually want to reuse as many signatures/blocks as we can, but the problem with so many tiny blocks is that we'll soon fragment our blobs to the point where we can't update them properly. Don't forget a blob can consist of at most 50000 blocks.

So BlobSync 0.3.0 adds a rule: we only attempt a match if the byte range we're searching (yellow above) is greater than 1000 bytes and the block/signature we're considering is greater than 100 bytes, OR if the byte range and the signature are exactly the same size. This way we hopefully reduce the level of fragmentation while only increasing the volume of data uploaded by a small percentage.
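Expressed as code, the rule is roughly the following (a sketch of the condition just described, with illustrative names, not the actual BlobSync source):

// searchRangeLength: size of the byte range being searched (the yellow area above).
// signatureLength:   size of the existing block/signature we might reuse.
static bool ShouldAttemptMatch(long searchRangeLength, long signatureLength)
{
    // Only bother matching small signatures when the search area is reasonably
    // large too, or when the two sizes line up exactly.
    return (searchRangeLength > 1000 && signatureLength > 100)
        || searchRangeLength == signatureLength;
}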

 

Sigexplorer has also been improved for viewing the signatures being generated. Instead of rendering all signatures in the tree structure at once, it now populates the "branches" as the user clicks on them. This reduces the load time significantly and makes the entire experience much quicker.
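For anyone curious, the lazy loading is the usual "populate on expand" approach. A minimal sketch of the idea (assuming a WinForms TreeView called treeView; GetChildSignatures is a hypothetical helper, and SigExplorer's real code will differ):

using System.Windows.Forms;

// Populate a branch only when the user expands it.
treeView.BeforeExpand += (sender, e) =>
{
    // A single placeholder child marks branches that haven't been populated yet.
    if (e.Node.Nodes.Count == 1 && e.Node.Nodes[0].Text == "loading...")
    {
        e.Node.Nodes.Clear();
        foreach (var sig in GetChildSignatures(e.Node))
        {
            e.Node.Nodes.Add(new TreeNode(sig));
        }
    }
};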

Exploring BlobSync in depth (aka bandwidth savings for Azure Blob Storage).

After receiving some more interest in the BlobSync project (Github and Nuget), I thought I'd go into more depth on what the delta uploads look like and how you can tell what BlobSync is really doing.

Firstly we’ll look at a simple example of uploading a text file, modifying it then uploading it again but this time only with the delta.

The text file I'm using is 1.4M in size, not really a situation where bandwidth savings are required, but it demonstrates the point. Firstly, the original upload:

 

blobupload1

 

So for the first upload (without a previous version of the blob existing in Azure Blob Storage) the full 1.4M had to be uploaded.

Then I edited text.txt: specifically, I added a few characters in the first part of the file and removed some other characters in the lower third. The second upload looks like:

blobupload2

 

So here we see the second time around we only needed to actually upload 3840 bytes. Definitely a good saving.

The question is, what *really* happened behind the scenes? To examine that we need an additional tool called SigExplorer (Github and binary), as well as the signature files associated with the blobs.

Details of what is contained in the signature files are covered in an earlier post, but the quick explanation is that a signature file contains hashes of "chunks" from our main files. If multiple signatures match then the chunks contain the same data and can therefore be reused. To get the signature files for the above blobs I needed to upload the first blob, grab its signature file, perform the second upload and then grab the updated signature file. This way I have 2 versions of the signature file for comparison. To determine the signature file name we look at the metadata of the blob; in this case I could see it was blob1.0.sig. Any decent Azure blob tool can find it for you; in my case I like Azure Storage Explorer:

signature

 

In this case I retrieved the metadata for my "blob1" blob and could see the "sigurl" metadata entry had the value "blob1.0.sig". This means the signature blob for blob1 is in the same container, with the name "blob1.0.sig".
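If you'd rather grab it programmatically than via a GUI tool, something along these lines works (a sketch assuming the classic Microsoft.WindowsAzure.Storage client library; the container name and connection string are placeholders):

using System.IO;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Blob;

// Locate and download the signature blob via the "sigurl" metadata entry.
var connectionString = "DefaultEndpointsProtocol=https;AccountName=...;AccountKey=...";
var account = CloudStorageAccount.Parse(connectionString);
var container = account.CreateCloudBlobClient().GetContainerReference("mycontainer");
var blob = container.GetBlockBlobReference("blob1");

blob.FetchAttributes();                   // populates blob.Metadata
var sigName = blob.Metadata["sigurl"];    // e.g. "blob1.0.sig"

// The signature blob lives in the same container as the main blob.
container.GetBlockBlobReference(sigName)
         .DownloadToFile(@"c:\temp\" + sigName, FileMode.Create);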

So I downloaded that, performed the second upload as shown above, then downloaded the new sig file. This leaves me with 2 signature files: one for the original blob and one for the new one.

Now that we have the data, let's use SigExplorer to look at what's actually happening. First load SigExplorer (sigexplorer.exe), then drag the original signature file into the left panel and the updated signature into the right one.

(In my case I've renamed them first-upload-blob1.0.sig and second-upload-blob1.0.sig.)

sigexplorer1

 

Even before we start expanding the tree structures we get a number of useful facts from the initial screen. As indicated by the red arrows we know the original blob size (1.4M) and how much of the data was shared when the second upload happened. In this case 1404708 bytes did NOT need to be re-uploaded because they were already available in Azure Blob Storage!! The final red arrow clearly tells us that only 3840 bytes were required to be uploaded. So instead of uploading the full 1.4M we only uploaded a measly 3.8k. Admittedly we already knew this because BlobSyncCmd.exe told us when we ran it, but hey, it's nice to see the signature files confirming it.

The 2 tree structures (left for original, right for updated) represent the individual blocks that comprise the blob. In the left tree we see 2000 (704). This tells us that, starting at the beginning of the file, we have 704 blocks of 2000 bytes each. The expanded tree showing 0, 2000, 4000 etc lists the individual blocks and the offsets at which they are located in the file. Honestly, the left tree is always fairly boring; the right hand tree is where the details really are. In this case we again start with blocks of 2000 bytes, but we only have 159 of them. Then we have a single block of 11 bytes (this represents *some* of the text that I added). Then we follow with 364 blocks of 2000 bytes, followed by a single block of 1829 bytes. This block was previously a 2000 byte block, but I had deleted 171 bytes when modifying the file. Then we continue with more blocks which are the same as in the original file.

If we expand these changed blocks we can see more details:

 

sigexplorer2

 

 

In the above image we can see a bunch of green nodes. These represent blocks that have been reused, ie NOT uploaded again. Green is good, green is your friend. The 2 numbers indicate where a block is located in the new blob and where it used to be in the old blob. So we can see that everything up to offset 314000 basically hasn't shifted. Offset 316000 is in black and represents a 2000 byte block that needed to be uploaded; this is because nowhere in the old blob did the exact same 2000 bytes exist in any block. Following that we see the single block of 11 bytes. So far, then, we've had to upload 2000 + 11 bytes.

We can also see that we are able to reuse more blocks even though they've moved location; for example, the 2000 byte block now located at 318011 used to exist at offset 318000 (obviously because of the 11 byte block we added just before it). This is the real key to why BlobSync (and similar methods) is so effective: we can find and reuse blocks even when they've moved anywhere else in the blob.

Now we’ll have a quick look at where I removed data:

 

sigexplorer3

 

In this case we can see we added a single block of 1829 bytes. This used to be a 2000 byte block, but as previously mentioned I removed some bytes. That means the old 2000 byte block couldn't be reused, so we had to discard it and introduce a new block.

 

In total from the signatures we can see we’ve had to upload 2000 + 11 + 1829 bytes (which  gives us the 3840 bytes we see at the top of the screen).

 

Now, it's all well and good to show a working example with text files that aren't really large, but what about something a little more practical? Say, a SQL Server backup file? Offsite backups are always useful and it would be great not to have to upload the entire thing every time.

 

sigexplorer-sql

 

Although these aren't huge files, we're now getting a bit more practical. In this case I backed up a database, made a couple of modifications and backed it up again. After uploading, grabbing the signature files etc, we can see the results.

In this case the second upload only required 340k to go over the wire instead of the original 13.8M. That’s a pretty good saving. As we can see many of the 2000 byte blocks were broken down/replaced by different sized blocks. Eventually we’ll want to combine some of these smaller blocks into larger blocks but that’s for a future post.

 

Happy Uploading!

BlobSync updates.

The BlobSync library (at Github and Nuget) has been updated with some small improvements and fixes. Also updated is the example executable (called BlobSyncCmd), which can be compiled from the source or downloaded here.

This sample executable is really just a wrapper around the library but is itself a fairly useful tool. Just as a reminder, the theory behind BlobSync can be read in an earlier post. This post covers setting up the executable and some examples of what you can do with it.

Firstly, download the latest release (version 0.2.1 at time of writing) then uncompress it.

There are only 3 configuration values in the BlobSyncCmd.exe.config file:

- AzureAccountKey : This is the usual account key available from the Azure Portal

- AzureAccountName: again, from the portal.

- SignatureSize: This defines the granularity at which the file will be uploaded to Azure Blob Storage (ABS). Each of these "blocks" can be independently replaced, which is the key to being able to update existing blobs. For example, if you set SignatureSize to 10000 (10k), upload a file, modify a single byte in the local copy and upload it again, then only the approx 10k delta will be uploaded (instead of the entire file). The smaller this value, the smaller the bandwidth requirements, although it also limits the overall size of the Azure blobs. This is because Azure Block Blobs (which BlobSync uses) have a maximum of 50000 blocks. So if you define a block (signature) to be 1k in size then the maximum size of the overall blob is 50000 * 1k == 50M.
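The 50000 block limit therefore ties SignatureSize directly to the maximum possible blob size. A trivial sketch of the arithmetic (names are mine, not from BlobSync):

// Maximum blob size = SignatureSize * 50000 blocks.
const int MaxBlocksPerBlob = 50000;

static long MaxBlobSize(long signatureSize)
{
    return signatureSize * MaxBlocksPerBlob;
}

// MaxBlobSize(1000)   == 50,000,000    (~50M)
// MaxBlobSize(100000) == 5,000,000,000 (~5G)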

 

BlobSyncCmd Examples:

 

I have a text file which is about 1.4M in size. I upload using BlobSyncCmd:

 

bs1

 

Given this is a brand new file all 1410366 bytes had to be uploaded.

 

Now I edit the file and modify a few lines and need to upload the file again:

 

bs2

 

Now we can see that although we originally had to upload 1410366 bytes, for the update we only had to transfer 9959 bytes. A HUGE saving!

 

Suppose someone else has the original file but now wants our latest updates; they can simply download the latest blob and update their existing OLD copy.

bs3

 

Here we can see their older version of test.txt (called test-version1.txt) is updated against what is available in Azure Blob Storage. Instead of having to download all 1.4M they only had to download 9999 bytes. Again, HUGE bandwidth savings!

 

During the upload not only is the blob itself uploaded, but another "signature" blob is generated and associated with the main blob. This signature blob holds the information used to determine which parts of a file can be reused and which need to be replaced during uploads and downloads. For experimentation purposes it is possible to generate these files locally and determine how much bandwidth *would* be used if a transfer were to happen.
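To make that concrete: a signature is essentially a hash over a chunk (block) of the file, so unchanged chunks can be recognised later. A very rough conceptual sketch of generating chunk hashes follows; this is not BlobSync's actual signature format (which holds more than this), and MD5 is used purely for illustration:

using System;
using System.IO;
using System.Security.Cryptography;

// Conceptual sketch only: hash fixed-size chunks of a file so unchanged chunks
// can be detected (and skipped) on a later upload.
static void PrintChunkHashes(string path, int chunkSize)
{
    using (var md5 = MD5.Create())
    using (var stream = File.OpenRead(path))
    {
        var buffer = new byte[chunkSize];
        long offset = 0;
        int bytesRead;
        while ((bytesRead = stream.Read(buffer, 0, buffer.Length)) > 0)
        {
            var hash = md5.ComputeHash(buffer, 0, bytesRead);
            Console.WriteLine("offset {0}, size {1}: {2}",
                offset, bytesRead, BitConverter.ToString(hash));
            offset += bytesRead;
        }
    }
}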

 

So, repeating what we did earlier (but without actually transferring any file): suppose we have already uploaded a file, modified it locally and then want to see how many bytes would be transferred for an update:

bs4

 

In this case we have testblob and an updated local test.txt file. According to the estimate, if we wanted to upload the new version of test.txt then 10039 bytes would need to be transferred.

 

For the scenario where we haven’t uploaded ANYTHING to Azure Blob Storage but we want to see potential savings, we can do the following:

bs5

Here we generate a signature (the exact same signature that would have been uploaded to Azure Blob Storage), called c:\temp\test.txt.sig.

Now we modify test.txt and want to check how much would need to be uploaded to Azure Blob Storage IF we were to upload. For this we can use the 'estimatelocal' command.

bs6

This command takes the original signature file (which would normally be available from Azure Blob Storage) and checks it against the modified test.txt.

In the example above it tells us that we’d just need to upload 9290 bytes to perform the update.
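Under the covers an estimate like this boils down to: how many bytes of the modified file are not already covered by an existing signature? A rough illustration of that accounting is below (fixed chunk boundaries and hypothetical names only; BlobSync's real search also matches blocks that have moved to different offsets, as we saw in SigExplorer):

using System;
using System.Collections.Generic;
using System.IO;
using System.Security.Cryptography;

// Count the bytes of the new file whose chunk hash doesn't appear among the
// existing signatures. Purely illustrative, not BlobSync's actual algorithm.
static long EstimateBytesToUpload(string newFilePath, ISet<string> existingChunkHashes, int chunkSize)
{
    long bytesToUpload = 0;
    using (var md5 = MD5.Create())
    using (var stream = File.OpenRead(newFilePath))
    {
        var buffer = new byte[chunkSize];
        int bytesRead;
        while ((bytesRead = stream.Read(buffer, 0, buffer.Length)) > 0)
        {
            var hash = BitConverter.ToString(md5.ComputeHash(buffer, 0, bytesRead));
            if (!existingChunkHashes.Contains(hash))
            {
                bytesToUpload += bytesRead;   // no matching signature: these bytes must be uploaded
            }
        }
    }
    return bytesToUpload;
}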

 

The ‘createsig’ and ‘estimatelocal’ commands are really just for testing various file/file-types to see how well BlobSync would work for those scenarios.

AzureCopy 0.16.0 out!

So, a lot has changed but at the same time there aren't many visible differences. The main changes for 0.16.0 are:

- Updated all dependent libs to latest and greatest (AWS etc etc).

- Skydrive integration has been replaced with Onedrive, which (despite what MS initially said) means a few API changes. So far, so good.

- Small modification to how SkyDrive/OneDrive is configured (but still use the --configonedrive flag).

- Misc refactoring.

 

The Nuget package, Github source and compiled executable have been updated.

Developing with a Surface Pro 2

This is a quick follow-up to my original post on developing with a Surface Pro 1.

After thoroughly enjoying my original Surface Pro I decided to upgrade to a Surface Pro 2 (256GB) as my main machine (both for coding and non coding). The main aim of the upgrade was to get the 8G RAM as opposed to the Pro 1’s 4G limit.

 

Wow… simply, wow!

 

Given that I spend virtually all my time in Visual Studio (which can certainly be a memory/CPU hog at times), it was going to be the make-or-break application for the Surface. If it ran badly, no Surface work for me. Fortunately the machine doesn't skip a beat. Yes, it doesn't compile my projects as quickly as a 3GHz i7 with 32G of RAM, but I really really don't care about that. My main project at work is a touch under 100k lines of C# (and a bazillion lines of JS), and it cleans and rebuilds in around 26 seconds. Given I don't normally completely clean and rebuild every single time I compile (usually I just hit F6 for a "build"), in practice my compile times are around 8 seconds. I can definitely live with that. So that's a tick for being able to handle day to day workloads.

 

My usual workload on the Surface Pro 2 is Windows 8.1 Pro (Update 1), Visual Studio 2013, SQL Server 2012, IIS Express, SSMS, Skype for Desktop, iTunes, Sublime Text, Evernote, many powershell/command prompts, SourceTree, Github for Windows and anywhere between 5 and 100 Chrome tabs. Although I'd call myself "slightly OCD" when it comes to monitoring memory usage, I'm very happy with how things are running. Currently I have a system commit of around 5.4G with physical memory usage at 4.5G, so there's plenty of room for VS to expand and consume (all of it). The single biggest jump in system commit and physical memory happens once I start debugging the main project in VS; then the private bytes jump to almost 1G, but hey, that's a developer's life…

I’ve had plenty of 8G and 16G RAM based machines previously (hell even a 32G at one stage) but I’m still consistently surprised by how much this “mere tablet/ultrabook” gets done. As for a general purpose development machine I can’t really fault it.

That's not to say I'm careless about leaving non-essential memory-hogging processes running; eg I'll turn SQL Server on/off as required.

It’s compact, works great with USB based docking stations, speedy (enough) and handles all work loads that I throw at it.

 

but…

 

Although I was a fan of the kickstand at the beginning, I have to admit that on a train (where I am right now) it's not the most comfortable thing to use. Still, I mainly use it at a desk, so it's not a big problem.

BlobSync Nuget package released

After much tinkering about with Blob Updating, I’ve decided to release a Nuget package and see if anyone is interested.

The source is available on Github and the theory behind the logic used is in a previous post. What I’d like to describe here is the practical use of the newly released Nuget package “BlobSync”.

The BlobSync library aims to be cloud agnostic, but in reality Azure Blob Storage is currently the only implementation available. So, to begin with, create a Windows Console application and add the Nuget package "BlobSync" to the reference assemblies. At the time of writing the latest (and only) version available is 0.1.0.

Next we’ll need to modify the App.config to include the Azure account information from the Azure Portal. Open the app.config and add the entries:

<appSettings>
  <add key="AzureAccountKey" value="Your Account Key"/>
  <add key="AzureAccountName" value="Your Account Name" />
  <add key="SignatureSize" value="100000" />
</appSettings>

 

Hopefully the key and name settings are self explanatory, but let's dig a bit deeper into "SignatureSize".

Whenever a blob is uploaded it will be broken into "SignatureSize" sized chunks (Blocks in Azure lingo). So say I have a file that is 250000 bytes in size and SignatureSize is the 100000 shown above; this means I'll end up with 2 x 100k blocks as well as a 50k block (the remaining bytes simply make up a block of the appropriate size).
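A quick sketch of that carving-up (just the arithmetic, not BlobSync's actual code):

using System.Collections.Generic;

// Split a file of a given size into SignatureSize'd blocks plus a final remainder block.
static IEnumerable<long> BlockSizes(long fileSize, long signatureSize)
{
    long remaining = fileSize;
    while (remaining >= signatureSize)
    {
        yield return signatureSize;
        remaining -= signatureSize;
    }
    if (remaining > 0)
    {
        yield return remaining;   // the final, smaller block
    }
}

// BlockSizes(250000, 100000) => 100000, 100000, 50000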

This means that when we attempt to upload a modified version of the file later on, we'll have the option of replacing 1 or more of these blocks. Now, in this particular case we won't be saving much, but this is just the beginning. Say we deal with very large files; then uploading just 100k as opposed to the entire 300M file is definitely a saving worth considering.

We also have the option of changing the SignatureSize to something else (anywhere between 1 byte and 4M in reality). If we want finer grained replacement we can reduce the SignatureSize to 1k (for example), but we need to remember that Azure Block Blobs can only be constructed from 50000 blocks. This means that if all our blocks are 1k in size then the maximum size of our block blob is 1k * 50000 == 50M, ie not that big. I've found that 100k is a good starting point.

 

Now, to get coding…

 

We're going to use the class "BlobSync.AzureOps" for this example. You can see via intellisense that there are 6 methods of potential interest. In reality calling code should only ever be concerned with 2 of them: "DownloadBlob" and "UploadFile". I think the names are pretty self explanatory.

So to upload a file to Azure Blob Storage, we can do the following:


var blobSyncClient = new BlobSync.AzureOps();
blobSyncClient.UploadFile("mycontainer", "myblob", "c:\\temp\\myfile.txt");

Assuming you have a container called “mycontainer” and a local file “c:\temp\myfile.txt” then you’ll end up with 2 blobs in Azure Blob Storage. The first will be called “myblob” and this has the same contents as “myfile.txt”. The second will be called “myblob.0.sig”, which I’ll call the Signature Blob. This signature blob contains information about “myblob” which will be used when any further uploads or downloads occur.

 

Say you now modify “c:\temp\myfile.txt” and want to update the version in the blob.

You can now execute the exact same 2 lines as before and this time the BlobSync library will perform a number of tasks:

 

1) Checks to see if a signature blob exists.

2) Downloads signature file

3) Uses the information in the signature file to determine which parts of the local file have been modified (compared to the existing blob).

4) Uploads the changes to Azure Blob Storage.

5) Generates a new signature file and uploads it.

 

Now the blob and the local file should be identical but with the minimum data transferred over the wire.

 

Downloading works pretty much the same way.

 

If you modify the local version and then decide that you want the version in the blob, you simply run the code:


var blobSyncClient = new BlobSync.AzureOps();
blobSyncClient.DownloadBlob("mycontainer", "myblob", "c:\\temp\\myfile.txt");

 

Then the BlobSync library will perform these steps:

1) Checks to see if a signature blob exists.

2) Downloads signature file

3) Uses the information in the signature file to determine which parts of the local file have been modified (compared to the existing blob).

4) Downloads only those blocks from Azure Blob Storage that are not already available in the local file.

5) Reconstructs the local file based on the changes downloaded.

 

Currently BlobSync focuses on reducing bandwidth requirements and isn't optimised for the quickest possible transfers. In practice it probably is quicker, but it doesn't go out of its way to parallelise downloads/uploads etc. This is something I'll be adding soon to speed things up.

If anyone has any improvements or suggestions, please leave a comment.