AzureCopy 0.16.0 out!

So, a lot has changed under the hood, but at the same time there aren’t many visible differences. The main changes for 0.16.0 are:

– Updated all dependent libs to latest and greatest (AWS etc etc).

– SkyDrive integration has been replaced with OneDrive, which (despite what MS initially said) means a few API changes. So far, so good.

– Small modification to how SkyDrive/OneDrive is configured (but still use the –configonedrive flag).

– Misc refactoring.

 

The Nuget package, Github source and compiled executable have been updated.


Developing with a Surface Pro 2

This is a quick follow-up to my original post on developing with a Surface Pro 1.

After thoroughly enjoying my original Surface Pro I decided to upgrade to a Surface Pro 2 (256GB) as my main machine (both for coding and non-coding use). The main aim of the upgrade was to get 8G of RAM as opposed to the Pro 1’s 4G limit.

 

Wow… simply, wow!

 

Given that I spend virtually all my time in Visual Studio (which can certainly be a memory/CPU hog at times), it was going to be the make-or-break application for the Surface. If it ran badly, no Surface work for me. Fortunately the machine doesn’t skip a beat. Yes, it doesn’t compile my projects as quickly as a 3GHz i7 with 32G of RAM, but I really really don’t care about that. My main project at work is a touch under 100k lines of C# (and a bazillion lines of JS), and it cleans and rebuilds in around 26 seconds. Given I don’t normally completely clean and rebuild every single time I compile (usually I just hit F6 for a “build”), my compile times are in practice around 8 seconds. I can definitely live with that. So that’s a tick for being able to handle day to day workloads.

 

My usual workload on the Surface Pro 2 is Windows 8.1 Pro (Update 1), Visual Studio 2013, SQL Server 2012, IIS Express, SSMS, Skype for Desktop, iTunes, Sublime Text, Evernote, many PowerShell/command prompts, SourceTree, Github for Windows and anywhere between 5 and 100 Chrome tabs. Although I’d call myself “slightly OCD” when it comes to monitoring memory usage, I’m very happy with how things are running. Currently I’ve a system commit of around 5.4G with physical memory usage at 4.5G, so there is plenty of room for VS to expand and consume. The single biggest jump in system commit and physical memory happens once I start debugging the main project in VS, when the private bytes jump to almost 1G, but hey, that’s a developer’s life….

I’ve had plenty of 8G and 16G RAM based machines previously (hell, even a 32G at one stage) but I’m still consistently surprised by how much this “mere tablet/ultrabook” gets done. As a general purpose development machine I can’t really fault it.

That’s not to say I’m careless about leaving non-essential memory-hogging processes running, e.g. I’ll turn SQL Server on/off as required.

It’s compact, works great with USB based docking stations, is speedy (enough) and handles all workloads that I throw at it.

 

but…

 

Although I was a fan of the kickstand at the beginning, I have to admit that on a train (where I am currently) it’s not the most comfortable thing to use. Still, I mainly use it at a desk so it’s not a big problem.

BlobSync Nuget package released

After much tinkering about with Blob Updating, I’ve decided to release a Nuget package and see if anyone is interested.

The source is available on Github and the theory behind the logic used is in a previous post. What I’d like to describe here is the practical use of the newly released Nuget package “BlobSync”.

The BlobSync library is targeted to be cloud agnostic but in reality Azure Blob Storage is currently the only implementation available. So, to begin with, create a Windows Console application and add the Nuget package “BlobSync” to the reference assemblies. At the time of writing the latest (and only) version available is 0.1.0.

Next we’ll need to modify the App.config to include the Azure account information from the Azure Portal. Open the App.config and add the entries:

<appSettings>
  <add key="AzureAccountKey" value="Your Account Key" />
  <add key="AzureAccountName" value="Your Account Name" />
  <add key="SignatureSize" value="100000" />
</appSettings>

 

Hopefully the key and name settings are self-explanatory, but let’s dig a bit deeper into “SignatureSize”.

Whenever a blob is uploaded it will be broken into “SignatureSize”-sized chunks (blocks, in Azure lingo). So say I have a file that is 250000 bytes in size; this means I’ll end up with 2 x 100k blocks as well as a 50k block (the remaining bytes simply make up a block of the appropriate size).

This means that when we attempt to upload a modified version of the file later on, we’ll have the option of replacing 1 or more of these blocks. Now, in this particular case we won’t be saving much, but this is just the beginning. Say we deal with very large files; then simply uploading 100k as opposed to uploading the entire 300M file is definitely a saving worth considering.

We also have the option of changing the SignatureSize to something else (anywhere between 1 byte and 4M in reality). If we want finer-grained replacement then we can reduce the SignatureSize to 1k (for example), but we need to remember that Azure block blobs can only be constructed from 50000 blocks. This means that if all our blocks are 1k in size then the maximum size our block blob can be is 1k * 50000 == 50M, i.e. not that big. I’ve found that 100k is a good starting point.
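To make the arithmetic concrete, here’s a quick sketch (not library code; the numbers just mirror the examples above):

// Quick sketch of the block arithmetic described above (not library code).
long signatureSize = 100000;                 // "SignatureSize" from App.config
long fileSize = 250000;                      // example file from above

long fullBlocks = fileSize / signatureSize;  // 2 x 100k blocks
long remainder = fileSize % signatureSize;   // plus 1 x 50k block

// Azure block blobs can contain at most 50000 blocks, so the block size
// caps the maximum blob size:
long maxBlobSizeAt1k = 1000L * 50000;        // 1k blocks   => ~50M maximum
long maxBlobSizeAt100k = 100000L * 50000;    // 100k blocks => ~5G maximum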

 

Now, to get coding…

 

We’re going to use the class “BlobSync.AzureOps” for this example. You can see via IntelliSense that there are 6 methods of potential interest. In reality the calling code should only ever be concerned with 2 of them, “DownloadBlob” and “UploadFile”. I think the names are pretty self-explanatory.

So to upload a file to Azure Blob Storage, we can do the following:


var blobSyncClient = new BlobSync.AzureOps();
blobSyncClient.UploadFile("mycontainer", "myblob", "c:\\temp\\myfile.txt");

Assuming you have a container called “mycontainer” and a local file “c:\temp\myfile.txt” then you’ll end up with 2 blobs in Azure Blob Storage. The first will be called “myblob” and this has the same contents as “myfile.txt”. The second will be called “myblob.0.sig”, which I’ll call the Signature Blob. This signature blob contains information about “myblob” which will be used when any further uploads or downloads occur.
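If you want to see both blobs for yourself, a quick check with the regular Azure storage client library (not part of BlobSync; the connection string is a placeholder) might look like this:

using System;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Blob;

// List everything in "mycontainer" - you should see "myblob" and "myblob.0.sig".
var account = CloudStorageAccount.Parse("your-storage-connection-string");
var container = account.CreateCloudBlobClient().GetContainerReference("mycontainer");

foreach (var item in container.ListBlobs())
{
    Console.WriteLine(item.Uri);
}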

 

Say you now modify “c:\temp\myfile.txt” and want to update the version in the blob.

You can now execute the exact same 2 lines as before and this time the BlobSync library will perform a number of tasks:

 

1) Checks to see if a signature blob exists.

2) Downloads the signature file.

3) Uses the information in the signature file to determine which parts of the local file have been modified (compared to the existing blob).

4) Uploads the changes to Azure Blob Storage.

5) Generates a new signature file and uploads it.

 

Now the blob and the local file should be identical but with the minimum data transferred over the wire.
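Putting that together, a minimal round trip might look like the following (just a sketch; the container, blob and file names are the same placeholders as above):

// Change the local file, then upload again - only the modified blocks
// (plus a fresh signature blob) should go over the wire.
System.IO.File.AppendAllText("c:\\temp\\myfile.txt", "some new content");

var blobSyncClient = new BlobSync.AzureOps();
blobSyncClient.UploadFile("mycontainer", "myblob", "c:\\temp\\myfile.txt");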

 

Downloading works pretty much the same way.

 

If you modify the local version and then decide that you want the version in the blob, you simply run the code:


var blobSyncClient = new BlobSync.AzureOps();
blobSyncClient.DownloadBlob("mycontainer", "myblob", "c:\\temp\\myfile.txt");

 

The BlobSync library will then perform these steps:

1) Checks to see if a signature blob exists.

2) Downloads the signature file.

3) Uses the information in the signature file to determine which parts of the local file have been modified (compared to the existing blob).

4) Downloads only those blocks from Azure Blob Storage that are not already available in the local file.

5) Reconstructs the local file based on the changes downloaded.

 

Currently BlobSync is aimed at reducing bandwidth requirements and isn’t optimised for the quickest possible transfers. In reality it probably is quicker, but it doesn’t go out of its way to parallelise downloads/uploads etc. This is something I’ll be adding soon to speed things up.

If anyone has any improvements or suggestions, please leave a comment.

Optimising Azure Blob Updating Part 2

In my previous post I covered the high level theory for an efficient method of updating Azure Block Blobs. (This theory can be used against most cloud storage providers as long as their blob APIs allow partial uploads/downloads.)

The implementation of what was described is now available on Github.

I’ll go through the implementation specifics in a later post, but for those who want to try out BlobSync, simply clone the github repo. Compile (I’m using Visual Studio 2013) and you’ll end up with a binary “BlobSyncCmd.exe”. Some examples of using the command are:

 

blobsynccmd upload c:\temp\myfile mycontainer myblob

 

This will upload the local “myfile” to Azure, in the appropriate container with the appropriate name.

Then, feel free to modify your local file and upload it again. If you use any network monitoring tools you should see a dramatic reduction in the uploaded bytes (assuming you don’t modify the entire file).

Equally you can run the command:

 

blobsynccmd download c:\temp\myfile mycontainer myblob

 

This will download the blob and reuse as much of “myfile” as it can. Currently, for testing purposes, it won’t replace myfile but will create myfile.new.

 

More to come.

Small AzureCopy update.

Just a quick update to both the AzureCopy executable and the Nuget package. Previously the Skydrive code considered the “type” of an element to be either “file” or “folder”; the catch is that Skydrive labels things differently. If I tried to copy a png from Skydrive to anywhere it would never get detected and copied, because Skydrive says it isn’t a “file” but an “image”. This is now rectified in AzureCopy.

 

Enjoy.

Optimising Azure Blob Updating (Part 1)

Cloud storage these days really allows any volume of data to be geo-redundantly stored, always available and at a fraction of the price of 10 years ago. This is “a good thing”. One common problem I’ve seen is the amount of bandwidth wasted when updating existing blobs. Say you have a 10M file in cloud storage: you download it and modify a small section, so how do you update the version in the cloud?

1) Upload the entire file again. Wasteful of bandwidth, time and money, but sadly often the solution used since it’s the easiest option.

2) Keep some internal tracking system to determine what’s been changed and what hasn’t. Then use this information to only upload the modified part of the blob.

What follows is an enhanced version of #2 which can dramatically reduce bandwidth requirements when updating blobs in Azure Blob Storage.

For those who don’t know about Azure Block Blobs, the basic idea is that blobs (files in the cloud) are broken into a number of chunks/blocks. Each block can be uploaded/downloaded/replaced/deleted individually, which in turn means manipulation of the blob can be done in chunks rather than all or nothing (see this for more details).
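As a rough illustration of that block API (this uses the standard Azure storage client library directly, not BlobSync; the connection string, container and blob names are placeholders):

using System;
using System.Collections.Generic;
using System.IO;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Blob;

// Split a local file into 100k chunks, upload each chunk as a block,
// then commit the ordered block list to form the final blob.
var account = CloudStorageAccount.Parse("your-storage-connection-string");
var blob = account.CreateCloudBlobClient()
                  .GetContainerReference("mycontainer")
                  .GetBlockBlobReference("myblob");

var data = File.ReadAllBytes("c:\\temp\\myfile.txt");
var blockIds = new List<string>();

for (var offset = 0; offset < data.Length; offset += 100000)
{
    var size = Math.Min(100000, data.Length - offset);
    // Block IDs must be base64 encoded and all the same length for a given blob.
    var blockId = Convert.ToBase64String(BitConverter.GetBytes(offset));
    blob.PutBlock(blockId, new MemoryStream(data, offset, size), null);
    blockIds.Add(blockId);
}

blob.PutBlockList(blockIds);   // the blob is now the concatenation of these blocks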

Anyway, back to the problem. Imagine you have a blob, you then download the blob, modify it and now want to upload the new version.

[Image: blob1]

For this scenario the problem is easy. We have a blob (top row) which is made of 4×100 byte blocks. Some of the contents of the second block (between bytes 100 and 200) are replaced. The size and more importantly the offset locations of all blocks stay consistent. Determining that some of the blocks are unmodified is easy, and we simply upload the new version of the second block. Unfortunately the real world isn’t always like this. What happens when we get this situation?

[Image: blob2]

In this scenario the “uploading program” needs to determine which blocks can be reused and which parts need to be replaced. The contents of blocks A, C and D exist in the cloud blob (top row) as well as in the new version of the file (bottom row). The problem is that although the contents of blocks C and D exist in the new file, their locations in the file have moved. This is the challenge: detecting that blocks in the cloud can be reused even though their location in the new blob has moved.

Now that we know the problem (data blocks are available for reuse but are in unexpected offsets) we can start searching for a solution. The approach I’ve taken is to keep some unique signatures of each block already in the cloud and then look for the same signatures (hashes) in the new version of the file which is being uploaded.

The calculations required to find the new offsets are huuuuuge, well potentially huge, well “quite large” would cover it. For each block that exists in the Azure blob we need to search at every byte offset in the new file. To put it simply, if the file is 100M in size, and we’re searching for a block that is 10M in size, then the number of comparisons required is (approx) 100 million – 10 million = 90 million.

For example:

[Image: blob3]

In the above diagram, we want to determine if block C (that already exists in the cloud) also exists in the updated version of the file.

The process taken is:

0) Set offset to 0.

1) Let SizeC represent the size (in bytes) of block C.

2) Let SigC represent the unique signature of block C.

3) Read SizeC bytes (starting at offset) of the new file into a byte array.

4) Generate the signature of byte array.

5) If the new signature matches SigC then we know that when we’re uploading the new file we’re able to reuse block C!

6) If the new signature does NOT match SigC, then increment offset by 1 and jump back to step 3.

As the diagram shows, eventually we’ll end up finding block C and therefore know we do not need to upload that part of the new file. What I hope is obvious is that a LOT of signature generation needs to happen as well as lots of comparisons.
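A brute-force version of that search might look something like this (a sketch only; the signature here is just a simple byte sum standing in for the real hash):

// Slide a SizeC-byte window over the new file one byte at a time and compare
// each window's signature against SigC. This is a full signature calculation
// at every offset - exactly the cost the rolling hash removes.
static long FindBlockOffset(byte[] newFile, int sizeC, uint sigC)
{
    for (long offset = 0; offset + sizeC <= newFile.Length; offset++)
    {
        if (SimpleSignature(newFile, offset, sizeC) == sigC)
            return offset;   // block C's contents found at this offset
    }
    return -1;               // block C does not appear in the new file
}

// Placeholder signature: a plain sum of the bytes in the window.
static uint SimpleSignature(byte[] data, long offset, int length)
{
    uint sig = 0;
    for (long i = 0; i < length; i++)
        sig += data[offset + i];
    return sig;
}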

The key for the entire process to become practical is the ability to do VERY quick signature generation so the 90M calculations don’t become an issue. Enter “rolling hash signatures” (see Wikipedia for more detailed explanation). Be warned, if you Google/Bing for rolling hash, you’ll probably get some rather different results to what you were expecting. 🙂

The way rolling hash signatures are generated is essential for this process to be quick enough to be practical. There are 2 ways of generating the signature:

Firstly, you can read N bytes from a file, perform some calculation on the array of bytes and end up with your signature. Easy peasy, but “slow”.

The other option (and this is the magic) is that if you have already generated a signature for bytes 0 to 3 (for example) you can simply generate the signature for bytes 1 to 4 (ie shifting the byte array by 1) by performing a simple calculation based off the old signature.

For example:

[Image: blob4]

Now, it’s not literally Sig0 – previous byte + next byte, but it’s pretty close. We’re able to calculate signatures quickly and easily, which allows us to detect common byte arrays between the new file and the existing blob.
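To make that concrete, here’s a deliberately simplified rolling checksum (just a byte sum, noticeably weaker than what’s actually used, but the roll step has the same shape):

// Full calculation for the first window: sum every byte (the "slow" way).
static uint InitialSignature(byte[] data, int offset, int windowSize)
{
    uint sig = 0;
    for (var i = 0; i < windowSize; i++)
        sig += data[offset + i];
    return sig;
}

// Rolling update: shift the window one byte to the right in O(1) by
// subtracting the byte that falls out and adding the byte that comes in.
static uint RollSignature(uint previousSig, byte outgoingByte, byte incomingByte)
{
    return previousSig - outgoingByte + incomingByte;
}

Scanning a 100M file for a 10M block then costs roughly one subtraction and one addition per offset, rather than re-summing 10M bytes at every position.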

Although I haven’t yet covered the precise algorithm used for the signature generation, we now have the basic building blocks for determining which parts of an updated blob actually need to be uploaded.

Steps for updating a modified block-based blob:

(The assumption is that when the blob was originally uploaded, the block signatures were also calculated and uploaded, which is a trivially easy thing to do.)

1) Download Blob from Azure.

2) Download Block signatures from Azure.

3) Modify the downloaded blob/file to your heart’s delight.

4) Now we need to determine which blocks that already exist in the cloud can be reused (ie we don’t need to upload that data) and which parts have been modified.

5) Loop through every block signature downloaded

5.1) Perform the rolling signature check for entire new file.

5.2) If found, make a note of which Azure block can be reused.

5.3) If not found, make note of which bytes are “new” and need to be uploaded.

6) You now have 2 lists: one of new bytes to upload (offsets in the new file) and one of Azure blocks that can be reused.

7) Upload the “new bytes” as their own Azure blob blocks.

8) Instruct Azure Blob Storage which blocks (new and old) can be used to construct a blob which is bitwise identical to the modified local file.

 

All of the above is implemented and available on Github and eventually on Nuget. The next post will cover how to practically use these libraries and what the future plans are.

btw, for anyone taking notes, yes the entire blog post could have been summarised as “RSync + Azure Block Blobs”…  but I thought I’d flesh it out a little 🙂