Optimising Azure Blob Updating (Part 2)

In my previous post I covered the high-level theory for an efficient method of updating Azure Block Blobs. (This theory can be applied to most cloud storage providers, as long as their blob APIs allow partial uploads/downloads.)

The implementation of what was described is now available on Github.

I’ll go through the implementation specifics in a later post, but for those who want to try out BlobSync, simply clone the github repo. Compile (I’m using Visual Studio 2013) and you’ll end up with a binary “BlobSyncCmd.exe”. Some examples of using the command are:

 

blobsynccmd upload c:\temp\myfile  mycontainer  myblob

 

This will upload the local “myfile” to Azure, in the appropriate container with the appropriate name.

Then, feel free to modify your local file and upload it again. If you use any network monitoring tools you should see a dramatic reduction in the uploaded bytes (assuming you don’t modify the entire file).

Equally you can run the command:

 

blobsynccmd download c:\temp\myfile mycontainer myblob

 

This will download the blob and reuse as much of “myfile” as it can. Currently, for testing purposes, it won’t replace myfile but will create myfile.new.

 

More to come.


Small AzureCopy update.

Just a quick update to both the AzureCopy executable and the Nuget package. Previously the Skydrive code only considered elements to be “files” or “folders”; the catch is that Skydrive labels things differently. If I tried to copy a png from Skydrive to anywhere, it would never get detected and copied, because Skydrive reports it not as a “file” but as an “image”. This is now rectified in AzureCopy.

 

Enjoy.

Optimising Azure Blob Updating (Part 1)

Cloud storage these days really allows any volume of data to be stored geo-redundantly, always available and at a fraction of the price of 10 years ago. This is “a good thing”. One common problem I’ve seen is the amount of bandwidth wasted when updating existing blobs. Say you have a 10M file in cloud storage; you download it and modify a small section of it, so how do you update the version in the cloud?

1) Upload the entire file again. Wasteful of bandwidth, time and money, but sadly often the solution used since it’s the easiest option.

2) Keep some internal tracking system to determine what’s been changed and what hasn’t. Then use this information to only upload the modified part of the blob.

What follows is an enhanced version of #2 which can dramatically reduce bandwidth requirements when updating blobs in Azure Blob Storage.

For those who don’t know about Azure Block Blobs, the basic idea is that blobs (files in the cloud) are broken into a number of chunks/blocks. Each block can be uploaded/downloaded/replaced/deleted individually, which in turn means manipulation of the blob can be done in chunks rather than all or nothing. (See this for more details.)
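To make the block structure a little more concrete, here’s a minimal sketch (using the classic WindowsAzure.Storage client library directly, not BlobSync) that lists the committed blocks of an existing block blob. The connection string, container and blob names are placeholders.

using System;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Blob;

class ListBlocksSketch
{
    static void Main()
    {
        // Placeholder credentials/names.
        var account = CloudStorageAccount.Parse(
            "DefaultEndpointsProtocol=https;AccountName=myaccount;AccountKey=mykey");
        var blob = account.CreateCloudBlobClient()
                          .GetContainerReference("mycontainer")
                          .GetBlockBlobReference("myblob");

        // Every committed block has its own (base64) id and length, and can be
        // replaced individually via PutBlock/PutBlockList.
        foreach (var block in blob.DownloadBlockList())
        {
            Console.WriteLine("Block {0}: {1} bytes", block.Name, block.Length);
        }
    }
}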

Anyway, back to the problem. Imagine you have a blob, you then download the blob, modify it and now want to upload the new version.

[Image: blob1]

For this scenario the problem is easy. We have a blob (top row) which is made of 4×100 byte blocks. Some of the contents of the second block (between bytes 100 and 200) are replaced. The size and more importantly the offset locations of all blocks stay consistent. Determining that some of the blocks are unmodified is easy, and we simply upload the new version of the second block. Unfortunately the real world isn’t always like this. What happens when we get this situation?

[Image: blob2]

In this scenario the “uploading program” needs to determine which blocks can be reused and which parts need to be replaced. The contents of blocks A, C and D exist in the cloud blob (top row) as well as in the new version of the file (bottom row). The problem is that although the contents of blocks C and D exist in the new file, their locations within the file have moved. This is the challenge: detecting that blocks in the cloud can be reused even though their locations in the new blob have moved.

Now that we know the problem (data blocks are available for reuse but are in unexpected offsets) we can start searching for a solution. The approach I’ve taken is to keep some unique signatures of each block already in the cloud and then look for the same signatures (hashes) in the new version of the file which is being uploaded.

The calculations required to find the new offsets are huuuuuge, well potentially huge, well “quite large” would cover it. For each block that exists in the Azure blob we need to search at every byte offset in the new file. To put it simply, if the file is 100M in size, and we’re searching for a block that is 10M in size, then the number of comparisons required is (approx) 100 million – 10 million = 90 million.

For example:

[Image: blob3]

In the above diagram, we want to determine if block C (that already exists in the cloud) also exists in the updated version of the file.

The process taken is:

0) Set offset to 0.

1) Let SizeC represent the size (in bytes) of block C.

2) Let SigC represent the unique signature of block C.

3) Read SizeC bytes (starting at offset) of the new file into a byte array.

4) Generate the signature of that byte array.

5) If the new signature matches SigC then we know that when we’re uploading the new file we’re able to reuse block C!

6) If the new signature does NOT match SigC, then increment offset by 1 and jump back to step 3.

As the diagram shows, eventually we’ll end up finding block C and therefore know we do not need to upload that part of the new file. What I hope is obvious is that a LOT of signature generation needs to happen as well as lots of comparisons.
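In rough C# (this is just a sketch of the loop above, not the actual BlobSync code), the search for a single block looks something like the following. ComputeSignature is a stand-in for whatever signature function is used; without the rolling trick described next, recomputing it at every offset is what makes the naive approach so expensive.

// Sketch only: find the offset in the new file at which a block with
// signature sigC and size sizeC occurs, or return -1 if it doesn't.
static int FindBlockOffset(byte[] newFile, uint sigC, int sizeC)
{
    for (var offset = 0; offset + sizeC <= newFile.Length; offset++)
    {
        if (ComputeSignature(newFile, offset, sizeC) == sigC)
        {
            return offset;      // block C can be reused from this offset
        }
    }
    return -1;                  // block C no longer exists in the new file
}

// Deliberately simplistic stand-in signature; the real thing is the rolling
// hash discussed below (usually paired with a stronger hash to rule out
// false matches).
static uint ComputeSignature(byte[] data, int offset, int length)
{
    uint sig = 0;
    for (var i = offset; i < offset + length; i++)
    {
        sig = sig * 31 + data[i];
    }
    return sig;
}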

The key for the entire process to become practical is the ability to do VERY quick signature generation so the 90M calculations don’t become an issue. Enter “rolling hash signatures” (see Wikipedia for more detailed explanation). Be warned, if you Google/Bing for rolling hash, you’ll probably get some rather different results to what you were expecting. 🙂

The way rolling hash signatures are generated is essential for this process to be quick enough to be practical. There are 2 ways of generating the signature:

Firstly, you can read N bytes from a file, perform some calculation on the array of bytes and end up with your signature. Easy peasy, but “slow”.

The other option (and this is the magic) is that if you have already generated a signature for bytes 0 to 3 (for example) you can simply generate the signature for bytes 1 to 4 (ie shifting the byte array by 1) by performing a simple calculation based off the old signature.

For example:

[Image: blob4]

Now, it’s not literally Sig0 – previous byte + next byte, but it’s pretty close. Being able to calculate signatures this quickly and easily is what allows us to detect common byte arrays between the new file and the existing blob.
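As an illustration of that “old signature minus outgoing byte plus incoming byte” idea, here’s a small rolling checksum in the Adler-32/rsync style. To be clear, this is not necessarily the exact signature algorithm BlobSync uses; it just shows why sliding the window costs a couple of additions rather than a full recalculation.

// Illustrative rolling checksum (Adler-32 / rsync weak-checksum style).
// Not necessarily the exact signature algorithm BlobSync uses.
class RollingChecksum
{
    const uint Mod = 65521;     // largest prime below 2^16
    uint a, b;
    int windowSize;

    // Compute the signature of an initial window from scratch (the "slow" way).
    public uint Initialise(byte[] data, int offset, int length)
    {
        a = 0;
        b = 0;
        windowSize = length;
        for (var i = 0; i < length; i++)
        {
            a = (a + data[offset + i]) % Mod;
            b = (b + a) % Mod;
        }
        return (b << 16) | a;
    }

    // Slide the window one byte to the right: drop the byte leaving the window,
    // add the byte entering it. O(1) per offset, which is what makes checking
    // ~90 million offsets feasible.
    public uint Roll(byte outgoing, byte incoming)
    {
        a = (a + Mod - outgoing + incoming) % Mod;
        var outgoingContribution = (uint)((long)windowSize % Mod * outgoing % Mod);
        b = (b + Mod - outgoingContribution + a) % Mod;
        return (b << 16) | a;
    }
}

Initialise only ever needs to be called once per block size; every other offset in the file is covered by a call to Roll.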

Although I haven’t yet covered the precise algorithm used for the signature generation, we now have the basic building blocks for determining which parts of an updated blob actually need to be uploaded.

Steps for updating a modified block-based blob:

(The assumption is that when the blob was originally uploaded, the block signatures were also calculated and uploaded. This is a trivially easy thing to do.)

1) Download Blob from Azure.

2) Download Block signatures from Azure.

3) Modify the downloaded blob/file to your heart’s delight.

4) Now we need to determine which of the blocks already in the cloud can be reused (ie we don’t need to upload that data) and which parts have been modified.

5) Loop through every block signature downloaded:

5.1) Perform the rolling signature check over the entire new file.

5.2) If found, make note which Azure block can be reused.

5.3) If not found, make note of which bytes are “new” and need to be uploaded.

6) You now have 2 lists: one of new bytes to upload (offsets in the new file) and one of the Azure blocks that can be reused.

7) Upload the “new bytes” as their own Azure blob blocks.

8) Instruct Azure Blob Storage which blocks (new and old) can be used to construct a blob which is bitwise identical to the modified local file.
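Steps 7 and 8 map directly onto the PutBlock/PutBlockList operations of Azure Block Blob storage. Here is a very rough sketch using the classic WindowsAzure.Storage client library (this is not the BlobSync implementation; the block ids, offsets and file path are made up for illustration):

using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Blob;

class BlockUpdateSketch
{
    static void Main()
    {
        var account = CloudStorageAccount.Parse("UseDevelopmentStorage=true");   // placeholder
        var blob = account.CreateCloudBlobClient()
                          .GetContainerReference("mycontainer")
                          .GetBlockBlobReference("myblob");

        // Pretend the signature search decided block 1 was modified and that
        // blocks 0, 2 and 3 of the existing blob can be reused as-is.
        var newBytes = File.ReadAllBytes(@"c:\temp\myfile");     // placeholder path

        // Step 7: upload only the modified bytes as a new block.
        var newBlockId = ToBlockId(1);
        using (var ms = new MemoryStream(newBytes, 100, 50))     // made-up offset/length
        {
            blob.PutBlock(newBlockId, ms, null);
        }

        // Step 8: commit a block list that mixes reused (already committed)
        // blocks with the new one, in the order they appear in the new file.
        blob.PutBlockList(new List<string> { ToBlockId(0), newBlockId, ToBlockId(2), ToBlockId(3) });
    }

    // Block ids must be base64 encoded and all the same length.
    static string ToBlockId(int i)
    {
        return Convert.ToBase64String(Encoding.UTF8.GetBytes(i.ToString("d6")));
    }
}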

 

All of the above is implemented and available on Github and eventually on Nuget. The next post will cover how to practically use these libraries and what the future plans are.

btw, for anyone taking notes, yes the entire blog post could have been summarised as “RSync + Azure Block Blobs”…  but I thought I’d flesh it out a little 🙂

99 Problems part 3

I’ve finally got around to doing some more of the fun (??) “99 Problems”. I’m using the Scala version of the problems but am solving them in C#.

Problems 13 through to 17 are now done. All fairly simple (probably due to my naive solutions).

Problem 13 was of particular interest, with the problem requiring some run length encoding to be done. The problem stated is: given a list of ‘a’,’a’,’a’,’a’,’b’,’c’,’c’,’a’,’a’,’d’,’e’,’e’,’e’,’e’, calling the solution should result in an answer of 4:’a’, 1:’b’, 2:’c’, 2:’a’, 1:’d’, 4:’e’ (ie the number of times each letter appears contiguously).

Unless the LINQ solution is obvious to me, I prefer to code up a naive foreach/while answer, confirm I understand the problem, then “LINQ-ify” it. (I’ve found that so far most of the 99 problems are better solved with LINQ.) My naive solution was:

 

var res = new List<Tuple<int, char>>();

var count = 0;
char ch = ' ';

foreach (var i in l)
{
    if (i != ch)
    {
        if (ch == ' ')
        {
            // First character seen; start the first run.
            ch = i;
            count = 1;
        }
        else
        {
            // Character changed; record the completed run and start a new one.
            res.Add(new Tuple<int, char>(count, ch));
            ch = i;
            count = 1;
        }
    }
    else
    {
        // Same character as the last one; extend the current run.
        count++;
    }
}

// Don't forget the final run.
res.Add(new Tuple<int, char>(count, ch));

Yes, it could be shortened but you get the idea. Loop through, keeping track of last char etc. Nothing earth shattering here. Once I’d completed this version I tried for ages to come up with something that was shorter and LINQ based. One thing I did NOT want to do though is use LINQ but end up with (to me) unreadable/confusing code.

Simply, I could not. If anyone else can, I’d be interested in hearing about it.

btw, looking on StackOverflow I found an answer by Jon Skeet where he basically did the same as I did (but more elegantly). So I’m thinking maybe there is no neat LINQ way?

AzureCopy 0.15 out!

Another day, another version of AzureCopy is out (usual github, nuget and command links).

This version has dependencies updated to the latest and greatest (AWS client lib, Azure Storage lib etc). One drawback to updating the AWS client lib is that whenever we’re copying to/from S3 we now need to supply the AWS region the account is linked to. I haven’t found a way to determine that programmatically (yet), so for now the regions have to be entered into the App.config (or equivalent).

 

So there are 3 new parameters:

AWSRegion (eg. “us-west-1”)

SrcAWSRegion (source-specific version)

TargetAWSRegion (target-specific version)
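For reference, the corresponding App.config entries would look something like this (assuming the usual appSettings section; the region values are just examples):

<configuration>
  <appSettings>
    <!-- Default AWS region, plus optional source/target specific overrides. -->
    <add key="AWSRegion" value="us-west-1" />
    <add key="SrcAWSRegion" value="us-west-1" />
    <add key="TargetAWSRegion" value="eu-west-1" />
  </appSettings>
</configuration>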

 

I’ll try and remove the need to know this as soon as I can.

 

The other change is that copying between different Azure accounts now works properly (had broken at some stage). So now transferring between accounts is nice and quick.

Copying blobs between Azure accounts with AzureCopy

One feature in AzureCopy which was never fully implemented was being able to copy between 2 different Azure accounts while using the CopyBlob API. Everything was there *except* for the Shared Access Signature generation for the source URL. This has now been rectified. The fix has been checked into Github and will go into the next Nuget/command releases whenever they’re made.

So, from a practical point of view, you can now enter 2 AzureAccountKey “secrets”, one for the source and one for the target (these are called SrcAzureAccountKey and TargetAzureAccountKey, funnily enough). You can then issue the command:

 

azurecopy -i "https://account1.blob.core.windows.net/mycontainer/myhugeblob" -o "https://account2.blob.core.windows.net/myothercontainer/" -blobcopy

 

This will copy from account1 to account2 and use the fantastic Azure CopyBlob API which means your bandwidth will not be used (direct DC to DC copy).
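For the curious, the underlying idea is simply “generate a read-only Shared Access Signature for the source blob, then hand that URL to CopyBlob on the target account”. A standalone sketch of that using the WindowsAzure.Storage client library directly (not AzureCopy’s actual code; connection strings and names are placeholders):

using System;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Blob;

class CrossAccountCopySketch
{
    static void Main()
    {
        // Placeholder connection strings for the two accounts.
        var srcAccount = CloudStorageAccount.Parse(
            "DefaultEndpointsProtocol=https;AccountName=account1;AccountKey=key1");
        var dstAccount = CloudStorageAccount.Parse(
            "DefaultEndpointsProtocol=https;AccountName=account2;AccountKey=key2");

        var srcBlob = srcAccount.CreateCloudBlobClient()
                                .GetContainerReference("mycontainer")
                                .GetBlockBlobReference("myhugeblob");
        var dstBlob = dstAccount.CreateCloudBlobClient()
                                .GetContainerReference("myothercontainer")
                                .GetBlockBlobReference("myhugeblob");

        // Generate a short-lived, read-only Shared Access Signature for the source blob.
        var sas = srcBlob.GetSharedAccessSignature(new SharedAccessBlobPolicy
        {
            Permissions = SharedAccessBlobPermissions.Read,
            SharedAccessExpiryTime = DateTime.UtcNow.AddHours(1)
        });

        // Kick off the server-side CopyBlob operation; the data moves DC-to-DC
        // without touching local bandwidth. The copy runs asynchronously inside
        // Azure, so poll the target blob's CopyState (after FetchAttributes)
        // to track progress.
        dstBlob.StartCopyFromBlob(new Uri(srcBlob.Uri.AbsoluteUri + sas));
    }
}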

Copying to Dropbox with AzureCopy

I’ve been asked how to use the AzureCopy Nuget package to copy between Azure and Dropbox. Admittedly this is slightly easier to do with the AzureCopy command than with the Nuget assembly, so here’s a quick rundown on how to do it.

 

Firstly, you need to register AzureCopy (whether it’s the command or assembly).

 

AzureCopy command

For the command the process is very simple:

 

[Image: dropbox]

 

You simply need to run the command:

azurecopy -configdropbox

and follow the instructions.

The config file azurecopy.exe.config will be modified to include 2 new entries, DropBoxUserSecret and DropBoxUserToken. These 2 new entries are the secret/token for azurecopy.exe to be able to access your Dropbox files.

 

AzureCopy Nuget package

If you’re using the AzureCopy Nuget package, this means you’re creating a new application to access Dropbox (and other cloud storage services). To register your new app with Dropbox you’ll need to go to the registration portal. Once you’ve done that you’ll receive a DropBoxAPIKey and DropBoxAPISecret. These are yours (the developer’s) and shouldn’t be shared with the general public; for this reason I’ve set up the AzureCopy assembly to read these values from 2 locations. One is the app.config, which is fine if you (or your company) are the only ones using your application.

The other location these values are read from is embedded in the source code. Specifically, the class APIKeys has 2 public static string member variables, DropBoxAPIKey and DropBoxAPISecret. If the values cannot be read from the app.config they will be read from this class. This allows the distribution of the key/secret to other parties without making it obvious what the values are. Obviously if you’re purely using the Nuget package then modifying the source is not possible and the app.config is your only option. If you prefer the source code approach then github has all you need.

Now that you’ve got this far (remember, you’ll only need to do this once) you still need to go through steps similar to what we did above for the AzureCopy command. You’ll need to generate the Dropbox authorization url, have the user open it in a browser and allow your new application access to their Dropbox. To generate the authorization url you can simply use the static method DropboxHelper::BuildAuthorizeUrl, which returns the url as a string.
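In code, the authorization step amounts to something like the snippet below. I’m assuming a parameterless BuildAuthorizeUrl overload here, so treat it as a sketch and check the AzureCopy source for the exact signature in your version.

using System;
using azurecopy;    // AzureCopy assembly / Nuget package

class DropboxAuthSketch
{
    static void Main()
    {
        // Assumed parameterless call; the real signature may differ.
        string authUrl = DropboxHelper.BuildAuthorizeUrl();

        // The user opens this url in a browser and grants your registered app
        // access to their Dropbox; the resulting secret/token can then be
        // stored (e.g. as DropBoxUserSecret / DropBoxUserToken).
        Console.WriteLine("Please visit: " + authUrl);
    }
}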

This may sound like a lot of work but is really about 2 minutes of effort.

 

Blob Copy Code

Once you’ve registered your new app with the Dropbox portal and have authorized it, the code to copy between Azure (for example) and Dropbox is very straightforward.

The simplest case in C# is:

(assuming the Azure credentials have already been updated in the app.config)

 

// Create an Azure "handler" which is used to process all Azure requests.
var azureHandler = new AzureHandler();

// Create a Dropbox "handler", ditto.
var dropboxHandler = new DropboxHandler();

// Full url to the Azure blob.
var inputUrl = "https://dummyurl.blob.core.windows.net/myazureblob";

// Url of the Dropbox folder the blob will be copied to.
var outputUrl = "https://dropbox.com/mydropboxfolder/";

// Read the blob.
var blob = azureHandler.ReadBlob(inputUrl);

// Write the blob.
dropboxHandler.WriteBlob(outputUrl, blob);

 

Hey Presto!