Azure Storage Tools

After quite a number of comments asking for a consistent set of tools to perform basic operations on Azure Storage resources (blobs, queues and tables), I’ve decided to write a new set of tools called “Azure Storage Tools” (man, I’m original with naming; I should be in marketing).

The primary aims of AST are:

  • Be a cross-platform set of tools, so there is one consistent tool across a bunch of platforms.
  • Be able to do all the common operations for blobs/queues/tables.
    • eg. for blobs we should be able to create containers, upload/download blobs, list blobs etc. You get the idea.
  • Be easy to use: a simple tool with a simple set of parameters, where the commands are fairly “guessable”.

Instead of a single tool that provides blobs, queues and tables in one binary, I’ve decided to split this into 3, one for each type of resource. First cab off the rank is for blobs!  The tool/binary name is astblob (again, marketing GENIUS at work!!).

If a picture is worth a thousand words, here’s 3 thousand words’ worth.

 

listcontainers

Firstly, in this image we have 3 machines all running astblob: a Windows machine, a Linux machine and an OSX machine. Each of them is connected to the same Azure Storage account (configured through environment variables), and we’re simply asking for the containers in that account.
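Under the hood this boils down to a single “list containers” call against the storage account. Just to illustrate the kind of operation involved, here is a rough sketch using the older github.com/Azure/azure-sdk-for-go/storage package; this is not astblob’s actual code, and the environment variable names are made up for illustration.

    package main

    import (
        "fmt"
        "log"
        "os"

        "github.com/Azure/azure-sdk-for-go/storage"
    )

    func main() {
        // Credentials come from environment variables. The variable names here
        // are made up for illustration; astblob's actual names may differ.
        client, err := storage.NewBasicClient(os.Getenv("ACCOUNT_NAME"), os.Getenv("ACCOUNT_KEY"))
        if err != nil {
            log.Fatal(err)
        }

        // Ask the account for its containers and print their names.
        blobService := client.GetBlobService()
        resp, err := blobService.ListContainers(storage.ListContainersParameters{})
        if err != nil {
            log.Fatal(err)
        }
        for _, container := range resp.Containers {
            fmt.Println(container.Name)
        }
    }

The same code (and the same environment variables) behaves identically on Windows, Linux and OSX, which is really the whole point of astblob.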

Easy enough. Now, for some more details.

listcontainercontents

Here we’re asking for the blob contents of the “temp” container. Remember that Azure (like S3) doesn’t really have the concept of directories within containers/buckets. Blob names can contain ‘/’ to “fake” directories, but really the ‘/’ is just part of the blob name.

So now what happens if we download the temp container?

 

download

Here we download the temp container to some place on the local filesystem. You’ll see that the blob whose name contained “fake” directories has had those directories created for real (executive decision made there… by FAR I believe this is what most people want). In both the Windows and *nix environments the “ken1” directory was created, and within it (although not shown in the screenshot) are the files themselves.
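The general idea behind that executive decision looks something like this. A minimal sketch using only the standard library; this is not astblob’s actual implementation.

    package main

    import (
        "fmt"
        "os"
        "path/filepath"
    )

    // localPathForBlob turns a '/'-delimited blob name into a real local path,
    // creating the "fake" directories on disk as it goes.
    func localPathForBlob(downloadRoot, blobName string) (string, error) {
        // "ken1/test1" becomes "<root>/ken1/test1" (or "<root>\ken1\test1" on Windows).
        localPath := filepath.Join(downloadRoot, filepath.FromSlash(blobName))
        if err := os.MkdirAll(filepath.Dir(localPath), 0755); err != nil {
            return "", err
        }
        return localPath, nil
    }

    func main() {
        p, err := localPathForBlob("downloaded", "ken1/test1")
        if err != nil {
            fmt.Println("error:", err)
            return
        }
        fmt.Println("blob contents would be written to:", p)
    }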

astblob is just the first tool from the AST suite to be released. The plan is for the Queue and Table tools to follow shortly, along with more Blob functionality.

The download for AST is at Github and binary releases are under the usual releases link there. Binaries are generated for Windows, OSX, Linux, FreeBSD, OpenBSD and NetBSD (all with 386 and AMD64 variants) although only Windows, Linux and OSX have been tested by me personally.

The AST tools (although 3 separate binaries) will all be self-contained. The astblob binary is literally a single binary; no associated libs need to be copied along with it.

Before anyone comments: yes, the official Azure CLI 2.0 handles all of the above and more, but it has more dependencies (rather than being a single binary) and is also a lot more complex. AST is just aimed at simple/common tasks.

Hopefully more people will find a consistent tool across multiple platforms useful!

AzureCopy Go now with added CopyBlob flag

Azurecopy (Go version) 0.2.2 has now been released. The major benefit is that when copying to Azure we can now use the absolutely AWESOME CopyBlob functionality Azure provides. This allows blobs to be copied from S3 (for example) to Azure without having to go via the machine executing the instructions (and using my bandwidth!).

An example of copying from S3 to Azure is as simple as:

azurecopycommand_windows_amd64.exe -S3DefaultAccessID="S3 Access ID" -S3DefaultAccessSecret="S3 Access Secret" -S3DefaultRegion="us-west-2" -dest="https://myaccount.blob.core.windows.net/mycontainer/" -AzureDefaultAccountName="myaccount" -AzureDefaultAccountKey="Azure key" -source=https://s3.amazonaws.com/mybucket/ -copyblob

The key thing is the -copyblob flag. This tells AzureCopy to do its magic!

By default AzureCopy-Go will copy 5 blobs concurrently so it doesn’t overload your own bandwidth, but when using the Azure CopyBlob feature feel free to crank that setting up using the -cc flag (eg. add -cc=20).

Let’s GO OS crazy with AzureCopy-Go

Being able to copy from one cloud provider to another is useful, but if everything is purely serial (ie. one blob at a time) the time taken to copy everything might be less than stellar. I’ve now released a new version of AzureCopy-Go (0.2.1) which allows concurrent copying of blobs. The default is 5, but using the -cc flag (concurrent copying) it can be expanded up to 1000 (an arbitrary max limit). So far, so good!
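The usual Go pattern for this kind of bounded concurrency is a buffered channel acting as a semaphore. Here is a minimal sketch of the idea behind the -cc flag; it is illustrative only, not AzureCopy-Go’s actual code.

    package main

    import (
        "fmt"
        "sync"
        "time"
    )

    // copyBlob stands in for the real copy operation.
    func copyBlob(name string) {
        time.Sleep(200 * time.Millisecond)
        fmt.Println("copied", name)
    }

    func main() {
        blobs := []string{"blob1", "blob2", "blob3", "blob4", "blob5", "blob6", "blob7"}

        concurrentCopies := 5 // the equivalent of -cc=5
        sem := make(chan struct{}, concurrentCopies)
        var wg sync.WaitGroup

        for _, b := range blobs {
            wg.Add(1)
            sem <- struct{}{} // blocks while 'concurrentCopies' copies are already in flight
            go func(name string) {
                defer wg.Done()
                defer func() { <-sem }()
                copyBlob(name)
            }(b)
        }
        wg.Wait()
    }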

Also, for this release I’ve built the binaries using the AMAZING Gox project. This allows for easy cross compiling for Go. So we now have Linux, FreeBSD, NetBSD, OpenBSD, Darwin (MacOS) and Windows binaries. For the most part we have 3 variations of each platform binary, one each for ARM, AMD64 and 386.

I knew how to get cross compiling with Go working on Linux/MacOS but could never get it working on Windows (my current main OS). Gox is definitely a time saver and is so damned easy to use.
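For anyone who hasn’t tried it, a typical Gox run looks something like the following (the platform list is up to you; the file naming you see in the releases, eg. azurecopycommand_windows_amd64.exe, is just Gox’s default naming scheme):

    go get github.com/mitchellh/gox
    gox -os="windows linux darwin freebsd netbsd openbsd" -arch="386 amd64"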

Please give AzureCopy-Go 0.2.1 a try if you have any S3 <-> Azure migration needs. More features are being worked on every few days.

Azurecopy (GO version) pre-release

As mentioned in previous posts, I’ve been writing a Go version of AzureCopy so people would have something that works cross-platform (Linux, MacOS and Windows). Today I’ve released the first pre-release just to test the waters. It supports Windows only (simply because I haven’t compiled up the other platforms yet), and only supports the local filesystem, S3 and Azure Blob Storage.

Baby steps.

The plan is to build for Linux and OSX, then start adding other cloud platforms. Meanwhile the original Azurecopy (Windows only, full .NET Framework) will still be developed (mainly from a Nuget/library point of view). If you just need an executable to perform copying, then I suggest using this newer version.

Some examples of using this newer version:

image1

In this case we’re just listing the contents of my testken123 (super secret) bucket. My AccessID and AccessSecret are passed in via command line options. The output is a basic tree structure (I’ll add a bog-standard list format soon). In the above case, the top of the tree is “testken123”, which is the bucket name. Under that we have 2 virtual directories (remember Azure/S3 etc do not really have directories but fake it by using / as a delimiter). In this case we see there is a blob called “ken1/test1”, which treats the “ken1” part as a directory and “test1” as the blob name. The same applies for all the other results. Simple enough.
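For the curious, building that tree view is really just a matter of splitting each blob name on ‘/’ and merging the shared prefixes. A rough sketch of the idea follows; it is not azurecopy’s actual code.

    package main

    import (
        "fmt"
        "sort"
        "strings"
    )

    // node is one level of the virtual directory tree.
    type node map[string]node

    // insert splits a blob name on '/' and merges it into the tree.
    func insert(root node, blobName string) {
        cur := root
        for _, part := range strings.Split(blobName, "/") {
            if cur[part] == nil {
                cur[part] = node{}
            }
            cur = cur[part]
        }
    }

    // printTree prints each level indented under its parent.
    func printTree(n node, depth int) {
        names := make([]string, 0, len(n))
        for name := range n {
            names = append(names, name)
        }
        sort.Strings(names)
        for _, name := range names {
            fmt.Printf("%s%s\n", strings.Repeat("  ", depth), name)
            printTree(n[name], depth+1)
        }
    }

    func main() {
        root := node{}
        for _, blob := range []string{"ken1/test1", "ken1/test2", "ken2/other"} {
            insert(root, blob)
        }
        fmt.Println("testken123")
        printTree(root, 1)
    }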

Then we have:

image2

In this case we’re copying from my local filesystem (c:\temp\data\s3\) into the S3 bucket testken123. The console output is just to show what is going to be copied. The output will eventually be modified to show progress.

Finally we have:

image3

That’s copying from Azure Blob Storage to S3. Same deal, basic output.

For every command it is possible to pass the “-debug” flag. This makes things VERY verbose but is extremely useful for figuring out issues.

This is just a first step, pre-release, uber new version. Please give it a go and let me know if there are any issues. The plan is to start cranking out changes pretty frequently.

0.1.0 version

AzureCopy GO

The Go version of AzureCopy is slowly making progress. So far I’ve just been focusing on the local filesystem and Azure (since I can do those while offline on the train commute thanks to the Azure Storage Emulator). The next plan is S3 integration, primarily because S3 -> Azure seems to be the big use case for the original AzureCopy.

I’m planning on frequent releases once the basic S3 code is added (hopefully within the next few days). Not all features from the original AzureCopy will be available; I will simply be focusing on 1) listing content and 2) copying content. There will be a few new additions such as a “don’t overwrite” flag so copies can be continued after being stopped (this has been requested by a few people).

Of course, the original AzureCopy will still be developed (mainly from a Nuget packaging point of view), but if you just need a command line tool to copy (and maybe need it on multiple platforms) then this new version is probably the way to go.

Hopefully the S3 code will drop in a few days then I’ll have a first binary release for Linux, MacOS and Windows, and see how things proceed from there.

Adventures in GO!

I’ve dabbled (ok ok, writing and rewriting “hello world” many times) in Go for a few years but have never really given it a serious go (boom boom!). But after buying Go in Action and going through a number of great Pluralsight courses (particularly by Nigel Poulton and Mike Van Sickle) I’ve decided to give it another crack.

Instead of going through various tutorials I’ve decided to try porting (well more likely rewriting from scratch) my AzureCopy project. The original AzureCopy is all C# running on the .NET Framework 4.*. Although I DO (well did until recently) want to get it migrated to DotNET Core I thought it would be a good chance to learn Go PROPERLY.

I’m still trying to get my head around OO in a “kinda-is, kinda-isn’t, sorta, maybe” OO language like Go. Going back to structs (ahh, the glory days of C/C++) and interfaces, and having the magic of pointers again, is really giving me a nostalgia kick.
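For anyone who hasn’t seen Go’s take on OO, the struct + interface combination looks roughly like this. The type names here are made up for illustration and are not AzureCopy’s actual types.

    package main

    import "fmt"

    // CloudStorage is a made-up interface for illustration; AzureCopy's real types differ.
    type CloudStorage interface {
        ListBlobs(container string) []string
    }

    // AzureHandler satisfies CloudStorage simply by having the right method;
    // there is no 'implements' keyword in Go.
    type AzureHandler struct {
        AccountName string
    }

    func (a *AzureHandler) ListBlobs(container string) []string {
        return []string{"ken1/test1", "ken1/test2"} // dummy data
    }

    func main() {
        var store CloudStorage = &AzureHandler{AccountName: "myaccount"}
        for _, blob := range store.ListBlobs("temp") {
            fmt.Println(blob)
        }
    }

Anything with the right method set satisfies the interface, which maps quite nicely onto a “one handler per cloud provider” design.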

The rough outline for this AzureCopy rewrite is basically as follows:

  • Get my dev environment sorted out (currently VSCode)
  • Basic solution structure sorted, rough architecture
  • Be able to copy to/from the local filesystem to Azure Blob Storage
  • List blobs/containers in Azure
  • Add S3
  • Add DropBox
  • Add OneDrive

I really don’t think I’ll bother with Sharepoint this time around; it was a bitch to maintain in the existing version.

I’m unsure what the Go support is like for those cloud providers. I know the Azure one seems mostly there (well, for the stuff I need) but I get the distinct impression it’s the poor cousin to .NET, Java, Python etc. I’ve yet to investigate S3’s Go offerings. If these libs aren’t in great shape I might finally get a chance to get my name on a contributors list somewhere.

I’m sure my Go will suck… but I’m hoping it will get better. The new version of AzureCopy is of course on Github.

Dropbox and direct links

During some refactoring of AzureCopy I’ve decided to finally add Azure CopyBlob support for Dropbox. This means that you can run a command locally to copy from Dropbox to Azure Blob Storage and none of the traffic actually goes through the machine where AzureCopy is running, which means huge bandwidth/speed savings!

The catch is that it appears (I’ve NOT fully confirmed this yet) that Azure CopyBlob doesn’t like redirection URLs, which is what I was receiving from Dropbox. I was generating a “shared” URL for a particular Dropbox file, which in turn generates an HTTP 302 redirection and then gives me the real URL. Azure CopyBlob doesn’t play nicely with this. The trick is to NOT generate a “shared” URL but to generate a “media” URL instead. Quoting from the Dropbox API documentation: “Similar to /shares. The difference is that this bypasses the Dropbox webserver, used to provide a preview of the file, so that you can effectively stream the contents of your media.”
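If you want to check for yourself whether a URL is served via a redirect (as the Dropbox “shared” links were) you can stop Go’s HTTP client from silently following them. A small diagnostic sketch; the URL is a placeholder and this is not part of AzureCopy itself.

    package main

    import (
        "fmt"
        "net/http"
    )

    func main() {
        // Stop the client from silently following redirects so we can see them.
        client := &http.Client{
            CheckRedirect: func(req *http.Request, via []*http.Request) error {
                return http.ErrUseLastResponse
            },
        }

        resp, err := client.Get("https://example.com/some-shared-link") // placeholder URL
        if err != nil {
            fmt.Println("error:", err)
            return
        }
        defer resp.Body.Close()

        if resp.StatusCode >= 300 && resp.StatusCode < 400 {
            fmt.Println("redirects to:", resp.Header.Get("Location"))
        } else {
            fmt.Println("served directly with status:", resp.Status)
        }
    }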

Once I made that change, hey presto, no more redirects and Azure CopyBlob is now a happy little ummm “thing”.

The upshot is that now I can migrate a tonne of data from Dropbox to Azure without using up any of my own bandwidth.

Woohoo!

Iceberg Example

In my previous post I examined how to use BlobSync to create a tool that not only uploads/downloads deltas to Azure Blob Storage (and hence saves LOTS of bandwidth) but also easily keeps multiple versions in the cloud.

As a sample file for uploading/downloading I’ve picked the entire Sherlock Holmes collection. Big enough that it can show the benefits of dealing with deltas for bandwidth savings, but small enough that it can be easily edited (text).

Firstly, I perform the original upload.

image

Here you can see that the original sherlock file is about 3.6MB, and for the initial upload the entire file is transferred (indicated by the “Uploaded 3868221 bytes” message).

Then I list the blobs and it shows I only have 1 version (called “sherlock” as expected).

iceberg2

Now, I edit the sherlock file and modify a few lines here and there, and reupload it.

 

iceberg3

We can instantly see that this time the upload only transferred 100003 bytes, which is about 2.6% of the original file size. A nice saving.

Then we list the blobs associated with “sherlock” again. This time we see 2 versions:

  • sherlock 8/01/2015 11:36:09 AM +00:00
  • sherlock.v1 8/01/2015 11:36:01 AM +00:00

Here we see sherlock and sherlock.v1. The original sherlock blob that was uploaded has been renamed to sherlock.v1, and the newly uploaded version is now the vanilla “sherlock” blob.

Note: The timestamps still need a little work. The ones displayed are when blobs were copied/uploaded. This means that sherlock.v1 doesn’t have the original timestamp when sherlock was originally uploaded but when it was copied from sherlock to sherlock.v1. But I can live with that for the moment.

Now, say I realise that I really want a copy of the original sherlock. The problem is that my local version has been modified. No problem: I can now update my local file with the contents of sherlock.v1 (remember, that’s the original one I uploaded).

iceberg4

The download was 99k (again, not the 3.6MB of the full file). In my case c:\temp\sherlock is now updated to be the same as the blob sherlock.v1 (ie. the original file). How can I be sure?

Well, I happen to have a spare copy of the original sherlock file on my machine (c:\temp\sherlock-orig), and you can see from my file compare (fc.exe) that the original sherlock and my newly updated local copy are the same.

Now I can upload/download deltas AND have multiple versions available to me for future reference.

So, what happens with all the backups I don’t want? Well, you can always load up any Azure Storage Explorer program and delete the blobs you don’t want. Or you can use Iceberg to prune them for you.

Say I’ve created a few more versions of sherlock.

iceberg5

But I’ve decided that I only want to keep the latest 2 backups (ignoring the most current one), ie. I want to keep sherlock, sherlock.v2 and sherlock.v3.

I can issue the prune command as such:

iceberg6

Here I tell it to prune all but the latest 2 backups of the sherlock blob. I list the blobs afterwards and you can indeed see that, apart from the latest (sherlock), there are only the 2 latest backups.

I’m starting to look at using this for more of my own personal backups. Hopefully this may be of use to others.

Versioned backups using BlobSync

As previously described, the BlobSync library (Github, Nuget, Blog) can be used to update Azure block blobs without having to upload the entire file/blob. It performs an intelligent delta calculation and uploads the minimal data possible.

So, what’s next?

To show possible use cases for BlobSync, this post will outline how it is easily possible to create a backup application that not only uploads the minimal data required but also keeps a series of backups so you can always restore a previously saved blob.

The broad design of the program is as follows:

  • Allow uploading (updating) of blobs.
  • Allow downloading (updating of local files) of blobs.
  • Allow multiple versions of blobs to exist and prune what we don’t want.

For this I’m using Visual Studio 2013; other versions may work fine but YMMV. The version of BlobSync I’m using is the latest available at the time of writing (0.3.0) and can be installed through Nuget as per any other package (for those who are new to Nuget, please see the Nuget documentation).

Of the three requirements listed above only the last one really adds any new functionality beyond BlobSync. For the upload/download I really am just using a couple of equivalent methods in BlobSync. For the multiple versions we need to figure out which approach to use.

What I decided on (and it has been working well) is that for updating an existing blob, the following process is used:

  • Each blob will have a piece of metadata which has the latest version number of the blob
  • On upload, the existing blob is copied to another blob with the name <original blob name>.v.<latest version number> (along with its paired signature blob).
  • New delta is uploaded against existing blob.

For example, say we have a blob called “myfile”. This means we also have a “myfile.0.sig” which is the paired signature blob.

When we upload a new version of myfile the following happens:

  • copy myfile to myfile.v.1
  • copy myfile.0.sig to myfile.v.1.0.sig
  • upload delta against myfile

This means that myfile is now the latest version and myfile.v.1 is the version that previously existed. If we repeat this process then again myfile will be the latest, and what used to be myfile will now be myfile.v.2, and so on. It should be noted that the copying of the blobs is performed by the brilliantly useful Azure CopyBlob API, which lets Azure copy the blob itself and doesn’t require any traffic between the application and Azure Blob Storage. This is a BIG time saver!
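To make the renaming scheme concrete, here is a tiny sketch of the naming logic described above. It is illustrative only, and in Go rather than the C# the actual BlobSync example uses.

    package main

    import "fmt"

    // backupNames returns the names the existing blob (and its paired signature
    // blob) get copied to before the new delta is uploaded.
    func backupNames(blobName string, latestVersion int) (blobCopy, sigCopy string) {
        next := latestVersion + 1
        blobCopy = fmt.Sprintf("%s.v.%d", blobName, next)
        sigCopy = fmt.Sprintf("%s.v.%d.0.sig", blobName, next)
        return
    }

    func main() {
        // First update: myfile -> myfile.v.1, myfile.0.sig -> myfile.v.1.0.sig
        fmt.Println(backupNames("myfile", 0))

        // Second update: what is currently myfile becomes myfile.v.2, and so on.
        fmt.Println(backupNames("myfile", 1))
    }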

Now that we have myfile, myfile.v.1 and myfile.v.2, we should also be able to use this new project to download any version of the file. More importantly, we should be able to download just the deltas to reduce bandwidth usage (since that is the aim of the game).

So that’s the high level design I have in mind… next you might want to look at the implementation.

BlobSync and Sigexplorer updates!

Both BlobSync (Nuget and binary release) as well as Sigexplorer have been updated with some nice improvements.

 

BlobSync now has parallel uploading of the binary deltas to Azure Blob Storage. It sounds like an obvious improvement (which I’ll continue to expand/improve), but I wanted to make sure all the binary delta edge cases were working before adding tasks/threads into the mix. Currently the parallel factor is only 2 (this will soon be configurable) but it’s enough to prove it works. There have been some very tough bugs to squash since the 0.2.2 release, particularly around very small adjustments (a byte or two) at the end of files being updated. These were previously being missed; this is now fixed.

A small design change is how BlobSync uses small signatures when trying to determine how to match against new content. The problem is deciding when we should and should NOT reuse small signatures.

For example (sorry for dodgy artwork), say we have a blob with some small signatures contained in it:

 

uupdate2

Then we extend the blob and during the update process we need to see if we have any existing signatures that can be reused in the new area:

 

uupdate3

uupdate4

The problem we have is that if these small signatures are only a few bytes in size and they’re trying to find matches in the new area (yellow), there is a really good chance that they’ll get a match. After all, there are only 256 values to a byte! So what we’ll end up with is a new area that is potentially reusing a lot of small signatures instead of making a new block/signature and uploading the new data. Now, strictly speaking we usually want to reuse as many signatures/blocks as we can, but the problem with using so many tiny blocks is that we’ll soon fragment our blobs so much that we’ll end up not being able to update properly. Don’t forget a blob can consist of 50000 blocks maximum.

So BlobSync 0.3.0 has added a rule: we’ll only attempt a match if the byte range we’re looking at (yellow above) is greater than 1000 bytes and the block/signature we’re looking at is greater than 100 bytes, OR if the byte range and the signature are exactly the same size. This way we’ll hopefully reduce the level of fragmentation and only increase the volume of data being uploaded by a small percentage.
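Expressed as code, the rule is nothing more than a small predicate along these lines (an illustrative sketch, not BlobSync’s actual implementation):

    package main

    import "fmt"

    // shouldAttemptMatch applies the 0.3.0 rule: only try to reuse a signature
    // inside a byte range if both are comfortably large, or if they are exactly
    // the same size.
    func shouldAttemptMatch(byteRangeSize, signatureSize int) bool {
        if byteRangeSize > 1000 && signatureSize > 100 {
            return true
        }
        return byteRangeSize == signatureSize
    }

    func main() {
        fmt.Println(shouldAttemptMatch(5000, 200)) // true: both are large enough
        fmt.Println(shouldAttemptMatch(500, 10))   // false: tiny signature, small range
        fmt.Println(shouldAttemptMatch(64, 64))    // true: exact size match
    }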

 

Sigexplorer has also been improved for viewing the signatures being generated. Instead of rendering all signatures at once in the tree structure, it now simply populates the “branches” as the user clicks on them. This reduces the load time significantly and makes the entire experience much quicker.