A colleague recently asked me the best way to transfer a large amount of S3 data (> 50TB) onto Azure Blob Storage. I’d always thought that if the need ever arose I’d probably be doing one of three things:
- Send HDDs to Amazon and have the data transferred onto them, then send the HDDs to the Azure guys to complete the upload.
- Use some little tool that reads S3 blobs and writes to Azure Blobs.
- Write some tool if the existing ones didn’t do what I wanted.
Turns out option 1 has a couple of “gotchas”. Firstly, buying drives and getting Amazon to write the data to them is rather expensive. Secondly, Azure doesn’t provide a bulk upload feature (that I’m aware of). So scratch #1.
For the second option, I’m aware that Microsoft unofficially provides a tool to do this (AzCopy), but it doesn’t have all the options I require. Yes, I could pester (umm, humbly request) the guy who maintains AzCopy to add new features, but being a coder I prefer option 3. (Besides, I don’t have access to the AzCopy source myself, so I can’t extend it.)
For option 3 I’ve decided to start from scratch using C#. Currently it has a number of functioning features as well as a larger number of planned features.
Currently it can copy between Azure and S3 (in either direction) and can handle signed urls (i.e. the urls don’t have to be public). Although my primary aim is to help people move from S3 to Azure, either direction is possible.
The most basic command is:
azurecopy -i inputurl -o outputurl
This will download the blob from the input url to the local machine and then upload it to the output url (I should probably rename that to destination url). This works, but is cumbersome. By default, it will store blobs in memory (fine for small/medium blobs, but obviously not a good idea once we’re talking hundreds of megabytes). To address this, we can modify the command to be:
azurecopy -d "c:\temp\tempblobs" -i inputurl -o outputurl
This will download a copy of the blob into c:\temp\tempblobs. During upload it will upload from this file location. Currently it doesn’t clean up this download directory.
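The difference between the two buffering modes can be sketched roughly as follows. This is a Python illustration, not azurecopy’s actual C# code: `copy_blob`, `temp_dir` and `CHUNK_SIZE` are hypothetical names, and the source and destination are plain streams standing in for real S3/Azure clients.

```python
import io
import shutil
import tempfile

CHUNK_SIZE = 1024 * 1024  # copy in 1 MiB chunks (arbitrary choice)


def copy_blob(source, destination, temp_dir=None):
    """Copy one blob from `source` to `destination` via a local buffer.

    With temp_dir=None the whole blob is held in memory; otherwise it is
    written to a temporary file inside temp_dir. Returns the byte count.
    """
    if temp_dir is None:
        buffer = io.BytesIO()                           # in-memory buffer
    else:
        buffer = tempfile.TemporaryFile(dir=temp_dir)   # on-disk buffer

    with buffer:
        shutil.copyfileobj(source, buffer, CHUNK_SIZE)       # "download"
        size = buffer.tell()
        buffer.seek(0)
        shutil.copyfileobj(buffer, destination, CHUNK_SIZE)  # "upload"
    return size
```

Note that this sketch removes its temporary file when the copy finishes, whereas azurecopy currently leaves the downloaded files in place.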
Another obvious shortfall is that if we’re copying from some other location to Azure (this particular scenario is focused on Azure), then we don’t really need to copy from the source url to the local machine and then from the local machine to Azure. Fortunately, for this scenario Azure provides the wonderfully useful CopyBlob API. Essentially you tell Azure where the source of the blob is (the S3 url) and where the destination is (the Azure url), and then leave it to Azure to do the copying directly. For example, we can do:
azurecopy -blobcopy -i s3url -o azureurl
This will return immediately and currently does NOT check whether the copy has completed (this will soon be rectified). What it does do is free up the bandwidth that would otherwise have been used between the cloud environments and the local machine.
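For the curious, here is a rough Python sketch of what a CopyBlob call looks like at the REST level, as I understand the API. It only builds the request and interprets the status value, it doesn’t send anything. The function names are hypothetical, the destination url is assumed to carry its own authorisation (e.g. a shared access signature), the source url is assumed to be readable by Azure (public or signed), and the version string is my assumption of the first service version that supports this kind of copy.

```python
def build_copy_blob_request(source_url, destination_url):
    """Return (method, url, headers) for an Azure Copy Blob request.

    Assumes destination_url already includes whatever authorisation it
    needs, so no Authorization header is built here.
    """
    headers = {
        "x-ms-copy-source": source_url,   # where Azure should pull from
        "x-ms-version": "2012-02-12",     # assumed minimum service version
        "Content-Length": "0",            # the request itself has no body
    }
    return "PUT", destination_url, headers


def copy_succeeded(copy_status):
    """Interpret an x-ms-copy-status value from the blob's properties.

    Returns True on success, False while still pending, and raises if
    the copy was aborted or failed.
    """
    status = copy_status.lower()
    if status == "success":
        return True
    if status == "pending":
        return False
    raise RuntimeError("copy ended with status: " + status)
```

A completion check would then just be a loop that fetches the destination blob’s properties every so often and passes the `x-ms-copy-status` header value to `copy_succeeded` until it returns True.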
In addition to copying individual blobs, you can also copy many blobs in one command. If the input url is a directory, we’ll list all blobs in that container/bucket and copy them all:
azurecopy -i https://mys3.amazonaws.com/ -o https://testazure.windows.net/mycontainer
In this case the input url ends in ‘/’ which means list all blobs in this container/bucket.
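The trailing ‘/’ rule can be sketched like this in Python. Again, this is an illustration and not the tool’s real code: `is_container_url` and `plan_copies` are hypothetical names, the list of blob names is assumed to come from a separate listing call, and the urls are assumed to have no query string (so signed urls would need extra handling).

```python
from urllib.parse import quote, urlsplit


def is_container_url(url):
    """A url whose path ends in '/' is treated as a container/bucket."""
    return urlsplit(url).path.endswith("/")


def plan_copies(input_url, output_url, blob_names):
    """Return a list of (source_url, destination_url) pairs to copy.

    For a single-blob input url the result is just that one pair.
    For a container url, each listed blob name is appended to both
    the input url and the output url.
    """
    if not is_container_url(input_url):
        return [(input_url, output_url)]

    dest_base = output_url.rstrip("/") + "/"
    return [
        (input_url + quote(name), dest_base + quote(name))
        for name in blob_names
    ]
```

Each resulting pair can then be handed to whichever copy mode is in use (local buffer or CopyBlob).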
The source for azurecopy is available here and binaries are available here
WARNING: This project was only started about 3 days ago and has “works for me” status. I’ll continue developing and enhancing it as required, and it is free to use.