AzureCopy and Virtual Directories.

AzureCopy has finally had some love and has been updated as per some requests that have come in. Firstly, virtual directories in S3 and Azure Blob Storage are now handled in a consistent manner.

Remember, neither S3 nor ABS really has directories. They just use blob names with the ‘/’ character in them, and various tools out on the interweb use that to simulate directories.

Now, copying files between S3 and ABS has always been easy, but what if you want to utilise these virtual directories?

e.g. I have a blob on S3 called “dir1/subdir1/file1” and I want to copy that to Azure (or elsewhere for that matter), but I want the destination on Azure to be in my temp container and the resulting blob to just be called “subdir1/file1”.

In this example we’re effectively copying a subdirectory and its file from S3 to Azure. Remember, there is no spoon directory.

Now we can perform the command:

azurecopy.exe -i https://s3.amazonaws.com/mybucket/dir1/ -o https://myacct.blob.core.windows.net/temp/

The result is that in my Azure container (temp) I’ll have a blob called “subdir1/file1”.

In addition, you can now copy these blobs with virtual directories to and from Dropbox but in this case it will make/read REAL directories.

Azurecopy is available via Nuget, command line executable and source.

DocumentDB Stored Procedure vs User Defined Function follow up.

It was pointed out to me that in my previous post my SProcs and UDFs were performing calls to toLowerCase() and that this could be slowing down my results. After a bit of modifying the documents in my DocumentDB collection I now have 2 properties: one called “Message”, which is the original text, and one called MessageLower (can you guess?). Now no toLowerCase() needs to be called when doing comparisons. The change in execution time is minuscule. For the SProc (which was fine performance-wise) the average execution time went down about 3-4ms (still around 130-140ms). Now maybe for some systems 3-4ms is critical, but for the system I’m working on, doubling the size of the documents to shave off 3-4ms isn’t a worthwhile trade-off.
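
For context, the document shape after that change is roughly this (a sketch only; the class mirrors the MessageEntity used in the original post below):

public class MessageEntity
{
    // Original text, exactly as generated.
    public string Message { get; set; }

    // Pre-lowered copy so comparisons can skip toLowerCase() at query time.
    public string MessageLower { get; set; }
}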

I’m definitely glad it was suggested (was on my todo list regardless) and that I tested it. At least now I know that in the big scheme of things, toLowerCase() doesn’t have *that* much of an impact.

DocumentDB Stored Procedures and User Defined Functions

Ok, time for a confession. I’m not a fan of stored procedures (SProcs) within databases. I try not to be a hammer coder (one tool/language for all problems) but it’s more the fact that I personally find SQL SProcs harder to read, develop and debug. Give me a nice C# source base any day. Now, I DO realise the proposed benefits of SProcs: do all data calculations on the server BEFORE they get returned to the application. Fewer bytes on the wire, quicker to transmit etc. BUT… when the DB and the application are co-located (both within the same Azure location) do we REALLY need to worry about data transfer between DB and app? Well, maybe. Depending on whether we’re talking about HUGE volumes of traffic or not. For now, I’m assuming not.

I’m not the NSA…

(or am I?)

Once I learned that DocumentDB was also introducing SProcs, I was VERY concerned that again I would get involved with a source base that has a huge volume of SProcs that would be hard to debug/deploy.

Remember, I’m highly biased AGAINST SProcs, but all the measuring/testing I’ll be doing for this blog post will be as unbiased as possible.

The simple scenario I’m looking at is searching a collection of documents for a particular term within a particular property (just to keep it easy). Each of these properties consists of 100 randomly selected words.

All of these tests are based on the compute and docdb being co-located in the same geo region.

So firstly, what does the SProc look like?

function (searchTerm) {
    var context = getContext();
    var collection = context.getCollection();
    var response = context.getResponse();

    var returnMessages = [];

    var lowerSearchTerm;
    if (searchTerm) {
        lowerSearchTerm = searchTerm.toLowerCase();
    }

    GetDocuments(callback);

    function GetDocuments(callback) {
        var qry = 'SELECT c.Message FROM root c';
        var done = collection.queryDocuments(collection.getSelfLink(), qry, { pageSize: 1000 }, callback);
    }

    function callback(err, documents, responseOptions) {
        var messages = documents;

        // Case-insensitive substring match against each returned message.
        for (var i = 0; i < messages.length; i++) {
            var message = messages[i];
            if (message.Message && message.Message.toLowerCase().indexOf(lowerSearchTerm) > -1) {
                returnMessages.push(message);
            }
        }

        response.setBody(JSON.stringify(returnMessages));
    }
}

This is fairly straightforward: get all the documents, hand them to the callback, then search each message by converting it to lowercase and performing an indexOf.
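
Just for reference, calling a registered SProc from the client SDK looks roughly like this (a minimal sketch; sprocLink is a placeholder for the stored procedure’s self link, and the string return type matches the JSON set via response.setBody above):

// Execute the stored procedure with the search term as its single parameter.
// Because the SProc calls response.setBody(JSON.stringify(...)), we ask for a string back.
var sprocTask = DocClient.ExecuteStoredProcedureAsync<string>(sprocLink, "mysearchterm");
sprocTask.Wait();

// JSON array of matching messages, ready to deserialize.
var matchesJson = sprocTask.Result.Response;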

Now, my initial test data consisted of 1000 documents, 10 of which had my magical search term. The results were:

394ms
127ms
148ms
143ms
117ms

The initial query was ALWAYS far longer… I assume something is warming up/compiling/caching etc, but I thought I’d include it in the results anyway.

Ok, so 1000 docs, searching for my term, about 117-148ms for the most part. Cool, I can live with that.

Now, what about User Defined Functions? Firstly, in case you don’t know what UDFs are, they’re basically a snippet of JavaScript which performs some functionality on a single record (to my knowledge). This JavaScript can be called using the SQL syntax when querying DocumentDB. In my case I needed to write a small UDF to search for substrings within the Message property, so the JavaScript was:

function(input, searchTerm) {
    return input.Message.toLowerCase().indexOf( searchTerm ) > -1;
}

There are 2 ways to add UDFs and SProcs; just as an example, the way I initially added the above UDF was through code (as opposed to using a tool such as the very useful DocumentDB Studio).

private void SetupUDF(DocumentCollection collection)
{
    // Register the substring-search UDF (shown above) against the collection.
    UserDefinedFunction function = new UserDefinedFunction()
    {
        Id = "SubStringSearch",
        Body = @"function(input, searchTerm) 
            {
                return input.Message.toLowerCase().indexOf( searchTerm ) > -1;
            }"
    };

    var t = DocClient.CreateUserDefinedFunctionAsync(collection.SelfLink, function);
    t.Wait();
}
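
For what it’s worth, registering the stored procedure through code follows the same pattern as the UDF; a minimal sketch (the Id “SearchMessages” and the sprocBody variable holding the JavaScript from earlier are my own placeholders):

private void SetupSproc(DocumentCollection collection, string sprocBody)
{
    // sprocBody contains the JavaScript source of the search SProc shown earlier.
    var sproc = new StoredProcedure()
    {
        Id = "SearchMessages",
        Body = sprocBody
    };

    var t = DocClient.CreateStoredProcedureAsync(collection.SelfLink, sproc);
    t.Wait();
}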

Once SetupUDF is called, then we’re able to use the function “SubStringSearch” via the SQL syntax.

var queryString = "SELECT r.Message FROM root r WHERE SubStringSearch( r, 'mysearchterm')";      
var results = DocClient.CreateDocumentQuery<MessageEntity>(collection.SelfLink, queryString);

Hey presto… we now have substring searching available via the SQL syntax (of course, once the DocumentDB team adds a “like” type of operator, this will no longer be needed). So, how did it perform?

I really had high hopes for this in comparison to the Stored Procedure. My understanding is that the SProc and UDF are “compiled” in some fashion behind the scenes and aren’t interpreted at query time. I also thought that since the UDF is called within a SQL statement which is completely run on the DocumentDB storage servers then the performance would be comparable to the SProc. I was wrong. Really wrong.

The results for the same set of documents were:

527ms
476ms
485ms
464ms
425ms

That’s 3-4 times worse than the SProc. Not what I hoped nor wanted. I’ve double checked the code, but alas the code is so small that I doubt even *I* could mess it up. I think.

So what about larger data sets? Instead of searching 1000 documents for a term that only appears in 10, what about 7500? (or more precisely 7505, since that’s when I got bored waiting for the random doc generator to finish)

It’s worse.

The SProc got:

420ms
101ms
139ms
106ms
137ms

Which is comparable to the results it previously got. But the UDF seems to scale linearly…  it got:

3132ms
3163ms
3259ms
15369ms
17832ms

Ignoring those last 2 entries for a moment, it looks like if I increase the document collection by 7.5 times (1000 to 7505) then my times also appear to increase by a similar factor. This was definitely unexpected.

Now, those last 2 entries are interesting. I’ve only shown one of my test runs here, but with virtually every test run performed I’d end up with one super large query time. This was due to a RequestRateTooLargeException being thrown and the LINQ provider retrying the request. Why would the UDF method be getting this when the SProc apparently does not, even though the SProc executes the query “select c.Message from root c” (i.e. gets EVERY document)?

So it looks like UDFs are slower and do not scale. Also one fact I discovered is that you can only call a single UDF per SQL query, but I’m guessing this is just an artificial limitation the DocumentDB team has enforced until the technology becomes more mature.

It is a disappointment that UDFs are not as quick as (or even comparable to) the SProcs, but I’m not giving up hope yet. If SProcs can be that quick, then (to my simplistic mind) I can’t see why UDFs couldn’t be nearly as quick in the future.

As a closing note, while trawling through the Fiddler traces when performing the tests I discovered some scary facts relating to UDFs and the linear performance. When I executed the SProcs for testing I was getting Request Charges of:

628.82
627.18
629.06
617.53
629.18

 

But for the UDF approach the Request Charges were:

6911.24
6926.76
6913
7047.82
6903.06

I have not investigated the charges further, but it is certainly on my to-do list.
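
If you’d rather not dig through Fiddler traces, the client SDK also exposes the request charge directly. Something along these lines should do the trick (a sketch only; sprocLink and queryString are the same placeholders as above):

// Page through the UDF query and sum the RequestCharge reported for each page.
// AsDocumentQuery() comes from the Microsoft.Azure.Documents.Linq namespace.
var query = DocClient.CreateDocumentQuery<MessageEntity>(collection.SelfLink, queryString)
    .AsDocumentQuery();

double udfCharge = 0;
while (query.HasMoreResults)
{
    var page = query.ExecuteNextAsync<MessageEntity>().Result;
    udfCharge += page.RequestCharge;
}

// The stored procedure response exposes its charge in the same way.
var sprocResponse = DocClient.ExecuteStoredProcedureAsync<string>(sprocLink, "mysearchterm").Result;
var sprocCharge = sprocResponse.RequestCharge;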

Conclusions to all of this? As much as I dislike SProcs in general (and business logic being able to creep into the datastore layer) I think I’ll have to continue using them.

DocumentDB is still my favourite storage option for my various projects (more features than Azure Table Storage but not as huge/crazy as Azure Database). It has its limitations, but the service/platform is young.

I’m definitely going to be re-running my tests to keep an eye on the UDF performance. One day, maybe UDFs will be good enough that I can say goodbye to SProcs forever (cue party music…   fade out)

AzureCopy 0.16.0 out!

So, a lot has changed but at the same time there aren’t many differences. The main list of changes for 0.16.0 are:

– Updated all dependent libs to latest and greatest (AWS etc etc).

– Skydrive integration has been replaced with Onedrive, which (despite what MS initially said) means a few API changes. So far, so good.

– Small modification on how Sky/One drive is configured (but still use the -configonedrive flag).

– Misc refactoring.

 

The Nuget package, Github source and compiled executable have been updated.

Developing with a Surface Pro 2

This is a quick follow-up to my original post on developing with a Surface Pro 1.

After thoroughly enjoying my original Surface Pro I decided to upgrade to a Surface Pro 2 (256GB) as my main machine (both for coding and non-coding). The main aim of the upgrade was to get the 8G RAM as opposed to the Pro 1’s 4G limit.

 

Wow… simply, wow!

 

Given that I spend virtually all my time in Visual Studio (which can certainly be a memory/CPU hog at times), it was going to be the make-or-break application for the Surface. If it ran badly, no Surface work for me. Fortunately the machine doesn’t skip a beat. Yes, it doesn’t compile my projects as quickly as a 3GHz i7 with 32G of RAM, but I really, really don’t care about that. My main project at work is a touch under 100k lines of C# (and a bazillion lines of JS), and it cleans and rebuilds in around 26 seconds. Given I don’t normally completely clean and rebuild every single time I compile (usually I just hit F6 for a “build”), my compile times are in practice closer to 8 seconds. I can definitely live with that. So that’s a tick for being able to handle day-to-day workloads.

 

My usual workload on the Surface Pro 2 is Windows 8.1 Pro (Update 1), Visual Studio 2013, SQL Server 2012, IIS Express, SSMS, Skype for Desktop, iTunes, Sublime Text, Evernote, many PowerShell/command prompts, SourceTree, Github for Windows and anywhere between 5 and 100 Chrome tabs. Although I’d call myself “slightly OCD” when it comes to monitoring memory usage, I’m very happy with how things are running. Currently I have a system commit of around 5.4G with physical memory usage at 4.5G. Plenty of room for VS to expand and consume (all of it). The single biggest jump in system commit and physical memory happens once I start debugging the main project in VS; then the private bytes jump to almost 1G, but hey, that’s a developer’s life…

I’ve had plenty of 8G and 16G RAM machines previously (hell, even a 32G one at one stage), but I’m still consistently surprised by how much this “mere tablet/ultrabook” gets done. As a general-purpose development machine I can’t really fault it.

This doesn’t mean I’m careless about leaving non-essential memory-hogging processes running, e.g. I’ll turn SQL Server on/off as required.

It’s compact, works great with USB-based docking stations, is speedy (enough) and handles all workloads that I throw at it.

 

but…

 

Although I was a fan of the kickstand at the beginning, I have to admit that on a train (where I am currently) it’s not the most comfortable thing to use. Still, I mainly use it at a desk so it’s not a big problem.

BlobSync Nuget package released

After much tinkering about with Blob Updating, I’ve decided to release a Nuget package and see if anyone is interested.

The source is available on Github and the theory behind the logic used is in a previous post. What I’d like to describe here is the practical use of the newly released Nuget package “BlobSync”.

The BlobSync library aims to be cloud agnostic, but in reality Azure Blob Storage is currently the only implementation available. So, to begin with, create a Windows Console application and add the Nuget package “BlobSync” to the reference assemblies. At the time of writing the latest (and only) version available is 0.1.0.
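
From the Package Manager Console that’s simply (assuming the package id is exactly “BlobSync”):

Install-Package BlobSync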

Next we’ll need to modify the App.config to include the Azure account information from the Azure Portal. Open the app.config and add the entries:

<appSettings>
  <add key="AzureAccountKey" value="Your Account Key"/>
  <add key="AzureAccountName" value="Your Account Name" />
  <add key="SignatureSize" value="100000" />
</appSettings>

 

Hopefully the key and name settings are self-explanatory, but let’s dig a bit deeper into “SignatureSize”.

Whenever a blob is uploaded it will be broken into “SignatureSize” sized chunks (Blocks in Azure lingo). So say I have a file that is 250000 bytes in size, this means I’ll end up with 2x100k blocks as well as a 50k block (the remaining bytes simply make up a block of the appropriate size).

This means that when we attempt to upload a modified version of the file later on, we’ll have the option of replacing 1 or more of these blocks. Now, in this particular case we won’t be saving much, but this is just the beginning. Say we deal with very large files; then simply uploading 100k as opposed to uploading an entire 300M file is definitely a saving worth considering.

We also have the option of reducing the SignatureSize to something else (anywhere between 1 byte and 4M in reality). If we want finer-grained replacement then we can reduce the SignatureSize to 1k (for example), but we need to remember that Azure block blobs can only be constructed from 50000 blocks. This means that if all our blocks are 1k in size then the maximum size our block blob can be is 1k * 50000 == 50M, i.e. not that big. I’ve found that 100k is a good starting point.
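
To spell out that arithmetic (illustrative numbers only, not part of the BlobSync API):

// 250000 byte file with a 100000 byte SignatureSize:
var signatureSize = 100000;
var fileSize = 250000;
var fullBlocks = fileSize / signatureSize;   // 2 full 100k blocks
var remainder = fileSize % signatureSize;    // plus one 50000 byte block

// Azure block blobs max out at 50000 blocks, so the block size caps the blob size.
var maxBlocks = 50000;
var maxBlobSizeAt1k = 1000L * maxBlocks;     // 1k blocks => roughly 50M maximum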

 

Now, to get coding…

 

We’re going to use the class “BlobSync.AzureOps” for this example. You can see via IntelliSense that there are 6 methods of potential interest. In reality the calling code should only ever be concerned with 2 of them, “DownloadBlob” and “UploadFile”. I think the names are pretty self-explanatory.

So to upload a file to Azure Blob Storage, we can do the following:


var blobSyncClient = new BlobSync.AzureOps();
blobSyncClient.UploadFile("mycontainer", "myblob", "c:\\temp\\myfile.txt");

Assuming you have a container called “mycontainer” and a local file “c:\temp\myfile.txt” then you’ll end up with 2 blobs in Azure Blob Storage. The first will be called “myblob” and this has the same contents as “myfile.txt”. The second will be called “myblob.0.sig”, which I’ll call the Signature Blob. This signature blob contains information about “myblob” which will be used when any further uploads or downloads occur.

 

Say you now modify “c:\temp\myfile.txt” and want to update the version in the blob.

You can now execute the exact same 2 lines as before and this time the BlobSync library will perform a number of tasks:

 

1) Checks to see if a signature blob exists.

2) Downloads the signature file.

3) Uses the information in the signature file to determine which parts of the local file have been modified (compared to the existing blob).

4) Uploads the changes to Azure Blob Storage.

5) Generates a new signature file and uploads it.

 

Now the blob and the local file should be identical but with the minimum data transferred over the wire.

 

Downloading works pretty much the same way.

 

If you modify the local version and then decide that you want the version in the blob, you simply run the code:


var blobSyncClient = new BlobSync.AzureOps();
blobSyncClient.DownloadBlob("mycontainer", "myblob", "c:\\temp\\myfile.txt");

 

The BlobSync library will then perform the following steps:

1) Checks to see if a signature blob exists.

2) Downloads the signature file.

3) Uses the information in the signature file to determine which parts of the local file have been modified (compared to the existing blob).

4) Downloads only those blocks from Azure Blob Storage that are not already available in the local file.

5) Reconstructs the local file based on the changes downloaded.
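
Putting it all together, a full round trip with the library ends up being just the two calls shown above (same placeholder container, blob and path):

var blobSyncClient = new BlobSync.AzureOps();

// Initial upload: creates "myblob" plus its "myblob.0.sig" signature blob.
blobSyncClient.UploadFile("mycontainer", "myblob", "c:\\temp\\myfile.txt");

// ... modify c:\temp\myfile.txt locally, then upload again: only changed blocks go over the wire.
blobSyncClient.UploadFile("mycontainer", "myblob", "c:\\temp\\myfile.txt");

// Decide the blob version is the one you want: download it, reusing unchanged local blocks.
blobSyncClient.DownloadBlob("mycontainer", "myblob", "c:\\temp\\myfile.txt");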

 

Currently BlobSync is aimed at reducing bandwidth requirements and isn’t optimised for the quickest transfers. In reality it probably is quicker, but it does not go out of its way to parallelise downloads/uploads etc. This is something I’ll be adding soon to speed things up.

If anyone has any improvements or suggestions, please leave a comment.