Dropbox and direct links

During some refactoring of AzureCopy I’ve decided to finally add Azure CopyBlob support for Dropbox. This means that locally you can run a command to copy from Dropbox to Azure Blob Storage and none of the traffic actually goes through where AzureCopy is running, huge bandwidth/speed savings!

The catch is that it appears (I’ve NOT fully confirmed this yet) that Azure CopyBlob doesn’t like redirection URLs, which is what I was receiving from Dropbox. I was generating a “shared” URL for a particular Dropbox file which in turn generates an HTTP 302 redirection and then gives me the real URL. Azure CopyBlob doesn’t play friendly with this. The trick is to NOT generate a “shared” URL but to generate a “media” URL. Quoting from the Dropbox API documentation: “Similar to /shares. The difference is that this bypasses the Dropbox webserver, used to provide a preview of the file, so that you can effectively stream the contents of your media.

Once I made that change, hey presto, no more redirects and Azure CopyBlob is now a happy little ummm “thing”.

Upshot is now I can migrate a tonne of data from Dropbox to Azure without using up any of my own bandwidth.

woohoo Smile

Advertisements

DocumentDB, Node.js, CoffeeScript and Hubot

For anyone that doesn’t already know, Hubot is Githubs ever present “bot” that can be customized to respond to all sorts of commands on a number of different messaging platforms. From what I understand (I don’t work at Github, so I’m just going by what I’ve read) it is used for build/deploy to production (and all other environments), determining employee locations (distributed teams) and a million other things. Fortunately Github has made Hubot open source and anyone can download and integrate it into Skype, Hipchat, Campfire, Slack etc etc. I’ve decided to have a crack at integrating it into my work place, specifically against the Slack messaging system.

I utterly love it.
During a 24 hour “hackday”, I integrated it into Slack (see details) and grabbed a number of pre-existing scripts to start me off. Some obvious ones (for a dev team) are TeamCity integration, statistics and statuses of various 3rd party services that we use and information retrieval from our own production system. This last one will be particularly useful for support, having an easy way to retrieve information about a customer without having to write up new UI’s for every change we do. *very* dev friendly Smile

One thing I’ve been tinkering with is having Hubot communicate directly with the Azure DocumentDB service. Although I’ve only put the proverbial toe in the water I see LOTS of potential here. Hubot is able to run anywhere (behind corporate firewall, out on an Azure Website or anywhere in between). Having it access DocumentDB (which can be accessed by anywhere with a net connection) means that we do not need to modify production services/firewalls etc for Hubot to work. Hubot can then perform these queries, get the statistics/details with ease. This (to me) is a big win, I can provide a useful information retrieval system without having to modify our existing production platform.

Fortunately the DocumentDB team have provided a nice Node.js npm package to install (see here for some examples). This made things trivially easy to do. The only recommendation I’d suggest is for tools/services/hubots that are read-only, just use the read only DocumentDB Key which is available on the Azure Portal. I honestly didn’t realise that read-only keys were available until I recently did some snooping about, and although I’m always confident in my code, having a read-only key just gives me a safety net against production data.

Oh yes, CoffeeScript. I’m not a Javascript person (I’m backend as much as possible, C# these days) and Hubots default language is CoffeeScript. So first I had to deal with JS and THEN deal with CoffeeScript. Yes, this last part is just my personal failing (kicking and screaming into the JS era).

An example of performing a query against DocumentDB in Node.js (in Coffeescript) follows. First you need to get a database reference, then a collection reference (from the DB) then perform the real query you want.

DocumentClient = require(“documentdb”).DocumentClient;
client = new DocumentClient( process.env.HUBOT_DOCUMENTDB_ENDPOINT, “masterKey”:process.env.HUBOT_DOCUMENTDB_READONLY_KEY} );
GetDatabase client, ‘(database) –>
  GetCollection client, database._self, ‘(collection) –>
    client.queryDocuments(collection._self, “select * from docs d where d.id = ‘testid’”).toArray   (err, res) –>
      if res && res.length > 0
        console.log(res[0].SomeData);

GetCollection = (client, databaseLink, callback) –> 
  collectionQuery = { query: ‘SELECT * FROM root r WHERE r.id=”mycollection”’};
    client.queryCollections( databaseLink, collectionQuery).toArray (err, results) –> 
      if !err
        if results.length > 0
            callback( results[0]);

GetDatabase = (client, databaseName, callback ) –>
  dbQuery = { query: ‘SELECT * FROM root r WHERE r.id=”mydatabase”’};
    client.queryDatabases(dbQuery).toArray (err, results) –> 
      if !err
        if results.length > 0  
            callback(results[0]);

Given CoffeeScript is white space sensitive and my blog editor doesn’t appear to allow me to format the code *exactly* how I need to, I’m hoping readers will be able to deduce where the white space is off.

End result is Hubot, Node.js and DocumentDB are really easy to integrate together. Thanks for a great service/library Microsoft!

Azure DocumentDB performance thoughts

Updated: Typos and clarifying collections.

I’ve been developing against Azure DocumentDB storage for over 6 months now and have to say, overall I’m impressed. It gives me more than Azure Table storage (great key/value lookup but no searching via other properties) but isn’t a 800 pound gorilla of Azure Database. For me it sits nicely between the two, giving me easy development/deployment but also lets me index which fields I like (admittedly I’m sticking with the default of “all”) and query against them.

Now, my development hasn’t just been idle curiosity with a bit of tinkering here and there, but is a commercial application that is out in the wild (although in beta) currently. It is critical that language support, tooling, performance and documentation quality is met. For the most part it has, I’m personally very happy with it and will push for us to continue using it where appropriate.

Initially DocumentDB was NOT available in the region where my Azure Web Roles/VM’s where running (during development we had Web Roles running out of Singapore but DocumentDB out of west-us). This was fine for development purposes but was a niggling concern that *when* will DocumentDB appear in Singapore? Well finally it did, and the performance change “felt” to improve.

Felt…  tricky word. I swear sometimes when I tinker with my machine it “feels” faster…  but it’s probably just mind over matter. (Personally I’d love to be involved in some medical trial where I end up with a placebo. I swear it would cure me of virtually anything… or at least I feel it would) Smile

Ahem, I digress. So it “felt” faster  once DocumentDB appeared in Singapore but I know others didn’t really notice any difference. Admittedly there are LOTS of moving parts in the application and DocumentDB is just one small cog in a big machine. Maybe I was bias, maybe I was the only one paying attention, maybe I was fooling myself? Time to crank out Visual Studio and see what lies/statistics and benchmarks will tell me.

One of our development accounts had enough data to make it mostly realistic (ie not just a tiny tiny sample of data which wouldn’t prove anything). But that was sitting in west-us…   so the benchmarks I took were slightly the reverse of what production was.

In production we have the VM/WebRole and DocumentDB in Singapore where as previously we have VM/Webrole in Singapore and DocumentDB in West-US. For the purposes of my benchmarking I’ve kept the DocumentDB in west-us (test data) and have 2 VM’s setup to do the testing. One in west-us and one in Singapore.

First, some notes about the setup. Originally we had 4 collections setup with a given DocumentDB account (for explanation of a collection, see here). The query was through the LINQ provider (using SQL syntax) with a couple of simple where conditions (company = x and userid = y type of thing). Very simple, very straight forward. The query was also only executed against one of the collections. The other collections had data but were not relevant for this query.

So, what did I find?

When the test was run on a VM in Singapore against DocumentDB in west-us, the runtime results were:

3916ms

3899ms

3904ms

3962ms

3928ms

3881ms

Giving an average of 3915ms

Where as running the same test in the west-us resulted in:

431ms

456ms

684ms

494ms

422ms

425ms

With an average of 485ms.

That’s an improvement of 88%. This really shouldn’t be a surprise, the Pacific ocean is a tad large. I bet all those packets got very soggy and slowed downWinking smile

Another change that I’ve been working on is merging our 4 collections into a single collection. It has been stressed by the DocumentDB team that collections are not tables. Regardless of this, when we setup our collections originally we did make them as if they were tables. ie a single type of entity would be stored in a single collection. Although I’ll eventually end up with just the single combined collection, during these tests all 5 collections all co-existed within the same DocumentDB account.

I’ve been modifying/copying the data from the 4 collections to a single “uber collection” which really is the way it should have been done in the first place. My only real source of confusion is when querying this combined collection how do we know what to serialize the response objects as?

ie if I perform a query and I get a mix of results (class A and class B), how do I deal with it? This really was an artificial problem. The reality is that my queries really didn’t change (that much). If I was originally querying collection 1 for results I’d always get back results serialized as a list of Class A objects. If I’m doing the same query against the combined collection I should still get the same results. The only change I did to the objects (and the query) was that in each Document stored in this combined collection I added a “DocType” property which was assigned some number (really enum). This way I could modify my query to be something like:   “….. original query…..  AND e.DocType=1”   etc.

This just gave me a little piece of mind that my queries would only return a single Document Type and that I wouldn’t have to “worry my pretty little head” over some serialization trickery later on.

So… what happened? Is a combined collection better or worse performance wise? A resounding BETTER is the answer. For the *exact* same data (just copying the documents from the 4 collections into the combined collection) and adding the DocType property I got the following results:

WebRole in Singapore with DocumentDB in west-us:

3598ms

3614ms

3624ms

3641ms

3563ms

3616ms

Giving an average of 3609ms. This is an 8% improvement.

For everything in west-us I then got:

144ms

155ms

136ms

185ms

159ms

136ms

With the average being 152ms. This is an improvement of 69%!!!!  HOW??? WHY???? (not that I’m complaining mind you). What appears to have happened is that regardless of compute vs storage location approximately 300ms has been shaved off the query time. ie The average for compute/storage in different locations went from 3915ms to 3609ms with a difference of 306ms. When we have compute and storage in the same location the averages were 485ms to 152ms, having a difference of 333ms.

I’ll be asking the DocumentDB production team for any advice/reasoning around this merely to satisfy my own curiosity but hey, not going to look a gift horse in the mouth.

When I get some time I’ll do some more tests to see if this DocType property I added somehow improve the performance. If I added that to the scenario where I had the 4 collections, would it speed things up? I can’t see how, since I’m just using it to filter document entity types and for the test when I have multiple collections I’m really only querying one of them (which has a single entity type in it). More investigations to follow…..

Iceberg Example

In my previous post I examined how to use BlobSync to create a tool that not only uploads/downloads deltas to Azure Blob Storage (and hence saving LOTS of bandwidth), but also how to keep multiple versions in the cloud easily.

As a sample file for uploading/downloading I’ve picked the entire Sherlock Holmes collection. Big enough that it can show the benefits of dealing with deltas for bandwidth savings, but small enough that it can be easily edited (text).

Firstly, I perform the original upload.

image

Here you can see that the original sherlock file about 3.6M and for the initial upload the entire file is uploaded (indicated by the “Uploaded 3868221 bytes” message).

Then I list the blobs and it shows I only have 1 version (called “sherlock” as expected).

iceberg2

Now, I edit the sherlock file and modify a few lines here and there, and reupload it.

 

iceberg3

We can instantly see that this time the upload only transferred 100003 bytes. Which is about 2.6% of the original file size. Which is a nice saving.

Then we list the blobs associated with “sherlock” again. This time we see 2 versions:

  • sherlock 8/01/2015 11:36:09 AM +00:00
  • sherlock.v1 8/01/2015 11:36:01 AM +00:00

Here we see sherlock and sherlock.v1.  The original sherlock blob that was uploaded was renamed to sherlock.v1. The new sherlock uploaded is now the vanilla “sherlock” blob.

Note: The timestamps still need a little work. The ones displayed are when blobs were copied/uploaded. This means that sherlock.v1 doesn’t have the original timestamp when sherlock was originally uploaded but when it was copied from sherlock to sherlock.v1. But I can live with that for the moment.

Now, say I realise that I really want to have a copy of the original sherlock. The problem is that my local version has been modified. No problems, now I can tell update my local file with the contents of sherlock.v1 (remember, thats the original one I uploaded).

iceberg4

The download was 99k (again, not the 3.6M of the full file). In my case the c:\temp\sherlock is now updated to be the same as the blob sherlock.v1 (ie the original file). How can I be sure?

Well, I happen to have a spare copy of the original sherlock file on my machine (c:\temp\sherlock-orig), and you can see from my file compare (fc.exe) that the original sherlock and my newly updated local copy are the same.

Now I can upload/download deltas AND have multiple versions available to me for future reference.

So, what happens with all my backups I don’t want? Well, you can always load up any Azure Storage Explorer program and delete the blobs you don’t want. Or you can use Icerberg to prune them for you.

Say I’ve created a few more versions of sherlock.

iceberg5

But I’ve decided that I only want to keep the latest 2 backups (ignoring the most current one). ie I want to keep sherlock, sherlock.v2 and sherlock.v3.

I can issue the prune command as such:

iceberg6

Here I tell it prune all but the latest 2 backups of the sherlock blob. I list the blobs afterwards and you can indeed see that apart from the latest (sherlock) there are only the 2 latest backups.

I’m starting to look at using this for more of my own personal backups. Hopefully this may be of use to others.

Versioned backups using BlobSync

As previously described, the BlobSync library (Github, Nuget, Blog) can be used to update Azure block blobs without having to upload the entire file/blobs. It perform an intelligent delta calculation and uploads the minimal data possible.

So, what’s next?

To show possible use cases for BlobSync, this post will outline how it is easily possible to create a backup application that not only uploads the minimal data required but also keeps a series of backups so you can always restore a previously saved blob.

The broad design of the program is as follows:

  • Allow uploading (updating) of blobs.
  • Allow downloading (updating of local files) of blobs
  • Allow multiple versions of blobs to exist and prune what we don’t want.

For this I’m using Visual Studio 2013, other versions may work fine but YMMV. The version of BlobSync I’m using is the latest available at time of writing (0.3.0) and can be installed through Nuget as per any other package (for those who are new to Nuget, please see the Nuget documentation).

Of the three requirements listed above only the last one really adds any new functionality above BlobSync. For the upload/download I really am just using a couple of equivalent methods in BlobSync. For the multiple versions we need to figure out which approach to use.

What I decided on (and has been working well) is that for updating of an existing blob, the following process is used:

  • Each blob will have a piece of metadata which has the latest version number of the blob
  • On upload the existing blob is copied to another blob with the name <original blob name>.v<latest version number>. (along with paired signature blob)
  • New delta is uploaded against existing blob.

For example, say we have a blob called “myfile”. This means we also have a “myfile.0.sig” which is the paired signature blob.

When we upload a new version of myfile the following happens:

  • copy myfile to myfile.v.1
  • copy myfile.0.sig to myfile.v.1.0.sig
  • upload delta against myfile

This means that myfile is now the latest version and myfile.v.1 is the version that previously existed. If we repeat this process then again myfile will be the latest and what used to be myfile will now be myfile.v.2 and so on. It should be noted that the copying of the blobs is performed by the brilliantly useful Azure CopyBlob API which allows Azure it copy the blob itself and doesn’t require any traffic between the application and Azure Blob Storage. This is a BIG time saver!

Now that we’d have myfile, myfile.v.1 and myfile.v.2 we should also be able to use this new project to download any version of the file. More importantly be able to just download the deltas to reduce bandwidth usage (since that is the aim of the game).

So this is the high level design in mind…   you might want to look at the implementation.

Azure Storage Accounts and diagnostics.

Fool me once, shame on you. Fool me twice, shame on me.

What about 4-5 times? Think that’s still definitely me I’m afraid.

 

I hit a problem today where my Azure serviceconfiguration clearly stated the AccountName (here “ABC”)

 <Setting name=”Microsoft.WindowsAzure.Plugins.Diagnostics.ConnectionString” value=”DefaultEndpointsProtocol=https;AccountName=ABC;AccountKey=XXXX/>

Problem is, I was deploying from within Visual Studio 2010, and if you go into Advanced Settings in the publish screen it turns out you can override this. Somehow it had got configured with a setting from another project I’d been recently working on.

Anyways, word to the wise..   Check Visual Studio!!

 

 

 

Azure pet project.

Over the last few weeks I’ve been putting the final touches to a pet project of mine. (mentioned previously on my old blog).

Basically it’s a large in memory tree that accepts a bunch of key/value pairs and performs a depth first comparrison against the tree to see what matches. It’s all first year comp sci material but I still see definite potential on using it at work.

The last piece of the puzzle has been to “Azure-ize” it. Given the memory requirements aren’t huge (1G of RAM is fine) and the CPU isn’t really ever stressed a “small” Azure instance does the job nicely.

Then came the disappointment. I’d used Azure queus many times in the past but it was always used between various Azure compute instances. In those cases 500 ops per second weren’t an issue. The catch now is that the push/pops are now going over the internet.

GOD IT IS SLOW!!!!

I think I got a max of 2 push’s per second. Given that I originally wrote this alerting engine (as I call it) to provide a high speed matching/alerting framework this was a major major blow. I *assume* (not proved) that this is because all queue messages are copied into 3 locations before the syncronous operation returns. This ofcourse takes time.

Sooo to get around this issue I’m switching the input Azure queue to just be a REST interface (will be good enough) and the output results can still be a queue. This is mostly coded and I believe will easily out perform anything we currently have.

 

There are a number of situations where I can see this simple alert/matching framework to provide some real benefit.

1) Matching documents (k/v pairs) that are being indexed by search engines (ESP/FS4SP/Lucene etc).

2) Be used by Sharepoint as documents are uploaded and will provide a notification mechanism to the admins (if this is required).

3) Will provide a good opportunity at work to merge both Azure and Search into a real customer installation.