Postgres Backup

A great utility for backing up Postgres directly to S3. It provides a script for a base backup of the database and a script for continual backup of the WAL files to S3, enabled by setting the archive parameters in the postgresql.conf config file.

https://github.com/wal-e/wal-e
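
For reference, the setup looks roughly like this (a sketch based on the wal-e README - the envdir credentials directory, bucket name and data directory here are just examples, and the exact options may differ between versions):

# /etc/wal-e.d/env holds the credentials and target as one file per variable
# for envdir to pass in: AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY and
# WALE_S3_PREFIX (e.g. s3://my-backup-bucket/pg)

# postgresql.conf - ship each completed WAL segment to S3
wal_level = archive
archive_mode = on
archive_command = 'envdir /etc/wal-e.d/env wal-e wal-push %p'

# run periodically (e.g. from cron) - push a full base backup of the data directory
envdir /etc/wal-e.d/env wal-e backup-push /var/lib/postgresql/9.0/main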

Server backup to S3 without having AWSAccessKey and AWSSecretKey on server

Backups are good! (http://www.flickr.com/photos/49024304@N00/47244105/)

Having multiple servers dotted around with different backup solutions, I wanted to consolidate on one consistent backup solution utilising S3 for the actual storage.

While researching using S3 for storage, it became clear that the usual approach requires your AWSAccessKey and AWSSecretKey to be copied onto each server that needs to upload to S3.

But as Amazon states...

IMPORTANT: Your Secret Access Key is a secret, and should be known only by you and AWS. You should never include your Secret Access Key in your requests to AWS. You should never e-mail your Secret Access Key to anyone. It is important to keep your Secret Access Key confidential to protect your account.

Ideally I wanted a solution that didn't require the AWSAccessKey and AWSSecretKey to be copied onto every server requiring backup, as doing so increases the likelihood of the AWS account being compromised if any one server is compromised.

The solution to this comes from S3's browser-based uploads using POST.

This feature is designed to allow visitors to your web site to upload content directly to your S3 account without going through your server and without you passing any credentials to the web browser.

It has two features that allow us to use it for server backups:

  1. A signature that is generated from the AWSAccessKey and AWSSecretKey
  2. An expiration that can be set to a long time in the future

So the solution is to create a signature that allows uploads to a bucket and doesn't expire until well into the future, then use curl to POST to S3 using that signature.

Generating the correct policy document and curl parameters to make this happen took some time, so in case anyone else would like to do this, here's a bit of Ruby code to generate the curl command line...

require 'base64'
require 'openssl'

# Fill in your AWS credentials, bucket and key prefix below. The secret key
# never leaves the machine this script runs on - only the derived policy and
# signature end up on the servers doing the uploads.
aws_access_key_id = '*** AWS ACCESS KEY ***'
aws_secret_key = '*** AWS SECRET KEY ***'
content_type = 'application/octet-stream'
bucket = '*** BUCKET ***'
acl = 'private'
key_prefix = '*** FOLDER ***/'

# The policy document the signature authorises: uploads into this bucket with
# this ACL, any key and any content type, until the expiration date passes.
policy_document = '{
  "expiration": "2012-01-01T12:00:00.000Z",
  "conditions": [
    {"bucket": "' + bucket + '" },
    {"acl": "' + acl + '" },
    ["starts-with", "$key", ""],
    ["starts-with", "$Content-Type", ""]
  ]
}'

# Base64-encode the policy document and sign it with HMAC-SHA1 using the
# secret key. The newlines inserted by encode64 must be stripped out.
policy = Base64.encode64(policy_document).gsub("\n", "")

signature = Base64.encode64(
    OpenSSL::HMAC.digest(
        OpenSSL::Digest.new('sha1'),
        aws_secret_key, policy)
    ).gsub("\n", "")


# Emit the curl command line that performs the signed POST upload.
print 'curl '
print "-F 'key=#{key_prefix}${filename}' "
print "-F AWSAccessKeyId=#{aws_access_key_id} "
print "-F acl=#{acl} "
print "-F policy=#{policy} "
print "-F signature=#{signature} "
print "-F Content-Type=#{content_type} "
print "-F file=@FILENAMEHERE "
print "-F Submit=OK "
print "http://#{bucket}.s3.amazonaws.com"
print "\n"

Just fill in the variables with the required information and run the script.

The generated curl command line will then upload any file to S3. Just replace FILENAMEHERE in the command line with the required filename (and leave the @ before the filename).

Also, rate limiting is possible with curl. For example, --limit-rate 20K will keep curl to using only 20KB/s of your bandwidth - handy for stopping the backup from using the full bandwidth of your connection.
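
Putting it together, a generated command ends up looking something like this (the long base64 policy and signature values printed by the script are shown here as <...> placeholders, and the key prefix, bucket and filename are only examples):

curl --limit-rate 20K -F 'key=backups/${filename}' -F AWSAccessKeyId=<ACCESS_KEY_ID> -F acl=private -F policy=<BASE64_POLICY> -F signature=<SIGNATURE> -F Content-Type=application/octet-stream -F file=@/var/backups/db-2010-05-30.tar.gz -F Submit=OK http://mybucket.s3.amazonaws.com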

 

Dark Ages 2.0

This is a talk I gave at the Geek Night in Oxford. The complete slides of the talk are available.

A summary of the talk.

With photographs and the written word on paper, backup/storage is passive: barring physical damage, the content will be readable for years or even hundreds of years. With digital media, backup/storage is active: if you don't keep testing it, duplicating it and moving it forward to new media formats, your content will become unusable over time. Compare the box of photos and notes left in the loft for 50 years to a stack of CD-Rs. Will the discs be readable? Will you even have a drive to read them? Chances are the paper will be fine. This is a problem for all of us as time passes and we collect more content in digital format.

Full Talk

The first dark age occurred around 410AD, after the collapse of the Roman empire. The period lasted until shortly after 1000AD, and historically very little is known about it.

I think we are creating a new dark age. Not like the first one, but a dark age for family social history, caused by a loss of data.

I've been thinking about this a lot recently, as we now have a daughter. Taking photographs is now part of recording the family history for my daughter's future generations. Currently I have around 8000 photos stored in iPhoto, all tagged with descriptions, and all of it stored in its proprietary database. Hmmm.

This is a recent phenomenon: the problem is occurring now and its effects will not be seen until the future. Photography has been around since 1850, yet it's only since 2000 that digital cameras have been in common use.

The two big issues are:

  • Backups
  • Cataloging

Backups/Storage

We have a photo of my daughter's great-great-great-grandmother; the picture is over 123 years old. All being well, will my daughter's great-great-great-grandchildren in 123 years' time be able to view the photographs we have taken now? The work to make sure this is possible will be much more than just putting photographs into a box like our ancestors did. It's this 'active' management that is going to be the cause of the 'Dark Ages 2.0'.

Since photographs were invented around 1850, 'backups' have been easy: just chuck the negatives and photos in a box. This is what most families have done, and it's worked well for generations.

Now we have digital backups to carry out. There are many media options for storing backups and over time these degrade and fall out of use. We have to continually recreate backups, test them and move them forward as new media formats are developed - a lot of 'active' management. 

Just imagine a hard drive got 'stuck in the loft' like a box of photos might. After 50 years the photos would be fine. The hard drive? Even if it would spin up, would the data on the disk be readable? And with a USB connection, finding a computer to plug it into might prove difficult. 50 years ago personal computers didn't exist; the progress in the next 50 years will make it very difficult to read media from this era. I found a few old copies of computer magazines from around the mid 90s in the back of a cupboard, with floppy disks on the cover. I can still read the magazines, but I no longer own a 3.5" floppy drive to be able to read the disks.

This is all something we as technical people understand, but even as technical people we know we probably don't back up enough. So what chance does the 'average person on the street' stand of keeping on top of this? I personally know non-tech friends who have lost photos, some of very important family history moments: weddings, babies, loved ones no longer with us. It's the fact that digital backups are an 'active' process that causes the problem.

Previously, backing up/storage was a passive process: by putting things in a box it was done, and they would always come out in roughly the same condition they went in. This passive process has meant that family photos have survived down the generations. Now that the process is 'active', how much will survive over time?

Meta-Data

There are two options for where to store meta-data:

  • In the image file
  • In a separate database

Looking at our old photographs as an example of what to do: if we're lucky, somebody wrote the date the photograph was taken, and who is in it, straight onto the back of the photograph. This has stood the test of time well, as it's commonly the only way we know who features in the photographs.

Now imagine that instead of writing the meta-data on the back of the photograph they wrote a number, and then in a separate notebook, next to that number, they wrote all of the meta-data. Over time the photos and the notebook must always be kept together. What happens if the photos get shared out to different family members - would the notebook get split up or copied (by hand, unless it's in the last 30 years)? And if the notebook got lost, then all of the meta-data for all of the photos would be lost forever. Overall it doesn't sound like a very good solution, and thankfully most people just wrote on the back of the photographs.

Many photo software packages store the meta-data exactly as described above, in a separate proprietary database. iPhoto is one such program. When backing up photographs, the database needs to be backed up as well; they must always be kept together and in sync. iPhoto provides options for this, but in 50 years' time would iPhoto 59 be able to restore a backup from iPhoto 9 with all the meta-data intact? What other program would be able to load this proprietary database and extract the meta-data, and how much work would be required to obtain it? And if this one database file becomes corrupt, then all of the meta-data for every photograph is lost.

The other option is to store the meta-data within the image file itself, whether jpg or RAW (though for archiving, preferably convert your RAW images to DNG, as this open standard stands a much greater chance of continued support in the future). As the old adage goes, in computing we like standards - that's why we have SO many of them. And so it is with image meta-data, with the following standards: EXIF, XMP, IPTC & MakerNotes. By using a combination of these it is possible to store all of the required meta-data about an image directly within the image file.
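
As a rough illustration, here's a small Ruby sketch that shells out to the exiftool command-line tool to write a description and keywords straight into a jpg's EXIF/XMP/IPTC headers (exiftool must be installed; the filename, caption and keywords are just examples):

# Write a caption and keywords directly into the image file's meta-data
# headers using exiftool. '-P' preserves the file's modification time, and
# exiftool keeps a copy of the original file alongside the modified one
# unless told otherwise.
file = 'holiday_2009_042.jpg'                   # example filename
description = 'Grandma and Alice on the beach'  # example caption
keywords = ['family', 'holiday', '2009']        # example keywords

args = ["-ImageDescription=#{description}",     # EXIF
        "-XMP-dc:Description=#{description}"]   # XMP
args += keywords.map { |k| "-Keywords+=#{k}" }  # IPTC keywords
args << '-P' << file

system('exiftool', *args) or raise 'exiftool failed'

# The meta-data now travels with the file and can be read back by any tool:
#   exiftool -ImageDescription -Keywords holiday_2009_042.jpg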

MakerNotes throws a bit of a spanner in the works, as its internal content is not a defined standard but a 'binary blob' of data that records all of the camera details when the photo was taken: the ISO, lens type, shutter speed etc. Unfortunately there is no standard for this data; each manufacturer has come up with their own format, and not all of them are documented. If that wasn't bad enough, the data has absolute offsets within it, so changes to the rest of the meta-data can corrupt the MakerNotes. Picasa is one such program that helpfully stores all tags and descriptions in the EXIF/XMP headers but unfortunately does not understand MakerNotes, and can therefore corrupt this information in your image files.

Two programs that do handle all meta-data within the image files and correctly handle the MakerNotes are DigiKam & Adobe Lightroom 2. DigiKam is open-source and works on Linux, Windows and Mac OS X.

I think that for the storage of photographs, a set of plain file-system folders containing images in jpg or DNG format, with all meta-data contained within the images, will be the most resilient archival system and the one most likely to endure into the future.

Even keeping to this very simple storage structure, frequent testing, duplication and transfer across media formats will be required to maintain the archives for future generations.

What's your strategy?

 

Since giving this talk, an article along similar lines called Avoiding a Digital Dark Age has been published in American Scientist, which is also worth a read.

JavaScript - The Final Big Language (FBL)

JavaScript will not just be the NBL (Next Big Language); it will be the FBL - the Final Big Language.

Big statement, but I think the pieces are falling into place to make this happen, and I think Node.js will be a big driver of the process. It will drive JavaScript on the server side the way Rails drove Ruby for server-side development.

Rails crystallised many great ideas in how to develop web applications, and Ruby's design allowed this to be coded in a very clean way. Many of the ideas had been around for a while, but it took DHH using Ruby to seed the community around Rails. Just look how it's transformed web development over the past 6 years, and how it has influenced so many other frameworks in other languages.

I see node.js as the seed for JavaScript on the server side. OK, it's lower down the stack than Rails, but it's seeded the idea of what is possible with JavaScript on the server - just look at how interest is developing. Already frameworks taking the best of Rails/Django are starting to appear running on node.js, and the performance of such young frameworks hints at what will be possible in the near future.

The crucial factor in JavaScript being the FBL is that the server programming language now matches the client's. Do not underestimate the impact of this. Since web development began we've moved through various languages server side... Perl, Java, PHP, ASP, Python, Ruby, and many more. On the client side we've just had JavaScript since 1995 - 15 years! (OK, Microsoft did have a go with VBScript in the browser; enough said.)

Once you can develop on both the server and client side in one language, unless the client side changes, it seems unlikely you would move to another language on the server side. As a developer, why would you go from working with one common language and one common set of libraries covering both server and client to learning a separate language, when you are still going to be developing in JavaScript on the client side? The benefits of a new server-side language would need to be substantial to break from having one consistent language.

I'm not saying JavaScript is the 'best' language (however you define that), just that it will become very popular.

node.js will lead this: event-driven server-side programming that by its nature allows very high performance (even for such a new system), and JavaScript provides a very natural environment for callback-based development. Take a look at the chat example to see how natural a fit the environment is and the code reduction that comes from it.

JavaScript is being built into many technologies: just look at CouchDB using JavaScript for its view language, and, as I've previously written, JavaScript in Yahoo's query language and many other services - and obviously the browser.

How about developing apps for iPhone & Android in JavaScript? No problem - just take a look at Appcelerator.

The progress seems unstoppable. Will we all be JavaScript developers in the future?