Sunday, May 12, 2013

We've Moved!

We've moved! 

This blog is now hosted on codetrips.com (WordPress) 


All past articles are available there, and new ones will (only) appear there - come visit us!

Sunday, January 6, 2013

Python Decorators

I've now been working in Python for more than a year and we have been doing some pretty crazy stuff, especially around decorators and authorization/permissions implementations.

Admittedly, Python decorators are one of the bits of the language that can best be described as 'magic', and they certainly remain puzzling for someone like me, used to Java's strict (and, I still believe, much safer) compile-time type-checking.

Anyway, sparked by this very useful article on decorators, I decided to explore class decorators further (the original article only mentions them in passing) and found a few twists. The most surprising one: the call sequence, and the type of object that ends up bound to the decorated name, both depend on whether the decorator annotation is followed by a parameter list.
It is worth noting that Python mocks use decorators pretty heavily, and it's easy to see why once you browse the code below: it's heavily commented and should give enough of an idea of what goes on; it can be easily executed by just running
python fun_decorators.py



"""
@copyright: AlertAvert.com (c) 2013. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

   http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

Decorators
==========

Code to try out Python decorators.

Original idea from:
http://www.brianholdefehr.com/decorators-and-functional-python

@author: Marco Massenzio (m.massenzio@gmail.com)
         Created on [2013-01-05]
"""

class DecoratorClass(object):

    def __init__(self, klass=None, **kwargs):
        """This gets called every time the @DecoratorClass annotation
        is encountered.

        @param klass: the type of the class being decorated, it will be 
                passed in if no parameters are passed in the annotation
        @type klass: type
        @param kwargs: a dictionary containing any of the named 
                parameters passed in the decorator's declaration
        @type kwargs: dict
        """

        # If the decorator is declared without parameters, 
        # the class type will be passed in
        self._name = 'default'
        self._klass = None
        if klass:
            print 'No-args decorator for: ', klass.__name__
            self._klass = klass
        else:
            print 'Args declared: ', kwargs
            self._name = kwargs.get('name')

    def __call__(self, *args, **kwargs):
        """This gets invoked every time a decorated class gets created, 
        with the actual arguments in what would appear to look like a 
        constructor call: they do not have to (and in fact, won't) match 
        the actual formal argument list of the
        constructor of the decorated class (which may not even get invoked).

        If the decorator is declared with one or more arguments 
        (see Different class below), then the first item in ``args`` will be the
        **type** of the class being decorated (which
        will not have been passed in to the __init__()).

        This gets invoked immediately after self#__init__() if the decorator is 
        declared with one or more arguments, or when the "constructor" is invoked.
        """
        print 'Decorator __call__ invoked with: ', args, kwargs
        if self._klass:
            print 'Decorated instance: ', self._klass.__name__
            return self._klass(self._name)
        else:
            print 'Setting up klass'
            self._klass = args[0]
            # Here we inject a 'class-level' static method
            self._klass.get_name = self.get_name
        return self._klass

    def get_name(self):
        return self._name


@DecoratorClass
class Decorated(object):
    def __init__(self, name):
        print '>>>>> I am being decorated! <<<<<'
        self.name = name

    def call_me(self, *args):
        print 'I\'m being called: ', self.name
        print 'These are my args: ', args

# Notice how here a 'type' name (Decorated) has been completely 
# 'hijacked' to point to a specific
# instance of a DecoratorClass object (this provides 'closure')

print 'what is Decorated here?', Decorated
# >> what is Decorated here? <__main__.DecoratorClass object at 0x7f34b71308d0>

@DecoratorClass(name='another')
class Different(object):
    def __init__(self):
        # Notice how here we are using a 'static' method 
        # that has been injected by the decorator
        self.name = Different.get_name()

    def method(self, *args):
        """This is the same implementation as Decorated#call_me(), 
        just to show how they behave differently
        """
        print 'I\'m being called: ', self.name
        print 'These are my args: ', args

    def __call__(self, *args, **kwargs):
        return 'Different called with: ', args, kwargs

# Here, Different is whatever DecoratorClass#__call__()  returned: 
# this happens to be what one would expect it to be (a Different class 
# type) but that's only because the code makes it so

print 'what is Different here? ', Different
# >> what is Different here?  <class '__main__.Different'>

print 'Calling Decorated class constructor'
# Note how the params being passed here have no relation with the 
# constructor argument list and in fact, the name of the 'decorated' 
# class is not even given here, but in the decorator

# Due to the 'magic' of decorators, DecoratorClass#__call__() is
# instead invoked here,
# on the instance that was created at declaration
deco = Decorated(123, "Hello", foo='foo', baz='baz')

# Again, now ``deco`` here happens to be what one would expect it to 
# be (an instance of the Decorated class) but solely because the code 
# makes it so - it could have been anything, really

deco.call_me(1, 2, 3, 1)
# >> I'm being called:  default
# >> These are my args:  (1, 2, 3, 1)

print 'Creating now a Different class:'
# Here the call does actually invoke the constructor (__init__()) 
# for the Different class, and the returned object is again what one 
# would expect it to be (an instance of Different):
diff = Different()

diff.method('quart', 'naught')
# >> I'm being called:  another
# >> These are my args:  ('quart', 'naught')

# Obviously, as the Different#__call__() method is defined, we can call it too:

print diff(1, 2, 3, value='val')
# >> ('Different called with: ', (1, 2, 3), {'value': 'val'})

# If we now create an entirely different instance of a Different object,
# it will still have the same 'static' method injected by the decorator:
another_diff = Different()

print 'My name is:', another_diff.name
# >> My name is: another
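
The two call sequences can be boiled down to a few lines. This is only a minimal sketch, not part of the original script: the names `Trace`, `Plain`, `WithArgs` and the `calls` list are made up for illustration, and it records the calls instead of printing them so that it runs unchanged on Python 2 and 3.

```python
calls = []  # records (event, detail...) tuples in the order they happen


class Trace(object):
    def __init__(self, klass=None, **kwargs):
        # Bare form (@Trace): klass is the decorated class.
        # Parametrized form (@Trace(...)): klass is None, kwargs has the args.
        calls.append(('__init__', klass.__name__ if klass else None, kwargs))
        self._klass = klass

    def __call__(self, *args, **kwargs):
        if self._klass is None:
            # Parametrized form: invoked once, right after __init__,
            # with the decorated class as the only argument.
            calls.append(('decorate', args[0].__name__))
            return args[0]
        # Bare form: invoked every time the "constructor" is called.
        calls.append(('instantiate', args))
        return self._klass(*args, **kwargs)


@Trace
class Plain(object):
    def __init__(self, x):
        self.x = x


@Trace(name='another')
class WithArgs(object):
    pass
```

After the module loads, `Plain` is bound to a `Trace` instance while `WithArgs` is still a plain class, which is exactly the asymmetry shown in the longer listing above.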

Sunday, December 23, 2012

Fireside Chat with Jeff Bezos

AWS re:Invent post 3 of 3

trust me - the one on the right is Jeff Bezos

A guiding principle for Amazon is its focus on "stuff that won't change over time:" for example, it's extremely unlikely that people will ever want stuff at higher prices or slower delivery.
As an entrepreneur, one should equally focus on the 'basics' of the respective problem domain: ask yourself, "what are the key user requirements that won't change in 3-5 years?"

If you want to innovate:
  • select people who want to innovate: this isn't for everybody!
  • have a willingness to fail and expect to be criticized.



AWS is experiencing strong growth in enterprise, government and education.


Thanks to the growth of online commerce and information sharing, consumers now (rightly) feel 'entitled' to near-perfect information (e.g. comparison shopping).


Hence, the power is shifting from marketing to product development: businesses can no longer rely on 'less-informed' customers being bamboozled by clever advertising and marketing ploys - also, in the age of Facebook, and social media in general, word spreads quickly: this cuts both ways.


This also applies to AWS, where developers get greater transparency around resource consumption so they can optimize their code; it equally exposes incompetence and laziness.


At Amazon, he enables everyone on the team to 'pull the cord' - fix defects nearer the source: this is a reference to the Kaizen manufacturing principles pioneered by Toyota, where, on the assembly line, anyone is allowed (in fact, encouraged) to pull a cord, stopping the assembly line and effectively stopping production, if they see a defect that needs fixing.

This is counter-intuitive, but incredibly effective, as fixing defects 'far from the source' (in production) is way more expensive (and may well be nigh on impossible) than fixing them during development and testing.

Operating a low-margin business is hard: there's nowhere to hide; high margins cover a lot of sins and waste (and I guess every one of us in the audience could hear the subtitle: Oracle, HP, IBM wouldn't stand a chance, were they not able to hoodwink customers into overspending for wasteful services).


Good entrepreneurs eliminate risks, gradually, until they have a viable business: so you need a systematic way of identifying risks in your product development and distribution chain.


A question he hears often is "Can AWS go after Enterprises and startups at the same time?"
Apparently the needs are very similar: low-cost, highly available computing and storage resources.


Other areas, like certification and security, were developed mostly to meet the needs of large organizations but will benefit start-ups too, at a fraction of the cost: they will make them ready for the day they are no longer "start-ups", and remove the need to migrate to some more expensive, high-margin, wasteful service.

10,000 year clock

Bezos is one of the investors behind this project of having a clock that will work for 10,000 years - he made a joke about it ("yes, I know, you're all thinking 'and he seemed sane until now'"), but he undertook it as a symbol for long-term planning.
When the focus is on such a long timescale horizon, we can accomplish things that would not be otherwise feasible.
He sees it as a means to change our way of thinking to foresee and avoid dangerous activities.

Blue Origin

Another of his projects, this is the design and production of a vertical take-off reusable space vehicle: the goal is to make space travel (or, at least, sub-orbital trips) affordable; the only way to do so is to have a re-usable vehicle.

They are currently at the 3rd development iteration, and on course for sub-orbital trips.

Of course, they leveraged AWS for lots of simulations, requiring massive computing power.

Saturday, December 8, 2012

AWS re:Invent - my notes (post 2 of 3)


Note - the first post (Notes about the AWS JDK) can be found here

The following are the (minimally edited) notes I took while attending the sessions at the AWS re:Invent conference in Las Vegas, NV on 27-30 November, 2012.


Keynote speech by Andy Jassy (Sr. Vice President, Amazon Web Services)

It is clear we are still at the dawn of cloud computing and PaaS in general: although AWS provides a rich API and lots of flexibility, if one draws a parallel with the evolution of programming languages, having to choose which type of instance to run one's webapp on is a bit like having to do 'register allocation' before compilers came round.


Considering the rise of mobile computing and the obvious fit between thin mobile clients and cloud deployments, cloud-backed 'Personal computing' is the next trend.

There has been great emphasis (later even more strongly stated by Jeff Bezos) on the strategic difference between low-volume/high-margin and high-volume/low-margin model for the provision of public cloud services, with AWS firmly in the latter camp.

Reed Hastings (Netflix CEO) made an appearance and gave a testimonial about how AWS enabled them to grow 1,000x over 4 years, to 1BN hours/month of streaming watched: this would not even have been thinkable in a self-owned data centre, let alone affordable.

Having access to a vast (and continually evolving) set of infrastructure services has enabled the Netflix Engineering team to focus on 'higher order' problems and concentrate their efforts on their core business.

He also announced that 'House of Cards' w/ Kevin Spacey is coming out in Feb!

AWS is all about innovation

82 significant innovations introduced in 2012
158 new services

Amazon Redshift
"Amazon Redshift is a fast and powerful, fully managed, petabyte-scale data warehouse service in the cloud. Amazon Redshift offers you fast query performance when analyzing virtually any size data set using the same SQL-based tools and business intelligence applications you use today." 
http://aws.amazon.com/redshift


They ran an experiment on a subset of Amazon Retail data (2BN rows): queries were 10x faster.

SAP
SAP HANA One on AWS at $0.99/hr (Dev Edition)
Also 'test drive' to try it out


Data Planner
Integrated with other AWS services
Uses UI to drag-n-drop components
Connect logs generated to S3, EMR
It has scheduler and pre-conditions
They use the same 'pipeline' term
Can run 'bash' scripts stored in S3

Screenshot of the demo

Update: This has now been officially launched as the Data Pipeline, which is an exciting validation of SnapLogic's concept of integration flows as Snaps and Pipelines (see here)

... and low prices

They just announced an average 19% reduction on EC2 prices, and here at re:Invent Jassy just announced a 24-27% reduction on the price of S3 storage.

Key strategic objective is to facilitate Enterprise migration to cloud
Amazon VPC + Direct Connect + Route 53
Elastic Load Balancing
DynamoDB is the fastest growing service in the history of AWS

Enterprise on AWS, no longer an 'if' but more a matter of 'how far'



Java AWS SDK for Eclipse

Web UI frontend
EC2 back-end workers
S3 for the data storage
DynamoDB for metadata
SQS to distribute workload among workers


Suitable for CPU intensive apps, where processing must not be executed in the context of the Web UI
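
To make the shape of this architecture concrete, here is a rough, in-process sketch. Everything in it is a stand-in I made up for illustration: a `queue.Queue` plays the role of SQS, plain dicts play S3 and DynamoDB, and threads play the EC2 workers; no AWS API is involved.

```python
import queue
import threading

jobs = queue.Queue()   # stand-in for SQS: distributes work among workers
blob_store = {}        # stand-in for S3: holds the data to process
metadata = {}          # stand-in for DynamoDB: holds per-job metadata
lock = threading.Lock()


def worker():
    """Back-end worker: pulls job keys off the queue until told to stop."""
    while True:
        key = jobs.get()
        if key is None:          # sentinel: shut this worker down
            jobs.task_done()
            break
        result = sum(blob_store[key])   # placeholder for the CPU-heavy work
        with lock:
            metadata[key] = {'status': 'done', 'result': result}
        jobs.task_done()


def submit(key, payload):
    """What the web front-end does: store data, record metadata, enqueue."""
    blob_store[key] = payload
    with lock:
        metadata[key] = {'status': 'queued'}
    jobs.put(key)


def run_demo(num_workers=3):
    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for i in range(5):
        submit('job-%d' % i, list(range(i + 1)))
    jobs.join()                  # wait until every submitted job is processed
    for _ in threads:
        jobs.put(None)
    for t in threads:
        t.join()
    return metadata
```

The point of the design is that `submit()` returns immediately: the front-end never runs the expensive step itself, it only writes the data and enqueues a reference to it.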

All this can be done via Eclipse AWS SDK
  • Can manage multiple aws accounts
  • Creates sample code working out of the box
  • Explorer view to manage all supported services

In particular, you can see all S3 buckets and explore their contents (including virtual directories); drag-and-drop works to/from the local filesystem
(note: after upload, only the owner can see the object, so the web UI can't access it; right-click and set permissions)


Explorer shows the tables on DynDb, create new tables, edit r/w capacity
DynDb editor shows just a page of results by default; can add a 'scan condition' to filter only a certain subset
Changes must be 'saved' to the actual db (using std 'save' cmd)


Can also execute remote debugging against a running instance.
(see my other post about Java SDK)

CloudWatch


Can monitor and set thresholds for alarms/notifications
   alerts can also take actions to scale up instances
Custom metrics (uses a ReST PUT API) -- also can use scripts (search on G)


Stores two weeks' worth of information; data can be pulled and pushed into long-term storage, analysis, map-reduce...

Asperatus (3rd party lib to push metrics)
Aws-java-sdk
Integrates with logging and JMX
Convention over configuration


Reports metrics by instanceId/AppName
Can also use a Logger (reports class, err msg)
Easy JMX integration


Deployed on the front-end to measure 'true' latency perceived by clients

BigData with Spark/Shark

AMP Lab at UC Berkeley: http://amplab.cs.berkeley.edu

see Mesos -- cluster virtualization mgr (Twitter uses it for 2,500 VMs in prod)
       BlinkDB -- approximate querying system (ML)


Spark is a fast, distributed MapReduce-like engine (using in-memory storage)
General execution graphs
Supports HDFS/S3/etc

Supports Scala/Java (soon Python)

val messages = spark.textFile("hdfs://...")
val errors = messages.filter(_.startsWith("ERROR"))


creates an RDD (Resilient Distributed Dataset)

Shark -- port of Apache Hive to Spark
Compatible with existing Hive meta store and HDFS data
Dynamic join algo selection (done in real-time at query time)
About 100 times faster than Hive on Hadoop, even unstructured data
   with GROUP BY (mandatory? ) takes a bit longer, but still 50x


CloudFront (AWS Content Distribution Network)
38 edge locations globally
When benchmarked for latency, CloudFront is 1st or 2nd vs. other CDNs in the US and other regions

Uses S3 for caching of content, with a Load Balancer in front, allows deployment to multiple Availability Zones (AZs) to offer HA.
Index service separate from Storage

Version Upgrade (the `new` way)

It used to be the case that, to upgrade a cluster of web servers, all sitting behind a LB, one would follow the usual pattern:
  1. light up a couple of 'canaries' with the new version and start routing some traffic to them; check for catastrophic (and not-so-catastrophic) failures, bugs, etc.
  2. once the 'canaries' survive, start deploying new instances with the new release and turn off the ones with the old - always keeping an eye on traffic, failures, latency, etc.
  3. if all goes well, one has managed a pretty serious upgrade cycle (with possibly thousands of nodes) without any downtime at all, and the users barely noticing a thing (if at all).
Awesome - the real problem is when, halfway through the upgrade, it turns out that a rollback is needed: at this point, you have a large number of servers, all in flux, without an easy way to tell which ones have the 'new' release, and which ones the 'old' - this could be a real operational nightmare.

Turns out that, at current prices, it's a lot more cost-effective to just:
  1. again, use 'canaries' (if you are doing cloud deployments and don't believe in staging, about time to start looking for a new job);
  2. leave the existing cluster alone, however, and just deploy an entire new one with the new release - once ready, just 'flip the switch' on your ELB and the new cluster starts serving traffic;
  3. anything serious happens, 'flip the switch' back, and you're back in the stable configuration;
  4. rinse and repeat, until the new release is stable and serving traffic within the operational parameters (but, really, if you have to do this more than twice, it's about time to find yourself a better dev team)
  5. Keep the 'old version' cluster around for a few days, just in case: the expenditure is minimal, and infinitely worth it if an emergency rollback is necessary.
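
The 'flip the switch' steps above can be sketched in a few lines. This is a toy model only: `LoadBalancer`, `flip()` and `deploy()` are names I made up, the clusters are just labels, and the `healthy` callback stands in for whatever monitoring you would really use; none of it calls the ELB API.

```python
class LoadBalancer(object):
    """Toy stand-in for an ELB that routes all traffic to one cluster."""

    def __init__(self, active):
        self.active = active     # cluster currently serving traffic
        self.standby = None      # previous cluster, kept around for rollback

    def flip(self, new_cluster=None):
        """Route traffic to new_cluster, or back to the standby if omitted."""
        target = new_cluster if new_cluster is not None else self.standby
        if target is None:
            raise ValueError('no cluster to switch to')
        # The cluster that was serving traffic becomes the standby.
        self.standby, self.active = self.active, target


def deploy(lb, new_cluster, healthy):
    """Switch to new_cluster; flip back if the health check fails."""
    lb.flip(new_cluster)
    if not healthy(new_cluster):
        lb.flip()                # rollback: the old cluster was never touched
        return False
    return True
```

Because the old cluster is left alone, a rollback is a single pointer swap rather than a hunt for which servers run which release.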
Don't make things more complex than they need to be
Design Architecture with your customers in mind, then use 'late binding' to pick which infrastructure serves you best

Deploying Python apps using BeanStalk

The python app can be developed as usual, without any change necessary.
Recommended the use of virtualenv
  virtualenv ./venv1
  source ./venv1/bin/activate (as usual) 
To create the requirements file use
  pip freeze > requirements.txt
and to install from it
  pip install -r requirements.txt


eb is a command-line tool to manage beanstalk
   
eb init
To create the env - generates the info to create it then use
    eb start
to actually deploy and get it started.
    eb status
to check status.
There is also an ini file that is read in ~/.elasticbeanstalk


EB also uses git integration to push the source/binary to the cloud:
  git aws.push

Options can be found in: .elasticbeanstalk/optionsetting
It also lets you drive AWS Auto Scaling triggers to start new EC2 instances (behind a LB); look for:

.ebextensions/python.config

Overall it seemed like an interesting tool, but the talk really went too fast to take any meaningful notes: I recommend reading the documentation here.

Sunday, December 2, 2012

Four Principles of Cloud Computing

From Werner Vogels (Amazon.com CTO) keynote speech at re:Invent:

  1. Controllable
    Architect with cost in mind

  2. Resilient
    Don't treat failure as an exception

  3. Adaptive
    Make no assumptions

  4. Data Driven
    Put everything in logs


AWS re:Invent Notes (1 of 3): SDK for Java


The following are my (only minimally edited) notes from the AWS re:Invent session (I highly recommend watching the recordings of the keynote sessions)

There are two usage levels in the SDK:
  1. Low level access to AWS system APIs; and
  2. Higher level facilities, that make it easier to access the underlying services.
Most notably, the SDK provides support for S3, DynamoDB, Policy API, Flow (SWF).


---
S3 TransferManager
  Support async transfers
  It is also a good example of hiding complexity from the developer, providing a clean interface.


Use ClasspathPropertiesFileCredentialsProvider#getCredentials to manage credentials (AWSCredentialsProvider interface)
  One can also load credentials from the EC2 instance that is running the web app
  There are a number of "providers" that cover most use cases, but if you don't like any of those, you can always implement your own.

TransferManager manages the pool of connections and is thread-safe, so it's best used as a Singleton; the main method to use is upload( ) with a PutObjectRequest( ) object, created on the fly.

The main stuff is in the Transfer i/f: very useful is the progress listener, which has one callback, progressChanged(ProgressEvent ev); the ProgressEvent will give you info about bytesTransferred (and a bunch of other info).

You can hook the listener into the Request, get the progress from the Upload object and then display progress (eg in a progress bar).
---
S3 Encryption Client

A client-side, industrial-strength encryption facility; it is the one used by AWS internally.

AmazonS3EncryptionClient 
Encrypts data on the fly, you can then use the standard S3 API, no other changes needed

The key management is a bit more complex than just straight encryption using a symmetric shared key, in order to make the whole process much more efficient and provide an easy way to update private keys should any one get compromised, without having to re-encrypt all the data.

The application holds the Data to encrypt and the Master Key; the SDK generates a one-time symmetric Envelope Key (see picture below): the Envelope Key is encrypted with the Master Key, while the data is encrypted with the Envelope Key.



Not only is symmetric encryption with the Envelope Key faster but, should the Master Key be compromised, a new one can be generated, and all the envelope keys decrypted with the old (now compromised) key and re-encrypted with the new (secure) one.

Clearly this process does not require any change to the encrypted data (whose size is likely to be much bigger than the envelope key) and is thus much more efficient.
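
The key-rotation argument is easier to follow in code. The sketch below is emphatically NOT how the SDK does it: `toy_cipher` is a deliberately simple XOR keystream I made up so the example needs nothing beyond the standard library, and it offers no real security. Only the structure (data encrypted with the envelope key, envelope key wrapped with the master key) mirrors the description above.

```python
import hashlib
import os


def toy_cipher(key, data):
    """XOR data with a SHA-256-derived keystream.

    TOY ONLY, not secure. Applying it twice with the same key
    returns the original data, so it both encrypts and decrypts.
    """
    stream = bytearray()
    counter = 0
    while len(stream) < len(data):
        stream.extend(hashlib.sha256(key + counter.to_bytes(8, 'big')).digest())
        counter += 1
    return bytes(b ^ s for b, s in zip(data, stream))


def encrypt_object(master_key, plaintext):
    envelope_key = os.urandom(32)            # one-time key for this object
    return {
        'wrapped_key': toy_cipher(master_key, envelope_key),
        'ciphertext': toy_cipher(envelope_key, plaintext),
    }


def decrypt_object(master_key, stored):
    envelope_key = toy_cipher(master_key, stored['wrapped_key'])
    return toy_cipher(envelope_key, stored['ciphertext'])


def rotate_master_key(old_master, new_master, stored):
    """Re-wrap the small envelope key; the (large) ciphertext is untouched."""
    envelope_key = toy_cipher(old_master, stored['wrapped_key'])
    return {
        'wrapped_key': toy_cipher(new_master, envelope_key),
        'ciphertext': stored['ciphertext'],
    }
```

Note that `rotate_master_key` only ever touches 32 bytes per object, however large the stored data is.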

For more details see the article here.
---
DynamoDB

(Note this is very similar to Android SDK, see further below)

AmazonDynamoDBClient(credentialsProvider)

It is sometimes useful to version the entities saved on DynamoDB, so as to detect concurrent changes:

@DynamoDBVersionAttribute
public Long getVersion( ) {
  return version;
}


The SDK will throw an exception on save when the same object has been changed concurrently by some other thread or client.
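
The idea behind a version attribute is plain optimistic locking. Here is a minimal in-memory sketch of it; `VersionedTable` and `ConcurrentUpdateError` are names I invented for the illustration and have nothing to do with the SDK's own classes.

```python
class ConcurrentUpdateError(Exception):
    """Raised when the stored version differs from the one the caller read."""


class VersionedTable(object):
    def __init__(self):
        self._rows = {}   # key -> (version, value)

    def load(self, key):
        """Return (version, value); a missing row has version 0."""
        return self._rows.get(key, (0, None))

    def save(self, key, value, expected_version):
        """Write value only if nobody else saved since we loaded the row."""
        current_version, _ = self._rows.get(key, (0, None))
        if current_version != expected_version:
            raise ConcurrentUpdateError(
                'expected version %d, found %d'
                % (expected_version, current_version))
        new_version = current_version + 1
        self._rows[key] = (new_version, value)
        return new_version
```

Two clients that both load version 0 can both try to save, but only the first one wins; the second gets the exception and has to reload and retry.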
The SDK is compatible with Android SDK: so the same code, with the same annotations will "just work" when used in an Android client.

It also supports batch updates


---
The SDK uses Apache Commons logging, so it's really easy to view client logs in Eclipse's console by turning on 'wire logging' in log4j.properties:

log4j.logger.httpclient.wire=DEBUG
For (much) more detail, see the article here.


AWS SDK for Android

In addition to the AWS Android SDK, to enable support for DynamoDB, just add the relevant jar (installing the SDK via Eclipse will do it for you).

AmazonDynamoDBClient   is the main class, along with the AttributeValue one

Use the DynamoDBMapper to support simple CRUD operations, as it provides both query/scan capabilities to find items and batch operations (but those are not transactional, so they can have 'partial success')

DynamoDB support in Java is modeled after JPA, in that it uses annotations (but it does not conform to JPA annotations):

@DynamoDBTable
public class UserPreferences {
  ....
}

// to define the hash key for the table
@DynamoDBHashKey
public int getUserId() {
   return id;
}

@DynamoDBAttribute
// not strictly necessary for 'standard' getter/setters
// as they will be automatically marked for inclusion
public String getLastName() {
  return lastName;
}

@DynamoDBIgnore
// for fields we don't need to persist (or are computed)

@DynamoDBVersionAttribute
// flag one attribute as a version number
This tracks whether data was changed while we were trying to save it: an Exception will be thrown when we attempt to push an update if another client has updated the row since we last read it.

For fields that cannot be trivially serialized or need custom treatment before being written to the db, use `marshalling`:

@DynamoDBMarshalling(marshallerClass = Conv.class)
public ComplexType getComplexValue() { return myComplexValue; }

Then define a class that implements the DynamoDBMarshaller interface:

public class Conv implements DynamoDBMarshaller<ComplexType> {
  @Override
  public String marshall(ComplexType val) {
    // convert val to its String form (toString() is just a placeholder)
    String valAsString = val.toString();
    return valAsString;
  }

  @Override
  public ComplexType unmarshall(Class<ComplexType> clazz, String complexAsString) {
    // do something like (ComplexType.parse() is just a placeholder):
    ComplexType res = ComplexType.parse(complexAsString);
    return res;
  }
}

Read/Write behavior -- supports eventual consistency
Table Override -- supports dev/prod environments

Prune local tracking of stale remote branches

If you suffer from OCD like myself, you'll likely be just as annoyed by the output of 

git branch -a

which (in a well-used local repo) after a while will show a long list of remote branches (origin/blah) the vast majority of which no longer exist.
The solution to the annoyance (which makes it difficult to locate an actually useful remote branch in the middle of the clutter) is to run:

git remote prune origin
[Note - this does NOT touch `origin`, but only acts locally, removing the stale remote-tracking refs under your .git/refs/remotes directory]

For more context, see this thread.
On the same note, please remember that

git push origin :stale_branch

is a thoughtful command to run regularly on your no-longer-used tracking remote branches, as a courtesy to your OCD-suffering colleagues :)