Friday, August 29, 2008

App Engine conclusions

Having been eyeball deep in App Engine for a while, evaluating it for a project at work and putting together a presentation for the Utah Open Source Conference, I've reluctantly concluded that I don't like it.  I want to like it, since it's a great poster child for Python.  And there are some bright spots, like the dirt-simple integration with Google accounts.  But it's so very very primitive in so many ways.  It's not just the missing features, or the "you can use any web framework you like, as long as it's Django" attitude; the bigger problem is that a lot of the existing API is just so very primitive.  

The DataStore in particular feels like a giant step backwards from using a traditional database with a sophisticated ORM.  Sure, it can scale if you use it right, but do you really know what that entails?

Take the example of simple counting of objects.  There's a count() method, but in practice, it's so slow you can't use it.  Denormalize with a .count property?  Yeah, that doesn't scale either: what you really need is a separate, sharded Counter class.  And yes, sharding is very, very manual.  (See slides 18-23 in the link there, and the associated video starting about 19:00.)  
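
To make the sharding idea concrete, here is a minimal sketch of a sharded counter. It is not the App Engine API: a plain dict (FAKE_DATASTORE, a made-up name) stands in for the datastore so the snippet runs anywhere, and the shard count of 20 is an arbitrary assumption. The point is only the shape of the technique: each write touches one randomly chosen shard, and each read sums all of them.

```python
import random

# In-memory stand-in for the datastore (hypothetical; a real App Engine
# app would store each shard as its own entity).
FAKE_DATASTORE = {}
NUM_SHARDS = 20  # arbitrary choice for this sketch


def increment(name, num_shards=NUM_SHARDS):
    """Add one to a randomly chosen shard so concurrent writers rarely collide."""
    shard_key = "%s-%d" % (name, random.randrange(num_shards))
    FAKE_DATASTORE[shard_key] = FAKE_DATASTORE.get(shard_key, 0) + 1


def get_count(name, num_shards=NUM_SHARDS):
    """Read every shard and sum them; reads pay the price for cheap writes."""
    return sum(FAKE_DATASTORE.get("%s-%d" % (name, i), 0)
               for i in range(num_shards))


for _ in range(1000):
    increment("todo_items")
print(get_count("todo_items"))  # 1000
```

Even in this toy form, all of the bookkeeping (picking a shard, naming the keys, summing on read) is code you have to write and maintain yourself.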

You can't perform joins in GQL.  Or subselects.  Or call functions, aggregate or otherwise.  EVERYthing you are interested in needs to be pre-computed.  (Or computed by hand client-side, which is so slow it's barely an option at all.)  I can extrapolate from this to my experience in production schemas and it's not pretty.

Of course, you also lose any ability to write declarative, set-based code, which is demonstrably less error-prone than the imperative alternative.  Take a simple example from my demo app.  Marking a group of todo items finished is four statements:

items = TodoItem.get_by_id(
  [int(id) for id in request.POST.getlist('item_id')])
for item in items:
  item.finished = datetime.now()
  item.put()

Compare this with SQL:

# assumes a driver such as psycopg2, which adapts a Python tuple to a SQL list;
# the parameters argument is a one-element tuple holding that tuple of ids
cursor.execute("update todo_items set finished = CURRENT_TIMESTAMP where id in %s",
               (tuple(int(id) for id in request.POST.getlist('item_id')),))

Scalability is great, but taking a big hit to back-end productivity is too high a price for all but a few applications.  GAE is still young, so maybe Google will improve things, but their attitude so far seems to be "we know how to scale so shut up and do it the hard way."  I hope I am wrong.
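
For anyone who wants to run the set-based version without a database server, here is a minimal sketch of the same update against an in-memory SQLite database. The table layout, the sample rows, and the posted_ids list (a stand-in for request.POST.getlist('item_id')) are made up for illustration. Note that sqlite3 uses ? rather than %s as its bind marker and does not adapt lists, so the sketch builds one placeholder per id.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table todo_items (id integer primary key, title text, finished timestamp)")
conn.executemany(
    "insert into todo_items (id, title) values (?, ?)",
    [(1, "write slides"), (2, "book travel"), (3, "rehearse")])

# Stand-in for request.POST.getlist('item_id'): ids arrive as strings.
posted_ids = ["1", "3"]
item_ids = [int(item_id) for item_id in posted_ids]

# One "?" per id. Only the placeholders are spliced into the SQL text;
# the id values themselves are still passed as bind parameters.
placeholders = ", ".join("?" for _ in item_ids)
cursor = conn.execute(
    "update todo_items set finished = CURRENT_TIMESTAMP where id in (%s)" % placeholders,
    item_ids)
conn.commit()

print(cursor.rowcount)  # number of rows the single statement updated
unfinished = [row[0] for row in
              conn.execute("select id from todo_items where finished is null")]
print(unfinished)  # ids that were not in the posted list
```

The loop over items never appears in application code; the database does it, which is the whole argument for the declarative version.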

14 comments:

Anonymous said...

For your last example, I'd much rather do it the not-sql way. Less bug prone; using objects instead of strings; more maintainable.

Jonathan Ellis said...

How is it less bug prone to manually write a loop instead of letting the SQL compiler do it for you? To my mind, code you DON'T write is the best kind.

(The SQL example does not use string interpolation; most DB-API drivers use %s as the bind variable marker, too. Potentially confusing, but there's a couple syntactical clues to this even if you didn't know.)

Anonymous said...

I agree with (other) anonymous, the first version is far more readable.

It is not just about code you don't write, it is about code you need to later read and comprehend.

mike bayer said...

the SQL statement at the end is very easy to comprehend when issued by a reasonable ORM:

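# Query.update() takes a dict of column values rather than keyword arguments
session.query(TodoItem).\
    filter(TodoItem.id.in_(request.POST.getlist('item_id'))).\
    update({'finished': func.current_timestamp()}, synchronize_session=False)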

SQL still wins for me.

Noah Gift said...

I have to agree with your comment, "you can use any web framework you like, as long as it's django" for now. I haven't given up, as it does include WebOb for the most part and it is possible to get some third party components working like tempita.

Hopefully Google wakes up and puts less of an emphasis on Django and more on getting as many modules as possible to work in the standard library. There is still hope for appengine.

Shannon -jj Behrens said...

Useful. Thanks.

sevenseeker said...

Indeed Google would be wise to enable several more modules (like ast with full write) where there is not a legitimate security problem.
This would open up the Python community to some real possibilities.

But I do not want to be harsh. It would appear that app engine at this stage is a proof of concept if nothing else to see what the initial response is to it.

David said...

"Code you don't write is the best kind"

Amen!

Eran said...

What if you had a helper to do your batch work for you?

Something like this (I'm assuming we already have the items at hand):

items = TodoItem.get_by_id(
  [int(id) for id in request.POST.getlist('item_id')])

b = batchUpdater(items)
b.finished = datetime.now()
b.put()

Would that be better?

After all, the SQL statement you issue is syntactic sugar for updating this data in the DB. The DB will (eventually, at the lowest level) go through the ids one by one and update them. The syntax makes it cleaner for you.

If you had a similar syntax here, it wouldn't be such a problem, plus, such a wrapper can better manipulate the back-end in a batch operation fashion.

Microsoft has LINQ in the latest version of .NET which makes the syntax (in most cases) easier when writing code but in compilation will generate the same ugly code one would do without LINQ. AFAIK, LINQ is used to query data, not update it, but you get the idea.

Anonymous said...

Scalability /is/ important. It is way more important than being able to do subselects and joins.

I've inherited the mess that is "let's just put everything in a database and worry about scaling later" from initial implementors, and it is /always/ a mistake to architect that way. Well, unless your plan is to fail. Because 100% guaranteed, if you create something that becomes popular, the single massive centralized database will cause you endless pain and torment until you partition it, at which point *YOU LOSE SUBSELECTS, JOINS, AGGREGATE FUNCTIONS et al ANYWAY*.

GQL is the right approach. Learning to architect systems that don't rely so much on those SQL-inspired crutches is the future. And the future is lovely.

Jonathan Ellis said...

It's important to know what you're designing for. It's true that for some applications, scalability is "way more important than being able to do subselects and joins." For most, though, ease of development is the right thing to optimize for. Resist the temptation to assume that whatever the bottleneck is on your most recent project, is _always_ the most important one.

(And it's worth pointing out that even if you're writing an app that will eventually need to scale to millions of users, if you can't get the first version out, useful, and serving thousands of users before your competition, you still lose.)

Mert Nuhoglu said...

Hi Jonathan, thanks for your critique of GAE. I am now evaluating GAE too. It is nice to see a critical perspective.
But regarding the SQL and Datastore example: reading either snippet actually requires more or less the same amount of effort. The Datastore snippet consists of 25 words, whereas the SQL one consists of 23 words.

Justin Wilson said...

(real quick shout-out. Hey Jonathan how've you been? This is Justin from Neumont.)

I know I'm a little late to the game here.. but here goes..

Google has inspired me to pick up Python again and I've been browsing the web for info on some of the web frameworks available. I'm mainly interested in Django/Pylons/web2py. I was leaning more towards Pylons but I also like how easily web2py integrates with GAE, and how easily you can pull your app back out of GAE.

Have you used web2py at all? If so, what are your thoughts about it, and would you recommend it over Pylons/Django?

Jonathan Ellis said...

If I were going to write a GAE app I would go with django because nothing else is better enough (on GAE) to give up the relatively large django + GAE community.

Off of GAE I still prefer pylons because of the relatively good sqlalchemy integration (see my project formalchemy).

I think designing an app so you can pull it out of GAE is not worth what you give up. Choose your platform and then play to its strengths instead of limiting yourself to a lowest common denominator.

(I really like slicehost as an option for the more traditional route -- cheap starter plan, painless expansion of your vm when you need it, and good management tools. Disclaimer: I work for rackspace, which recently bought slicehost.)