Wednesday, April 09, 2008

Google App Engine: Return of the Unofficial Python Job Board Feed

Over three years ago (!), I wrote a screen scraper to turn the Python Job Board into an RSS feed. It didn't make it across one of several server moves since then, but now I've ported it to Google's App Engine: the new unofficial python job board feed.
I'll be making a separate post on the Google App Engine business model and when it makes sense to consider the App Engine for a product. Here I'm going to talk about my technical impressions.
First, here's the source. Nothing fancy. The only App Engine-specific API used is urlfetch.
Unfortunately, even something this simple bumps up against some pretty rough edges in App Engine. It's going to be a while before this is ready for production use.
The big one is scheduled background tasks. (If you think this is important, star the issue rather than posting a "me too" comment.) A related need is a task queue that would let those scheduled tasks be split easily into bite-size pieces. That matters because it would let Google offer scheduled tasks (a) without worrying about runaway processes while (b) still letting a task accomplish an arbitrary amount of work.
If there were a scheduled task API, my feed generator could poll the Python jobs site hourly or so, and store the results in the Datastore, instead of having a 1:1 ratio of feed requests to remote fetches.
While you can certainly create a cron job to fetch a certain URL of your app periodically, and have that URL run your "scheduled task," things get tricky quickly if your task needs to perform more work than it can accomplish in the small per-request time allocation it gets. Fortunately, I expect a scheduled task API from App Engine sooner rather than later. Google wants to be your one-stop shop, and every web app I have ever seen has had some scheduled task component. Making that large set of applications rely on an external server to ping the app with this sort of workaround defeats that purpose completely.
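To make the "bite-size pieces" idea concrete, here is a minimal sketch of doing as much work as fits in a time budget and reporting where to resume. The function name `run_slice`, its parameters, and the budget value are all assumptions for illustration; App Engine provides nothing like this today.

```python
import time

def run_slice(items, start, budget_seconds, work, clock=time.time):
    """Process items[start:] until the time budget runs out.

    Returns the index to resume from; len(items) means everything is done.
    """
    deadline = clock() + budget_seconds
    i = start
    while i < len(items) and clock() < deadline:
        work(items[i])
        i += 1
    return i
```

The tricky part is everything around this function: the resume index has to be saved somewhere between requests, and something external has to keep pinging the URL until the returned index reaches `len(items)`.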
Another rough edge is in the fetch API. Backend fetches like mine need a callback API so that a slow remote server doesn't cause the fetch to fail every time because it gets auto-cancelled prematurely. Of course, this won't be useful until scheduled tasks are available. I'm thinking ahead. :)
Finally, be aware that fatal errors are not logged by default. If you want to log fatal errors, you need to do it yourself. The main() function is a good place for this if you are rolling your own simple script like I am here.
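A minimal sketch of that pattern, using only the standard `logging` module. The `handler` argument is a hypothetical stand-in for whatever the script actually does (fetch, parse, print the feed); it is not taken from the posted source.

```python
import logging

def main(handler):
    """Run the script's real work, logging any uncaught exception first."""
    try:
        handler()
    except Exception:
        # logging.exception logs at ERROR level and appends the traceback.
        logging.exception("fatal error while generating feed")
        raise  # re-raise so the failure is not silently swallowed
```

Re-raising after logging is a design choice: the request still fails as it did before, but the traceback now shows up in the logs.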

5 comments:

jek said...

This is great! Thank you!

Ksenia said...

If there were a scheduled task api, my feed generator could poll the python jobs site hourly or so, and store the results in the Datastore, instead of having a 1:1 ratio of feed requests to remote fetches.

You could probably store the results in the Datastore, along with the timestamp, and update it during a request every hour or so ("if the data is older than 1 hour, then fetch again"). That would make some requests take more time, but at most once an hour.

Jonathan Ellis said...

I considered doing that, but the complexity isn't worth it here.

The complexity is, what if two requests come in at the same time? You need some kind of locking to prevent duplicate entries; query-before-insert is still race-condition prone. Locking isn't provided by the GAE either, so you'd have to hack something using Datastore. (I'm assuming that the transactional semantics are strong enough that this is even possible; I haven't checked.)

So even for a simple case like this one, it's not as easy to work around the lack of scheduled tasks as you'd think. It really is a critical part of any non-toy application environment.

Kevin H said...

This is a really useful feed. Thanks for making it.

It would be helpful, though, if the location information of the positions wasn't lost in the translation. Any possibility of adding this?

Jonathan Ellis said...

Well, I posted the source. Patches accepted. :)