Saturday, November 19, 2011

On applying for jobs

A friend asks,
If [I see a] job I could do, even though I don't meet the stated requirements, should I apply anyway?
Short answer: yes.

Longer answer: companies are all over the map here, although in general the less layers of bureaucracy there are between the team that the candidate will work with and the hiring process, the more likely the list of requirements is to be actual requirements.

How can you tell?

HR paper pushers like to think in terms of checklists because that lets them go through hundreds of resumes without any real understanding of the position, so they write ads like this one -- lots of really specific "5+ years of X," not much about what the position actually involves.

But if it's the team lead himself writing the description, which you will see at smaller companies, then you get much more about what the position involves and less checklist items, because the lead is comfortable determining competence based on skill instead of pattern matching. For a software development position, I don't care if you have a degree in CS if you can code. (Open-source contributions are a better signal for ability and passion than a degree, anyway.) My team has people with no degree, to people with PhDs.

Even when dealing with large companies, you have to factor in that people are terrible at distinguishing "want" from "need." A lot of "requirements" are really "nice-to-haves." It can be tough to tell the difference, but the better idea you have of what the job actually involves, the better you can tell which are hard requirements.

For instance: without knowing anything else about a position, my guess is that "native French speaker" really would be a hard requirement. That's not the sort of thing people tend to put down on a whim. But even then, there are shades of grey. For instance, if I were looking for a job and found a "distributed databases developer position, must know Java, be familiar with open source and be a native French speaker" then I might see if they'd give me a pass on the last part because I'm a really good fit for the rest -- and I know they're unlikely to find a lot of candidates with an exact match.

In short, you have little to lose by trying, but don't just shotgun out resumes; include a cover letter that highlights the best matches from your experience to what they are looking for. Follow up with the hiring manager if possible to ask (a) "I sent in my resume a few days ago, and I wanted to see where you are in the hiring process for this position," and if they reply that they got it but you're not a good fit, ask (b) what specifically they were looking for, so you can flesh out your intuition that much more for next time.

Good luck!

Tuesday, January 04, 2011

Apache Cassandra: 2010 in review

In 2010, Apache Cassandra increased its momentum as the leading scalable database. Here is a summary of the notable activity in three areas: code, community and controversy. As always, comments are welcome.

Code

2010 started with the release of Cassandra 0.5, followed by 0.6 and graduation from the ASF incubator a few months later. Seven more stable releases of 0.6 proceeded, adding many features to improve operations in response to feedback from production users.

0.7 adds highly anticipated features like column value indexes, live schema updates, more efficient cluster expansion, and more control over replication, but didn't quite make it into 2010, with rc4 released on new year's 2011.

We also committed the distributed counters patchset, begun at Digg and enhanced by Twitter for their real-time analytics product. Notable as the most-involved feature discussion to date, distributed counters started with a vector clock approach, but switched to a new design by Kelvin Kakugawa after we realized vector clocks were a dead end for anything but the trivial case of monotonic-increments-by-one.

One of the biggest trends was increasing activity around Cassandra as well as in the core database itself. 2010 saw Hadoop map/reduce integration, as well as Pig support and a patch for Hive.

We also saw Lucandra, which implements a Cassandra back end for Lucene and is used in several high volume production sites, grow up into Solandra, embedding Solr and Cassandra in the same JVM for even more performance.

Community

Cassandra hit its stride in 2010, starting with graduation from the ASF incubator in April. 2010 saw 1025 tickets resolved, nearly twice as many compared to 2009 (565).

Like many Apache projects, Cassandra has a relatively small set of committers, but a much larger group of contributors. In 2010 Cassandra passed over 100 people who have contributed at least one patch. Release manager Eric Evans put together a great way to visual this with a Code Swarm video of Cassandra development.

I started Riptano with Matt Pfeil in April to provide professional products and services around Cassandra. In October, we announced funding from Lightspeed and Sequoia. From May to December, we conducted eleven Cassandra training events in eight months, and twice that many private classes on-site with customers.

Riptano is now up to 25 employees, with offices in the San Francisco bay area, Austin, and New York, and engineers working remotely in San Antonio, France, and Belarus.

In August, Riptano and Rackspace organized a very successful inaugural Cassandra Summit, with about 200 attendees (videos available), followed by almost a full track at ApacheCon in November. Cassandra was also represented at many other conferences on multiple subjects, for several languages, and continents.

Controversy

Cassandra got a lot of negative publicity when Kevin Rose blamed Cassandra for Digg v4's teething problems. However, there was no deluge of bug reports coming out of Digg's Cassandra team, and Digg engineers Arin Sarkissian and Chris Goffinet (now working on Cassandra for Twitter) got on Quora to refute the idea that Cassandra was at fault:

The whole "Cassandra to blame" thing is 100% a result of folks clinging on to the NoSQL vs SQL thing. It's a red herring.

The new version of Digg has a whole new architecture with a bunch of technologies involved. Problem is, over the last few months or so the only technological change we mentioned (blogged about etc) was Cassandra. That made it pretty easy for folks to cling on to it as the "problem".

Meanwhile, Digg competitor Reddit has continued migrating to Cassandra, crediting it with enabling their 3x traffic growth in 2010.

More importantly, 2010 saw dozens of new Cassandra deployments, including a new contender for the largest-cluster crown when Digital Reasoning announced a 400-node cluster for the US government.

We look forward to another great year in 2011!

Monday, April 26, 2010

And now for something completely different

A month ago I left Rackspace to start Riptano, a Cassandra support and services company.

I was in the unusal position of being a technical person looking for a business-savvy co-founder. For whatever reason, the converse seems a lot more common. Maybe technical people tend to sterotype softer skills as being easy.

But despite some examples to the contrary (notably for me, Josh Coates at Mozy), I found that starting a company is too hard for just one person. Unfortunately, all of my fairly slim portfolio of business guys I'd like to co-found with were unavailable. So progress was slow, until Matt Pfeil heard that I was leaving Rackspace and drove to San Antonio from Austin to talk me out of it. Not only was he not successful in talking me out of leaving, but he ended up co-founding Riptano. And here we are, with a Riptano mini-faq.

Isn't Cassandra mostly just a web 2.0 thing for ex-mysql shops?

Although most of the early adopters fit this stereotype, we're seeing interest from a lot of Oracle users and a lot of industries. Unlike many "NoSQL" databases, Cassandra doesn't drop durability (the D in ACID), and besides scalability, enterprises are very interested in our support for multiple data centers and Hadoop analytics.

Are you going to fork Cassandra?

No. Although the ASF license allows doing basically anything with the code, including creating proprietary forks, we think the track record of this strategy in the open source database world is mixed at best.

We might create a (still open-source) Cassandra distribution similar to Cloudera's Distribution for Hadoop, but the mainline Cassandra development is responsive enough that there isn't as much need for a third party to do this as there is with Hadoop.

What does Rackspace think?

Rackspace has been the primary driver of Cassandra development recently, employing (until I left) the three most active committers on the project. For the same reasons Rackspace supported Cassandra to begin with, Rackspace is excited to see Riptano help take the Cassandra ecosystem to the next level. Rackspace has invested in Riptano and has been completely supportive in every way.

Where did you get the name "Riptano?" Does it mean anything?

We took a sophisticated, augmented AI approach. By which I mean, we took a program that generated random, pronouceable strings, and put together a couple fragments that sounded good together. (This is basically the same approach we took at Mozy, only there Josh insisted on a four letter domain name which narrowed it down a lot.)

I hope it doesn't mean "your dog has bad breath" somewhere.

And yes, Riptano is on twitter.

Are you hiring?

Yes. We'll have a jobs page on the site soon. In the meantime you can email me a resume if you can't wait. Prior participation in the Apache Cassandra project is of course a huge plus.

Wednesday, April 07, 2010

Cassandra: Fact vs fiction

Cassandra has seen some impressive adoption success over the past months, leading some to conclude that Cassandra is the frontrunner in the highly scalable databases space (a subset of the hot NoSQL category). Among all the attention, some misunderstandings have been propagated, which I'd like to clear up.

Fiction: "Cassandra relies on high-speed fiber between datacenters" and can't reliably replicate between datacenters with more than a few ms of latency between them.

Fact: Cassandra's multi-datacenter replication is one of its earliest features and is by far the most battle-tested in the NoSQL space. Facebook had Cassandra deployed on east and west coast datacenters since before open sourcing it. SimpleGeo's Cassandra cluster spans 3 EC2 availability zones, and Digg is also deployed on both coasts. Claims that this can't possibly work are an excellent sign that you're reading an article by someone who doesn't know what he's talking about.

Fiction: "It’s impossible to tell when [Cassandra] replicas will be up-to-date."

Fact: Cassandra provides consistency when R + W > N (read replica count + write replica count > replication factor), to use the Dynamo vocabulary. If you do writes and reads both with QUORUM, for one example, you can expect data consistency as soon as there are enough reachable nodes for a quorum. Cassandra also provides read repair and anti-entropy, so that even reads at ConsistencyLevel.ONE will be consistent after either of these events.

Fiction: Cassandra has a small community

Fact: Although popularity has never been a good metric for determining correctness, it's true that when using bleeding edge technology, it's good to have company. As I write this late at night (in the USA), there are 175 people in the Cassandra irc channel, 60 in the HBase one, 32 in Riak's, and 15 in Voldemort's. (Six months ago, the numbers were 90, 45, and 12 for Cassandra, HBase, and Voldemort. I did not hang out in #riak yet then.) Mailing list participation tells a similar story.

It's also interesting that the creators of Thrudb and dynomite are both using Cassandra now, indicating that the predicted NoSQL consolidation is beginning.

Fiction: "Cassandra only supports one [keyspace] per install."

Fact: This has not been true for almost a year (June of 2009).

Fiction: Cassandra cannot support Hadoop, or supporting tools such as Pig.

Fact: It has always been straightforward to send the output of Hadoop jobs to Cassandra, and Facebook, Digg, and others have been using Hadoop like this as a Cassandra bulk-loader for over a year. For 0.6, I contributed a Hadoop InputFormat and related code to let Hadoop jobs process data from Cassandra as well, while cooperating with Hadoop to keep processing on the nodes that actually hold the data. Stu Hood then contributed a Pig LoadFunc, also in 0.6.

Fiction: Cassandra achieves its high performance by sacrificing reliability (alternately phrased: Cassandra is only good for data you can afford to lose)

Fact: unlike some NoSQL databases (notably MongoDB and HBase), Cassandra offers full single-server durability. Relying on replication is not sufficient for can't-afford-to-lose-data scenarios; if your data center loses power, you are highly likely to lose data if you are not syncing to disk no matter how many replicas you have, and if you run large systems in production long enough, you will realize that power outages through some combination of equipment failure and human error are not occurrences you can ignore. But with its fsync'd commitlog design, Cassandra can protect you against that scenario too.

What to do after your data is saved, e.g. backups and snapshots, is outside of my scope here but covered in the operations wiki page.

Tuesday, March 30, 2010

Cassandra in Google Summer of Code 2010

Cassandra is participating in the Google Summer of Code, which opened for proposal submission today. Cassandra is part of the Apache Software Foundation, which has its own page of guidelines up for students and mentors.

We have a good mix of project ideas involving both core and non-core areas, from straightforward code bashing to some pretty tricky stuff, depending on your appetite. Core tickets aren't necessarily harder than non-core, but they will require reading and understanding more existing code.

Non-core

  • Create a web ui for cassandra: we have a (fairly minimal) command line interface, but a web gui is more user-friendly. There is the beginnings of such a beast in the Cassandra source tree at contrib/cassandra_browser [pretty ugly Python code] and a gtk-based one at http://github.com/driftx/chiton [also Python, less ugly].
  • First-class commandline interface: if you prefer to kick things old-school, improving the cli itself would also be welcome.
  • Create a Cassandra demo application: we have Twissandra, but we can always use more examples to introduce people to "thinking in Casssandra," which is the hardest part of using it. This one seems to be the most popular with students so far. (So stand out from the crowd, and submit something else too. :)

Almost-core

Core

  • Avro RPC support: currently Cassandra's client layer is the Thrift RPC framework, which sucks for reasons outside our scope here. We're moving to Avro, the new hotness from Doug Cutting (creator of Lucene and Hadoop, you may have heard of those). Basically this means porting org.apache.cassandra.thrift.CassandraServer to org.apache.cassandra.avro.CassandraServer; some examples are already done by Eric Evans.
  • Session-level consistency: In one and two Amazon discusses the concept of "eventual consistency." Cassandra uses eventual consistency in a design similar to Dynamo. Supporting session consistency would be useful and relatively easy to add: we already have the concept of a Memtable to "stage" updates in before flushing to disk; if we applied mutations to a session-level memtable on the coordinator machine (that is, the machine the client is connected to), and then did a final merge from that table against query results before handing them to the client, we'd get it almost for free.
  • Optimize commitlog performance: this is about as low-level as you'll find in Cassandra's code base. fsync, CAS, it's all here. http://wiki.apache.org/cassandra/ArchitectureCommitLog describes the current CommitLog design.

You can comment directly on the JIRA tickets after creating an account (it's open to the public) if you're interested or have other questions. And of course feel free to propose other ideas!

Wednesday, March 24, 2010

Cassandra in action

There's been a lot of new articles about Cassandra deployments in the past month, enough that I thought it would be useful to summarize in a post.

Ryan King explained in an interview with Alex Popescu why Twitter is moving to Cassandra for tweet storage, and why they selected Cassandra over the alternatives. My experience is that the more someone understands large systems and the problems you can run into with them from an operational standpoint, the more likely they are to choose Cassandra when doing this kind of evaluation. Ryan's list of criteria is worth checking out.

Digg followed up their earlier announcement that they had taken part of their site live on Cassandra with another saying that they've now "reimplemented most of Digg's functionality using Cassandra as our primary datastore." Digg engineer Ian Eure also gave some more details on Digg's cassandra data model in a Hacker News thread.

Om Malik quoted extensively from the Digg announcement and from Rackspace engineer Stu Hood, who explained Cassandra's appeal: "Over the Bigtable clones, Cassandra has huge high-availability advantages, and no single point of failure. When compared to the Dynamo adherents, Cassandra has the advantage of a more advanced datamodel, allowing for a single row to contain billions of column/value pairs: enough to fill a machine. You also get efficient range queries for the top level key, and even within your values."

The Twitter and Digg news kicked off a lot of publicity, including a lot of "me too" articles but some interesting ones, including a highscalability post wondering if this was the end of the mysql + memcached era. If not quite yet the end, then the beginning of it. As Ian Eure from Digg said, "If you're deploying memcache on top of your database, you're inventing your own ad-hoc, difficult to maintain NoSQL system." Possibly the best commentary on this idea is Dare Obasanjo's, who explained "Digg's usage of Cassandra actually serves as a rebuttal to [an article claiming SQL scales just fine] since they couldn't feasibly get what they want with either horizontal or vertical scaling of their relational database-based solution."

Reddit also migrated to Cassandra from memcachedb, in only 10 days, the fastest migration to Cassandra I've seen. More comments from the engineer doing the migration, ketralnis, in the reddit discussion thread.

CloudKick blogged about how they use Cassandra for time series data, including a sketch of their data model. CloudKick migrated from PostgreSQL, skewering the theory you will sometimes see proffered that "only MySQL users are migrating to NoSQL, not people who use [my favorite vendor's relational database]."

Jake Luciani wrote about how Lucandra, the Cassandra Lucene back-end works, and how he's using it to power the Twitter search app sparse.ly. IMO, Lucandra is one of Cassandra's killer apps.

The FightMyMonster team switched from HBase to Cassandra after concluding that "HBase is more suitable for data warehousing, and large scale data processing and analysis... and Cassandra is more suitable for real time transaction processing and the serving of interactive data." Dominic covers CAP, architecture considerations, benchmarks, map/reduce, and durability in explaining his conclusion.

Eric Peters gave a talk on Cassandra use at his company, Frugal Mechanic, at the Seattle Tech Startups Meetup. This was interesting not because Frugal Mechanic is a big name but because it's not. I haven't seen Eric's name on the Cassandra mailing lists at all, but there he was deploying it and giving a talk on it, showing that Cassandra is starting to move beyond early adopters. (And, just maybe, that our documentation is improving. :)

Finally, Eric Florenzano has a live demo up now of Cassandra running a Twitter clone at twissandra.com, with source at github, as an example of how to use Cassandra's data model. If you're interested in the nuts and bolts of how to build an app on Cassandra, you should check it out.

Monday, March 15, 2010

Why your data may not belong in the cloud

Several of the reports of the recently-concluded NoSQL Live event mentioned that I took a contrarian position on the "NoSQL in the Cloud" panel, arguing that traditional, bare metal servers usually make more sense. Here's why.

There are two reasons to use cloud infrastructure (and by cloud I mean here "commodity VMs such as those provided by Rackspace Cloud Servers or Amazon EC2):

  1. You only need a fraction of the capacity of a single machine
  2. Your demand is highly elastic; you want to be able to quickly spin up many new instances, then drop them when you are done
Most people looking at NoSQL solutions are doing it because their data is larger than a traditional solution can handle, or will be, so (1) is not a very strong motivation. But what about (2)? At first glance, cloud is a great fit for adding capacity to a database cluster painlessly. But there's an important difference between load like web traffic that bounces up and down frequently, and databases: with few exceptions, databases only get larger with time. You won't have 20 TB of data this week, and 2 next.

When capacity only grows in one direction it makes less sense to pay a premium for the flexibility of being able to reduce your capacity nearly instantly, especially when you also get reduced I/O performance (the most common bottleneck for databases) in the bargain because of the virtualization layer. That's why, despite working for a cloud provider, I don't think it's always a good fit for databases. (It doesn't hurt that Rackspace also offers classic bare metal hosting in the same data centers, so you can have the best of both worlds.)