Cassandra is participating in the Google Summer of Code, which opened for proposal submission today. Cassandra is part of the Apache Software Foundation, which has its own page of guidelines up for students and mentors.
We have a good mix of project ideas involving both core and non-core areas, from straightforward code bashing to some pretty tricky stuff, depending on your appetite. Core tickets aren't necessarily harder than non-core, but they will require reading and understanding more existing code.
- Create a web UI for Cassandra: we have a (fairly minimal) command line interface, but a web GUI is more user-friendly. There are the beginnings of such a beast in the Cassandra source tree at contrib/cassandra_browser [pretty ugly Python code] and a GTK-based one at http://github.com/driftx/chiton [also Python, less ugly].
- First-class command line interface: if you prefer to kick things old-school, improving the CLI itself would also be welcome.
- Create a Cassandra demo application: we have Twissandra, but we can always use more examples to introduce people to "thinking in Cassandra," which is the hardest part of using it. This one seems to be the most popular with students so far. (So stand out from the crowd, and submit something else too. :)
- Performance regression tests: pretty self-explanatory?
- System tests against multiple nodes: If GSOC were a wish-granting fairy I would probably choose this with my first wish. There are a couple of different ways you can approach this; scripting VMs is one, or you could explore the Cassandra simulator that was contributed a while ago (some TLC required).
- Hive support: Hive is a project that runs SQL queries against Hadoop map/reduce clusters. (For analytics; it is too high-latency to run applications against Hive directly).
HIVE-705 added support for backends other than HDFS, with HBase as the first. Cassandra support should be doable too now. The Hive storage backends are described in http://wiki.apache.org/hadoop/Hive/StorageHandlers and the HBase backend specifically in http://wiki.apache.org/hadoop/Hive/HBaseIntegration.
- Avro RPC support: currently Cassandra's client layer is the Thrift RPC framework, which sucks for reasons outside our scope here. We're moving to Avro, the new hotness from Doug Cutting (creator of Lucene and Hadoop, you may have heard of those). Basically this means porting org.apache.cassandra.thrift.CassandraServer to org.apache.cassandra.avro.CassandraServer; some examples are already done by Eric Evans.
- Session-level consistency: In one and two Amazon discusses the concept of "eventual consistency." Cassandra uses eventual consistency in a design similar to Dynamo. Supporting session consistency would be useful and relatively easy to add, since we already have the concept of a Memtable to "stage" updates in before flushing to disk. If we applied mutations to a session-level memtable on the coordinator machine (that is, the machine the client is connected to), and then did a final merge from that table against query results before handing them to the client, we'd get it almost for free.
- Optimize commitlog performance: this is about as low-level as you'll find in Cassandra's code base. fsync, CAS, it's all here. http://wiki.apache.org/cassandra/ArchitectureCommitLog describes the current CommitLog design.
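To make the session-level consistency idea from the list above a bit more concrete, here is a minimal sketch of the "stage, then merge" step. Everything in it is an assumption made for illustration: `SessionOverlay`, `Cell`, `write`, and `merge` are hypothetical names, not Cassandra's actual Memtable or coordinator classes, and rows are simplified to maps of column name to a timestamped string value.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical per-session overlay kept on the coordinator.
 * Not Cassandra's real Memtable; a simplified stand-in for illustration.
 */
public class SessionOverlay {

    /** A column value plus the timestamp of the write that produced it. */
    public static final class Cell {
        public final String value;
        public final long timestamp;

        public Cell(String value, long timestamp) {
            this.value = value;
            this.timestamp = timestamp;
        }
    }

    // row key -> (column name -> newest cell written during this session)
    private final Map<String, Map<String, Cell>> staged = new HashMap<>();

    /** Stage a mutation in the session table as it is forwarded to the cluster. */
    public void write(String rowKey, String column, String value, long timestamp) {
        Map<String, Cell> row = staged.computeIfAbsent(rowKey, k -> new HashMap<>());
        Cell existing = row.get(column);
        if (existing == null || timestamp >= existing.timestamp) {
            row.put(column, new Cell(value, timestamp));
        }
    }

    /**
     * Merge what the replicas returned for a row with this session's own
     * writes. For each column the cell with the newer timestamp wins, and
     * ties go to the session's write.
     */
    public Map<String, Cell> merge(String rowKey, Map<String, Cell> fromReplicas) {
        Map<String, Cell> merged = new HashMap<>(fromReplicas);
        Map<String, Cell> row = staged.get(rowKey);
        if (row != null) {
            for (Map.Entry<String, Cell> e : row.entrySet()) {
                Cell remote = merged.get(e.getKey());
                if (remote == null || e.getValue().timestamp >= remote.timestamp) {
                    merged.put(e.getKey(), e.getValue());
                }
            }
        }
        return merged;
    }
}
```

A client that writes a column and immediately reads it back through the same session would see its own write even if the replicas still return the older value. The sketch leaves out deletes, expiring staged entries once they are known to be durable, and anything about how a session is tied to a connection; a real proposal would need to address all three.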
You can comment directly on the JIRA tickets after creating an account (it's open to the public) if you're interested or have other questions. And of course feel free to propose other ideas!