Sunday, March 18, 2012

Speaking to a technical conference

I just got back from PyCon, and as with all conferences where the talks are delivered by engineers instead of professional speakers, we had a mixed bag. Some talks were great; others made me get my laptop out.

The most important important axiom is: a talk is not just an essay without random access. It's a different medium. Respect its strengths instead of wishing it were something it's not.

Here are some concrete principles that can help:

Don't read your slides

Advice often repeated, too-seldom followed. This is sometimes phrased as "make eye contact with your audience," but I've seen that second version interpreted to mean, "make eye contact while reading your slides, so your head pops up and down like a gopher poking out of its hole." So just don't read your slides, no matter what else you're doing.

Some good presenters go to extremes with this, with just one or two words per slide. This is fine as a stylistic embellishment, but not necessary for a good talk. You don't need to be that minimalistic. Just remember that with every transition, your audience will read the new slide before returning its attention to whatever you are saying. (Watch for this the next talk you attend; you will absolutely catch yourself doing it.)

Other presenters use "builds" to combat this. This can be useful in moderation, but it's more often used as a crutch, especially when presenting a list of related material. Personally, if I have an information-dense topic, like this one from my Strata talk, I'll put the whole list up at once but I'll leave the details off the slide and speak them instead.

I'm also not a fan of "presenter notes" displayed on a secondary monitor. Too often this leads to the gopher effect or to underpracticing, or both.

The one time you do want to explicitly direct attention to your slides is to explain part of it. For example, on this slide I explained that the upper right was an example of DataStax's Opscenter product interfacing with Cassandra over JMX; the upper left was jvisualvm, and so forth. Since it was a large room, I really did say things like, "in the upper right, ..." In a smaller room I like to stand close enough to the screen to just point.

Use visual aids

One of the best uses of builds is to explain a complicated diagram or sequence a piece at a time. This is difficult-to-impossible to do as effectively in prose alone. Sylvain's talk on the Cassandra storage engine at FOSDEM 2012 is a good example. Starting at about 22:00, he explains how Cassandra uses log-structured merge trees to turn random writes into sequential i/o. Compare that with the treatment in the Bigtable paper, or the original LSTM paper. Sylvain's explanation is much more clear by virtue of how it's presented.

I avoid audio or video during my presentations since using it effectively is a skill I don't yet have, but I've seen it done well by others. I can't imagine my favorite PyCon talk being as effective without the recorded demonstration at the end.

Finally, pictures can also be more effective than the spoken word at communicating humor. I'm not sure who came up with this first, but the juxtaposition here is worth well over 1000 words.

Leave them wanting more

Your goal in most public speaking is to get people interested enough to learn more on their own, not to make them experts.

One thing I struggled with early on was, how do you explain code without reading your slides? I realized that the answer was, if you're trying to explain code, you're getting too deep into the weeds. Sometimes I'll use a snippet of code to give the "flavor" of an API, but wall-of-text slides mean you're Doing It Wrong.

Another common mistake is to start your talk with an outline. (Worse: outline "progress reports" during the talk that tell the audience how far along you are.)

A much better way to get the audience engaged is to tell a story: How did you come across the problem you are solving? What makes it challenging? What promising approaches didn't actually work out, and why? This is a classic story arc that will get people interested much more than if you dive into the nuts and bolts of your solution.


Paul Graham gets this one wrong: while ad-libbing is indeed the polar opposite of reading your slides, it's also sub-optimal. You need practice to get timings right, to try out different phrasings of your thoughts, and to make transitions smooth. Don't fall for the false dichotomy that either you ad-lib or you practice all the spontaneity out; there's a happy medium in between.


Finally (last and least?), a brief note on mechanics. Stand where you can gesture freely and naturally; ideally (in a small room) next to the screen. Don't stand behind a podium. Don't speak sitting down. Pacing a little bit is good.

All these things mean: you need a slide remote. Even if you are right next to your laptop, reaching down to hit the spacebar or arrow key is distracting. But if you are doing it right you are probably not right next to your laptop. The remote included with Macs is unfortunately not enough, since it relies on infrared line-of-sight. If the conference doesn't provide one, borrow one from another presenter. If you speak frequently, it's worth the approximately $40 cost to get your own so you don't have to wrestle with unfamiliar hardware when you go live.

Good luck!

Thursday, February 02, 2012

Thinkpad 420s review

In the last three years my primary machines have run OS X, Linux, Windows, OS X, and now Windows again, in that order. The observant reader may note, "That's a lot of machines in three years." It is, but I also changed jobs twice in that time frame, so that's part of it. Another part is that I'm a bit rough with laptops; the two mac machines broke badly enough that AppleCare told me they weren't going to help. The Dell and Lenovo machines, though, outlasted my use of them. For this most recent machine, I had several requirements and several nice-to-haves, some of which were in tension. Requirements:
  • Able to drive a 30" external monitor
  • At least 8GB of RAM
  • At least 1440x900 native resolution
Nice to have:
  • Smaller than my 15" macbook pro, which is too large to use comfortably in coach on an airplane
  • Larger screen than my wife's 13.3" mbp
  • A "real" cpu, not the underclocked ones in the Macbook Airs
  • A graphics card that can do justice to Starcraft II
I wasn't picky about my operating system. Linux is by far the best experience for software development, but support for multiple monitors is still dicey, which is bad when you're relying on it to give presentations on unfamiliar projectors. OS X is superficially unix-y but lack of package management means in practice it's not really any better than Windows. Windows is ... Windows, although I'm pretty fond of the new-in-Windows-7 window management keyboard shortcuts. But fundamentally I spend 99% of my time in an IDE, a web browser, hipchat, and IRC, all of which are cross-platform. MSYS gives me about as much of the unix experience on Windows, as I got on OS X. (And ninite gives me more of a package manager than I had on OS X--granted, that isn't saying much.)
I ended up buying a 14" Thinkpad 420s. I think the S stands for "slim," and it is. My 15" mbp looks and weighs like a ton of bricks next to it, even after I swapped out the Thinkpad's dvd drive for the supplementary battery module, which weighs a little more. The Thinkpad's legendary keyboard lives up to its reputation, and I'm a huge fan of the trackpoint living right there on the home row of the keyboard, for when keyboard shortcuts aren't easily available. The cooling is excellent without the fans ever getting loud.
For the most part, I'm extremely happy with the hardware. There are two exceptions:
  • The built-in microphone is terrible. Almost without exception, people have trouble hearing me over Skype. Adding insult to injury, there is no mic input. I thought at first the headphone jack was a phone-style out-plus-in jack, but no. I'll have to get a USB mic.
  • Optimus doesn't work in one important respect: in optimus mode it won't drive my 30" monitor at full resolution; it picks something weird like 2048x1560 instead. Lenovo said they were going to fix this but hasn't, yet. To drive this monitor correctly I have to lock it to discrete graphics in the bios. In discrete mode it gets about 2h 45m battery life even with the CPU downclocked and the display dim. So when I travel, I reboot to integrated graphics.
The main alternative I considered was the Sony Vaio Z. Ironically, I ended up going with the Thinkpad mostly because Vaio reviewers consistently called out how terrible its built-in speakers were... so I ended up with a system with a terrible mic instead.
On the software side, I'm more than happy with Windows, especially after the Steam holiday sale. I hadn't realized how many fantastic indie games are available these days. (Most recently, I highly recommend Bastion.)
The one fly in my soup is that I'd anticipated being able to run OS X in a VM for the sake of Keynote. Neither Google Docs presentations, Open/Libre Office Impress, or Powerpoint are adequate replacements. Unfortunately, the Core Image (?) APIs used by Keynote don't work under virtualization, so for now I'm still using my old mbp to create presentations, and taking them on the road with me as pdf.

Saturday, November 19, 2011

On applying for jobs

A friend asks,
If [I see a] job I could do, even though I don't meet the stated requirements, should I apply anyway?
Short answer: yes.

Longer answer: companies are all over the map here, although in general the less layers of bureaucracy there are between the team that the candidate will work with and the hiring process, the more likely the list of requirements is to be actual requirements.

How can you tell?

HR paper pushers like to think in terms of checklists because that lets them go through hundreds of resumes without any real understanding of the position, so they write ads like this one -- lots of really specific "5+ years of X," not much about what the position actually involves.

But if it's the team lead himself writing the description, which you will see at smaller companies, then you get much more about what the position involves and less checklist items, because the lead is comfortable determining competence based on skill instead of pattern matching. For a software development position, I don't care if you have a degree in CS if you can code. (Open-source contributions are a better signal for ability and passion than a degree, anyway.) My team has people with no degree, to people with PhDs.

Even when dealing with large companies, you have to factor in that people are terrible at distinguishing "want" from "need." A lot of "requirements" are really "nice-to-haves." It can be tough to tell the difference, but the better idea you have of what the job actually involves, the better you can tell which are hard requirements.

For instance: without knowing anything else about a position, my guess is that "native French speaker" really would be a hard requirement. That's not the sort of thing people tend to put down on a whim. But even then, there are shades of grey. For instance, if I were looking for a job and found a "distributed databases developer position, must know Java, be familiar with open source and be a native French speaker" then I might see if they'd give me a pass on the last part because I'm a really good fit for the rest -- and I know they're unlikely to find a lot of candidates with an exact match.

In short, you have little to lose by trying, but don't just shotgun out resumes; include a cover letter that highlights the best matches from your experience to what they are looking for. Follow up with the hiring manager if possible to ask (a) "I sent in my resume a few days ago, and I wanted to see where you are in the hiring process for this position," and if they reply that they got it but you're not a good fit, ask (b) what specifically they were looking for, so you can flesh out your intuition that much more for next time.

Good luck!

Tuesday, January 04, 2011

Apache Cassandra: 2010 in review

In 2010, Apache Cassandra increased its momentum as the leading scalable database. Here is a summary of the notable activity in three areas: code, community and controversy. As always, comments are welcome.


2010 started with the release of Cassandra 0.5, followed by 0.6 and graduation from the ASF incubator a few months later. Seven more stable releases of 0.6 proceeded, adding many features to improve operations in response to feedback from production users.

0.7 adds highly anticipated features like column value indexes, live schema updates, more efficient cluster expansion, and more control over replication, but didn't quite make it into 2010, with rc4 released on new year's 2011.

We also committed the distributed counters patchset, begun at Digg and enhanced by Twitter for their real-time analytics product. Notable as the most-involved feature discussion to date, distributed counters started with a vector clock approach, but switched to a new design by Kelvin Kakugawa after we realized vector clocks were a dead end for anything but the trivial case of monotonic-increments-by-one.

One of the biggest trends was increasing activity around Cassandra as well as in the core database itself. 2010 saw Hadoop map/reduce integration, as well as Pig support and a patch for Hive.

We also saw Lucandra, which implements a Cassandra back end for Lucene and is used in several high volume production sites, grow up into Solandra, embedding Solr and Cassandra in the same JVM for even more performance.


Cassandra hit its stride in 2010, starting with graduation from the ASF incubator in April. 2010 saw 1025 tickets resolved, nearly twice as many compared to 2009 (565).

Like many Apache projects, Cassandra has a relatively small set of committers, but a much larger group of contributors. In 2010 Cassandra passed over 100 people who have contributed at least one patch. Release manager Eric Evans put together a great way to visual this with a Code Swarm video of Cassandra development.

I started Riptano with Matt Pfeil in April to provide professional products and services around Cassandra. In October, we announced funding from Lightspeed and Sequoia. From May to December, we conducted eleven Cassandra training events in eight months, and twice that many private classes on-site with customers.

Riptano is now up to 25 employees, with offices in the San Francisco bay area, Austin, and New York, and engineers working remotely in San Antonio, France, and Belarus.

In August, Riptano and Rackspace organized a very successful inaugural Cassandra Summit, with about 200 attendees (videos available), followed by almost a full track at ApacheCon in November. Cassandra was also represented at many other conferences on multiple subjects, for several languages, and continents.


Cassandra got a lot of negative publicity when Kevin Rose blamed Cassandra for Digg v4's teething problems. However, there was no deluge of bug reports coming out of Digg's Cassandra team, and Digg engineers Arin Sarkissian and Chris Goffinet (now working on Cassandra for Twitter) got on Quora to refute the idea that Cassandra was at fault:

The whole "Cassandra to blame" thing is 100% a result of folks clinging on to the NoSQL vs SQL thing. It's a red herring.

The new version of Digg has a whole new architecture with a bunch of technologies involved. Problem is, over the last few months or so the only technological change we mentioned (blogged about etc) was Cassandra. That made it pretty easy for folks to cling on to it as the "problem".

Meanwhile, Digg competitor Reddit has continued migrating to Cassandra, crediting it with enabling their 3x traffic growth in 2010.

More importantly, 2010 saw dozens of new Cassandra deployments, including a new contender for the largest-cluster crown when Digital Reasoning announced a 400-node cluster for the US government.

We look forward to another great year in 2011!

Monday, April 26, 2010

And now for something completely different

A month ago I left Rackspace to start Riptano, a Cassandra support and services company.

I was in the unusal position of being a technical person looking for a business-savvy co-founder. For whatever reason, the converse seems a lot more common. Maybe technical people tend to sterotype softer skills as being easy.

But despite some examples to the contrary (notably for me, Josh Coates at Mozy), I found that starting a company is too hard for just one person. Unfortunately, all of my fairly slim portfolio of business guys I'd like to co-found with were unavailable. So progress was slow, until Matt Pfeil heard that I was leaving Rackspace and drove to San Antonio from Austin to talk me out of it. Not only was he not successful in talking me out of leaving, but he ended up co-founding Riptano. And here we are, with a Riptano mini-faq.

Isn't Cassandra mostly just a web 2.0 thing for ex-mysql shops?

Although most of the early adopters fit this stereotype, we're seeing interest from a lot of Oracle users and a lot of industries. Unlike many "NoSQL" databases, Cassandra doesn't drop durability (the D in ACID), and besides scalability, enterprises are very interested in our support for multiple data centers and Hadoop analytics.

Are you going to fork Cassandra?

No. Although the ASF license allows doing basically anything with the code, including creating proprietary forks, we think the track record of this strategy in the open source database world is mixed at best.

We might create a (still open-source) Cassandra distribution similar to Cloudera's Distribution for Hadoop, but the mainline Cassandra development is responsive enough that there isn't as much need for a third party to do this as there is with Hadoop.

What does Rackspace think?

Rackspace has been the primary driver of Cassandra development recently, employing (until I left) the three most active committers on the project. For the same reasons Rackspace supported Cassandra to begin with, Rackspace is excited to see Riptano help take the Cassandra ecosystem to the next level. Rackspace has invested in Riptano and has been completely supportive in every way.

Where did you get the name "Riptano?" Does it mean anything?

We took a sophisticated, augmented AI approach. By which I mean, we took a program that generated random, pronouceable strings, and put together a couple fragments that sounded good together. (This is basically the same approach we took at Mozy, only there Josh insisted on a four letter domain name which narrowed it down a lot.)

I hope it doesn't mean "your dog has bad breath" somewhere.

And yes, Riptano is on twitter.

Are you hiring?

Yes. We'll have a jobs page on the site soon. In the meantime you can email me a resume if you can't wait. Prior participation in the Apache Cassandra project is of course a huge plus.

Wednesday, April 07, 2010

Cassandra: Fact vs fiction

Cassandra has seen some impressive adoption success over the past months, leading some to conclude that Cassandra is the frontrunner in the highly scalable databases space (a subset of the hot NoSQL category). Among all the attention, some misunderstandings have been propagated, which I'd like to clear up.

Fiction: "Cassandra relies on high-speed fiber between datacenters" and can't reliably replicate between datacenters with more than a few ms of latency between them.

Fact: Cassandra's multi-datacenter replication is one of its earliest features and is by far the most battle-tested in the NoSQL space. Facebook had Cassandra deployed on east and west coast datacenters since before open sourcing it. SimpleGeo's Cassandra cluster spans 3 EC2 availability zones, and Digg is also deployed on both coasts. Claims that this can't possibly work are an excellent sign that you're reading an article by someone who doesn't know what he's talking about.

Fiction: "It’s impossible to tell when [Cassandra] replicas will be up-to-date."

Fact: Cassandra provides consistency when R + W > N (read replica count + write replica count > replication factor), to use the Dynamo vocabulary. If you do writes and reads both with QUORUM, for one example, you can expect data consistency as soon as there are enough reachable nodes for a quorum. Cassandra also provides read repair and anti-entropy, so that even reads at ConsistencyLevel.ONE will be consistent after either of these events.

Fiction: Cassandra has a small community

Fact: Although popularity has never been a good metric for determining correctness, it's true that when using bleeding edge technology, it's good to have company. As I write this late at night (in the USA), there are 175 people in the Cassandra irc channel, 60 in the HBase one, 32 in Riak's, and 15 in Voldemort's. (Six months ago, the numbers were 90, 45, and 12 for Cassandra, HBase, and Voldemort. I did not hang out in #riak yet then.) Mailing list participation tells a similar story.

It's also interesting that the creators of Thrudb and dynomite are both using Cassandra now, indicating that the predicted NoSQL consolidation is beginning.

Fiction: "Cassandra only supports one [keyspace] per install."

Fact: This has not been true for almost a year (June of 2009).

Fiction: Cassandra cannot support Hadoop, or supporting tools such as Pig.

Fact: It has always been straightforward to send the output of Hadoop jobs to Cassandra, and Facebook, Digg, and others have been using Hadoop like this as a Cassandra bulk-loader for over a year. For 0.6, I contributed a Hadoop InputFormat and related code to let Hadoop jobs process data from Cassandra as well, while cooperating with Hadoop to keep processing on the nodes that actually hold the data. Stu Hood then contributed a Pig LoadFunc, also in 0.6.

Fiction: Cassandra achieves its high performance by sacrificing reliability (alternately phrased: Cassandra is only good for data you can afford to lose)

Fact: unlike some NoSQL databases (notably MongoDB and HBase), Cassandra offers full single-server durability. Relying on replication is not sufficient for can't-afford-to-lose-data scenarios; if your data center loses power, you are highly likely to lose data if you are not syncing to disk no matter how many replicas you have, and if you run large systems in production long enough, you will realize that power outages through some combination of equipment failure and human error are not occurrences you can ignore. But with its fsync'd commitlog design, Cassandra can protect you against that scenario too.

What to do after your data is saved, e.g. backups and snapshots, is outside of my scope here but covered in the operations wiki page.

Tuesday, March 30, 2010

Cassandra in Google Summer of Code 2010

Cassandra is participating in the Google Summer of Code, which opened for proposal submission today. Cassandra is part of the Apache Software Foundation, which has its own page of guidelines up for students and mentors.

We have a good mix of project ideas involving both core and non-core areas, from straightforward code bashing to some pretty tricky stuff, depending on your appetite. Core tickets aren't necessarily harder than non-core, but they will require reading and understanding more existing code.


  • Create a web ui for cassandra: we have a (fairly minimal) command line interface, but a web gui is more user-friendly. There is the beginnings of such a beast in the Cassandra source tree at contrib/cassandra_browser [pretty ugly Python code] and a gtk-based one at [also Python, less ugly].
  • First-class commandline interface: if you prefer to kick things old-school, improving the cli itself would also be welcome.
  • Create a Cassandra demo application: we have Twissandra, but we can always use more examples to introduce people to "thinking in Casssandra," which is the hardest part of using it. This one seems to be the most popular with students so far. (So stand out from the crowd, and submit something else too. :)



  • Avro RPC support: currently Cassandra's client layer is the Thrift RPC framework, which sucks for reasons outside our scope here. We're moving to Avro, the new hotness from Doug Cutting (creator of Lucene and Hadoop, you may have heard of those). Basically this means porting org.apache.cassandra.thrift.CassandraServer to org.apache.cassandra.avro.CassandraServer; some examples are already done by Eric Evans.
  • Session-level consistency: In one and two Amazon discusses the concept of "eventual consistency." Cassandra uses eventual consistency in a design similar to Dynamo. Supporting session consistency would be useful and relatively easy to add: we already have the concept of a Memtable to "stage" updates in before flushing to disk; if we applied mutations to a session-level memtable on the coordinator machine (that is, the machine the client is connected to), and then did a final merge from that table against query results before handing them to the client, we'd get it almost for free.
  • Optimize commitlog performance: this is about as low-level as you'll find in Cassandra's code base. fsync, CAS, it's all here. describes the current CommitLog design.

You can comment directly on the JIRA tickets after creating an account (it's open to the public) if you're interested or have other questions. And of course feel free to propose other ideas!