
Automatic project structure inference

David MacIver has an interesting blog entry up about determining logical project structure via commit logs. I was very interested because one of Cassandra's oldest issues is creating categories for our JIRA instance. (I've never been a big fan of JIRA, but you work with the tools you have. Or the ones the ASF inflicts on you, in this case.)

The desire to add extra work to issue reporting for a young project like Cassandra strikes me as slightly misguided in the first place. I have what may be an excessive aversion to overengineering, and I like to see a very clear benefit before adding complexity to anything, even an issue tracker. Still, I was curious to see what David's clustering algorithm made of things. And after pestering him to show me how to run his code, I figure I owe it to him to show my results.

In general it did a pretty good job, particularly with the mid-sized groups of files. The large groups are just noise; the small groups, well, it's not exactly a revelation that Filter and FilterTest go together. I'd be tempted to play with it more, but with only about two months and 250 commits in the Apache repo, there's not really all that much data there. (Cassandra's first two years were in an internal Facebook repository.) Still, working with data that exists as a side effect of natural activity is fascinating.
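For anyone who wants a feel for the idea without David's code, here is a minimal sketch in Python. To be clear, this is not his algorithm: it is a much simpler stand-in that links two files when the sets of commits touching them overlap enough (Jaccard similarity) and then takes connected components of that graph. The function name, the `min_similarity` threshold, and the file names in the example are all made up for illustration.

```python
from collections import defaultdict
from itertools import combinations


def cluster_by_cochange(commits, min_similarity=0.5):
    """Group files that tend to be committed together.

    commits: iterable of collections of file paths, one per commit.
    Two files are linked when the Jaccard similarity of the sets of
    commits touching them is at least min_similarity; clusters are the
    connected components of that link graph. (Hypothetical helper, not
    the algorithm from David's post.)
    """
    touched_by = defaultdict(set)  # file path -> indices of commits touching it
    for index, files in enumerate(commits):
        for path in set(files):
            touched_by[path].add(index)

    # Union-find over file paths.
    parent = {path: path for path in touched_by}

    def find(path):
        while parent[path] != path:
            parent[path] = parent[parent[path]]  # path halving
            path = parent[path]
        return path

    for a, b in combinations(sorted(touched_by), 2):
        shared = len(touched_by[a] & touched_by[b])
        total = len(touched_by[a] | touched_by[b])
        if shared / total >= min_similarity:
            parent[find(a)] = find(b)

    clusters = defaultdict(list)
    for path in sorted(touched_by):
        clusters[find(path)].append(path)
    return sorted(clusters.values())


if __name__ == "__main__":
    # Toy history; real input would come from the commit log.
    history = [
        ["Filter.java", "FilterTest.java"],
        ["Filter.java", "FilterTest.java"],
        ["Table.java"],
        ["Table.java", "TableTest.java"],
        ["Filter.java"],
    ]
    for group in cluster_by_cochange(history):
        print(group)
```

Connected components cannot overlap and tend to snowball into one big blob as soon as a few commits touch many files at once, which may be one reason large groups from a naive approach like this come out noisy.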

Comments

The noisy large groups are indeed a problem with this technique; it seems to be a limitation of the clustering algorithm I'm using.

I have a different algorithm based on it which seems to produce better results by allowing overlapping clusters, but unfortunately I can't release it yet. (It's proprietary code.) We're hoping to release it as open source at some point, but for the moment it's a no-go. Maybe it will work better for your code base when we do. :-)
