
Merging two subversion repositories

Update: an anonymous commenter pointed out that yes, there is a (much!) better way to do this with svnadmin load --parent-dir, which is covered in the docs under "repository migration." All I can say in my defense is that it wasn't something google thought pertinent. So, for google's benefit: how to merge subversion repositories. Thanks for the pointer, anonymous!

I needed to merge team A's svn repository into team B's. I wanted to preserve the history of team A's commits as much as reasonably possible. I would have thought that someone had written a tool to do this, but I couldn't find one, so I wrote this. (Of course, now that I'm posting this, I fully expect someone to point me to a better, pre-existing implementation that I missed.)

The approach is to take a working copy of repository B, add a directory for A's code, and for each revision in A's repository, apply that to the working copy and commit it. This would be easy if svn merge would allow applying diffs from repository A into a working copy of repository B, but it does not. I can't think of a technical reason for this. (In fact, I seem to remember that early versions of the svn client did allow this, with dire warnings, but I could be mistaken and I don't have a 1.1 client around anymore.)

So I tried instead to use "svn diff |patch -p0", which worked great up until the first commit with a binary file. Oops. For the final version I ended up having to create a working copy for A, update to each revision there, then rsync to the right point in working copy B and call the "svnaddremove" script to mark files added or deleted. (This is suboptimal since we can get the exact changed paths from svn, and just copy those files over, but rsync is fast enough as long as your working copies stay in cache. The update and commit steps both consistently took longer than rsync in my timing.)
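As a rough illustration of the "exact changed paths" idea: svn log accepts -v together with --xml, and each log entry then carries a list of path elements with an action attribute. The sketch below is not part of my script. It is written for Python 3, the function name changed_paths is made up for this example, and the sample XML is hand-written rather than taken from a real repository.

```python
import xml.etree.ElementTree as ET

def changed_paths(log_xml):
    """Yield (revision, action, path) for each path in `svn log --xml -v` output."""
    root = ET.fromstring(log_xml)
    for entry in root.findall('logentry'):
        rev = int(entry.get('revision'))
        for path in entry.findall('paths/path'):
            # action is A (added), M (modified), D (deleted) or R (replaced)
            yield rev, path.get('action'), path.text

# Hand-written sample in the shape of `svn log --xml -v` output (not real data).
SAMPLE = """<log>
<logentry revision="7">
<author>alice</author>
<date>2007-01-02T03:04:05.000000Z</date>
<paths>
<path action="M">/trunk/main.py</path>
<path action="A">/trunk/logo.png</path>
<path action="D">/trunk/old.txt</path>
</paths>
<msg>example</msg>
</logentry>
</log>"""

for rev, action, path in changed_paths(SAMPLE):
    print(rev, action, path)
```

Note that these are repository paths, so they would still have to be mapped onto working copy paths before copying or deleting anything.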

My script does not try to be intelligent about copies or moves that svn knows about. Team A did not use branches or tags much, so I didn't put in the effort to deal with those the "right" way (which would be to also issue a cp/mv on B's working copy to preserve history). It also uses unix users to commit revisions under the same name as the original commit. Doing this obviously requires at least enough access to the repository server to add the right users. I used something like "svn log -q |grep ^r |awk '{print $3}' |sort |uniq |xargs -n 1 useradd" (useradd takes the login name as an argument, so the list has to go through xargs).
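The same author list can be pulled out of the XML log instead of the grep/awk pipeline. This is only a sketch, written for Python 3. The function name unique_authors and the sample data are invented for illustration, and it only collects the names; creating the accounts is left out.

```python
import xml.etree.ElementTree as ET

def unique_authors(log_xml):
    """Return the sorted set of author names found in `svn log --xml` output."""
    root = ET.fromstring(log_xml)
    authors = set()
    for entry in root.findall('logentry'):
        author = entry.find('author')
        # Some revisions have no author element at all; skip those.
        if author is not None and author.text:
            authors.add(author.text)
    return sorted(authors)

# Hand-written sample (not real data); revision 2 deliberately has no author.
SAMPLE = """<log>
<logentry revision="1"><author>bob</author><msg>a</msg></logentry>
<logentry revision="2"><msg>b</msg></logentry>
<logentry revision="3"><author>alice</author><msg>c</msg></logentry>
<logentry revision="4"><author>bob</author><msg>d</msg></logentry>
</log>"""

print(unique_authors(SAMPLE))
```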

Final note: the perl script in svnaddremove is a long way of writing "awk {print $2}", except that it preserves filenames with spaces in them. There is probably a much more clever way of doing this, too.
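For comparison, here is one way the same filtering could look in Python 3. It is a sketch, not what svnaddremove actually does. It assumes the status letter is in the first column and that the remaining flag columns are blank on "?" and "!" lines, so everything after the padding is the path. The function name and the sample output are made up. Unlike the perl version, it keeps runs of several spaces inside a filename intact, though a filename that starts with a space would still lose it.

```python
def paths_with_status(status_output, code):
    """Return the paths from `svn status` text whose first column equals `code`."""
    paths = []
    for line in status_output.splitlines():
        if line.startswith(code):
            # Drop the status letter, then the padding left by the flag columns.
            paths.append(line[1:].lstrip(' '))
    return paths

# Hand-written sample in the shape of plain `svn status` output (not real data).
SAMPLE = (
    "?       new file.txt\n"
    "!       gone.txt\n"
    "M       changed.py\n"
    "?       docs/read  me.txt\n"
)

print(paths_with_status(SAMPLE, '?'))
print(paths_with_status(SAMPLE, '!'))
```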

Here, then, is the merge script:

#!/usr/bin/python

# usage: svnimport <source wc path> <target wc path> <revstart> <revend>
# e.g. svnimport liberte-source trunk/liberte 1 2000

from subprocess import Popen, PIPE
try:
    from xml.etree import cElementTree as ET
except ImportError:
    from elementtree import ElementTree as ET
import sys, time

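def system(*args):
    # run a command and return its stdout; abort on any error output
    p = Popen(args, stdout=PIPE, stderr=PIPE)
    out, err = p.communicate()
    if err or p.returncode != 0:
        # a bare string cannot be raised, so wrap it in an exception
        raise RuntimeError('%s failed (exit %s): %s' % (args[0], p.returncode, err))
    return out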

# super-minimal log scraper
# for a better one see hgsvn's svnclient.py, http://cheeseshop.python.org/pypi/hgsvn
def parse_date(svn_date):
    date = svn_date.split('.', 2)[0]
    return time.strftime("%Y-%m-%d", time.strptime(date, "%Y-%m-%dT%H:%M:%S"))
def parse_svn_log_xml(xml):
    tree = ET.fromstring(xml)
    for entry in tree.findall('logentry'):
        d = {}
        d['revision'] = int(entry.get('revision'))
        author = entry.find('author')
        d['author'] = author is not None and author.text or None
        d['message'] = entry.find('msg').text or ""
        d['date'] = parse_date(entry.find('date').text)
        yield d
        
def edited_message(entry):
    msg = entry['message'].strip().replace('\r\n', '\n')
    addendum = '[original revision %s committed %s]' % (entry['revision'], entry['date'])
    if msg:
        return msg + '\n' + addendum
    return addendum

sourcepath, targetpath, revstart, revstop = sys.argv[1:]
# rsync foo bar and rsync foo/ bar/ are very different!
if not sourcepath.endswith('/'):
    sourcepath += '/'
if not targetpath.endswith('/'):
    targetpath += '/'

xml = system('svn', 'log', sourcepath, '--xml', '-r', '%s:%s' % (int(revstart) + 1, revstop))
for entry in parse_svn_log_xml(xml):
    revno = entry['revision']
    print 'merging revision %d by %s' % (revno, entry['author'])
    # merge in the revision
    system('svn', 'up', '-r', str(revno), sourcepath)
    print '\trsync'
    system('rsync', '-a', '--exclude=.svn', '--delete', sourcepath, targetpath)
    system('/tmp/svnaddremove', targetpath) # svn should add this.  hg already did.
    # commit as the correct author, if available
    author = entry['author']
    print '\tchown'
    system('chown', '-R', author, targetpath)
    quoted_message = edited_message(entry).replace('"', "'")
    print '\tci'
    system('su', author, '-c', 'svn ci %s -m "%s"' % (targetpath, quoted_message))
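One caveat about the last line of the script: swapping double quotes for single quotes keeps the message in one piece, but the shell that su starts will still expand things like $HOME or backticks inside it. A possible alternative, sketched here for Python 3 with a made-up helper name, is to let shlex.quote do the quoting (on Python 2 the equivalent function is pipes.quote). This is an illustration, not a tested change to the script above.

```python
import shlex

def commit_command(targetpath, message):
    """Build the shell string that would be handed to `su <author> -c ...`."""
    return 'svn ci %s -m %s' % (shlex.quote(targetpath), shlex.quote(message))

cmd = commit_command('trunk/liberte/', 'fix "quotes", $HOME and `backticks`')
print(cmd)
# Splitting the string the way a POSIX shell would gives back the original
# arguments, with the message untouched.
print(shlex.split(cmd))
```

With this, the message would not need the replace('"', "'") step at all.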

And here is svnaddremove:

#!/bin/bash

# odd, xargs is invoking svn add/rm w/ no args when grep returns no lines.
# fix that with the ifs.
# (don't use grep -q or svn gets pissed about broken pipe.)

# quote "$1" so a working copy path containing spaces still works
if svn st "$1" | grep ^\? > /dev/null; then
  svn st "$1" | perl -ne 'chomp; @Fld = split(q{ }, $_, -1); if (/^\?/) { shift @Fld; print join(q{ }, @Fld) . "\n"; }' | xargs -n 1 -i svn add "{}"
fi
if svn st "$1" | grep ^\! > /dev/null; then
  svn st "$1" | perl -ne 'chomp; @Fld = split(q{ }, $_, -1); if (/^\!/) { shift @Fld; print join(q{ }, @Fld) . "\n"; }' | xargs -n 1 -i svn rm "{}"
fi

Comments

Anonymous said…
Isn't this possible with:

$ svnadmin dump repo1 > a1.d
$ svnadmin dump repo2 > a2.d
$ svnadmin create new
$ svn mkdir file:///.../new/a1
$ svnadmin load new --parent-dir a1 < a1.d
$ svn mkdir file:///.../new/a2
$ svnadmin load new --parent-dir a2 < a2.d

, or is there any semantic difference?
Jonathan Ellis said…
You're absolutely right. Damn. :)
Anonymous said…
Can you tell me how to merge two repositories into one, but under different folders?
ex:

/
main/
trunk/
branches/
tags/
/
proj1/
trunk/
branches/
tags/

proj2/
trunk/
branches/
tags/
Cybele said…
What would you like the merged log of your 2 folders to look like?

For instance, take 2 svn repositories:
1:
Commit to proj1 on nov 29 at 8:00am for revision 1 on repo1
Commit to proj1 on nov 30 at 9:00am for revision 2 on repo1

2:
Commit to proj2 on nov 30 at 8:00am for revision 1 on repo2
Commit to proj2 on dec 1 at 11:00am for revision 2 on repo2

Once merged, this would look like:

Commit to proj1 on nov 29 at 8:00am for revision 1
Commit to proj2 on nov 30 at 8:00am for revision 2
Commit to proj1 on nov 30 at 9:00am for revision 3
Commit to proj2 on dec 1 at 11:00am for revision 4

That would mean "revision 2 on repo1/proj1" became revision 3 because of its timestamp, and "revision 1 on repo2/proj2" became revision 2. Is that what you would expect?
Anonymous said…
Ahahaha RTFM!! :)
