halvbrax

Was writing some code that walked a tree in dulwich, but it was terribly slow. A regular git ls-tree -r on the same tree object however finished in under 100 ms.

The first awful lstree for dulwich

def lstree(repo, tree):
    from os.path import join
    queue = [('', tree)]

    while len(queue) > 0:
        base, tree = queue.pop()

        for entry in tree.iteritems():
            obj = repo[entry.sha]
            if isinstance(obj, Tree):
                queue.append((join(base, entry.path), obj))
            elif isinstance(obj, Blob):
                print(entry.sha, join(base, entry.path))

This version ran somewhere just short of 30s with a warm fs cache. Two take aways: something is seriously wrong and my git repo is quite the beast.

One thing the above code does is waste time reading objects that will never be used, like the blobs. Well, except for the isinstance check. A second version that avoids reading anything but trees cuts it down to 2-3 seconds.

def lstree(repo, sha):
    from os.path import join
    queue = [('', sha)]

    while len(queue) > 0:
        base, sha = queue.pop()
        tree = repo[sha]

        for entry in tree.iteritems():
            if entry.mode == 0o40000:
                queue.append((join(base, entry.path), entry.sha))
            elif entry.mode == 0o100644:
                print(entry.sha, join(base, entry.path))

With the glaring mistake out of the way it starts to get tricky to cut down the time consumption further. using tree._entries.iteritems() rather than tree.iteritems() shaves off another 100 ms but nothing exciting.

One tree stands out by taking about 70 ms alone to process. It's a folder with close to 20k subfolders. I narrow it down with some code to only read this tree and get about the same time measurement. 67 ms to load the tree and 34 more to iterate over it with a simple pass. A check with the profiler reveals that _deserialize of the Tree class stands for most of the time followed by parse_tree which is directly called by the first and oddly enough both have a call count of 3.

I will have to investigate further.