Git Merge 2016

A belated report back on the Git Merge 2016 conference, held over two days in April.

Roberto Tyley

Published on Saturday, 2 July 2016


The stage of Git Merge 2016 Photograph: Roberto Tyley/The Guardian

The Git Merge 2016 conference was held in New York over two days in April. The first day was the Git core contributors conference, and the second was open to the general public, taking place on the stage of the off-Broadway production of Avenue Q.

Day One: Core Contributors conference

The contributors conference was attended by five people from GitHub (including Peff, acting as MC), three each from Google and GitLab, with individual representation from Atlassian, AutoDesk, Bloomberg, Booking.com, and Twitter (all contributors present were male, sadly). The day took an unconference format: topics were suggested on a whiteboard, then discussed in order of popularity.

Big Repos and their related performance problems were the first topic suggested, which immediately needed a bit of clarification, because there are at least four ways in which a repo can be ‘big’, and they all have their own problems:

‘GitHub sees a lot of corner cases’ - a slide from Patrick Reynolds’ talk ‘Scaling at GitHub’, giving examples of some of the ways that repositories can be big

Switching Git to use a cryptographically-secure hashing function has been a topic at Git-togethers for at least half a decade - Git uses SHA-1 for its hashing function, which works brilliantly for distribution of object ids, and more or less adequately as a checksum for Git history. SHA-1 is a ‘strong’ hash, but has been considered compromised as a cryptographically secure hash for a long time.
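To make the role of the hash concrete, here is a minimal sketch of how an object id is derived for the simplest kind of Git object, a blob (a file's contents). It assumes only Python's standard `hashlib`; the function name `git_blob_id` is mine, for illustration, and is not part of Git.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 object id Git assigns to a blob with this content."""
    # Git does not hash the bare content: it first prepends a short header
    # of the form "blob <size-in-bytes>" followed by a single NUL byte.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


if __name__ == "__main__":
    # Should match: printf 'hello world\n' | git hash-object --stdin
    print(git_blob_id(b"hello world\n"))
```

Because the id is computed purely from the content, any two repos holding the same file agree on its id without coordinating, which is why the hash works so well for distributing object ids.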

As far as Git goes, it’s possible to debate whether Git needs a hash that is just ‘strong’, or actually cryptographically secure. The original line taken by Linus Torvalds was that the hash is a convenient guard against data corruption (“what you put in is what you get out”), but that in any case, once you’ve got history, it can’t be changed, because every time objects are transferred into your Git repo (with a git fetch or pull) any incoming objects that match a hash you already have will be fully checked bit-by-bit against your existing copy of the object. So no matter how weak the hash is, you can’t change history within an existing copy of a repo. This is a great property, but it’s less useful if you’ve never fetched a copy of the repo before! The weakness of the SHA-1 hash is also unhelpful when you want to cryptographically sign a commit - if you want to certify that the commit, with all its history, is trusted, and you can’t rely on the integrity of the hash, the signature isn’t secure unless it’s been generated by extracting and processing all commits and files of that history.
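The ‘existing copy wins’ property described above can be sketched with a toy in-memory object store. This is an illustration of the idea only, not Git's actual implementation: `ToyObjectStore`, `HashCollisionError` and `length_only_hash` are all hypothetical names, and the hash is deliberately terrible (just the length of the data) so that a collision is trivial to produce.

```python
from typing import Callable, Dict


class HashCollisionError(Exception):
    """Raised when incoming data has a known id but different bytes."""


class ToyObjectStore:
    """A toy content-addressed store (hypothetical, not Git's real code)."""

    def __init__(self, hash_fn: Callable[[bytes], str]) -> None:
        self._hash_fn = hash_fn
        self._objects: Dict[str, bytes] = {}

    def receive(self, data: bytes) -> str:
        """Store incoming data, refusing to overwrite a different object."""
        object_id = self._hash_fn(data)
        existing = self._objects.get(object_id)
        if existing is None:
            self._objects[object_id] = data
        elif existing != data:
            # Same id, different bytes: keep what we already hold and reject
            # the newcomer, however weak the hash function is.
            raise HashCollisionError(object_id)
        return object_id

    def read(self, object_id: str) -> bytes:
        return self._objects[object_id]


def length_only_hash(data: bytes) -> str:
    """A deliberately weak stand-in hash, so collisions are easy to make."""
    return str(len(data))


if __name__ == "__main__":
    store = ToyObjectStore(length_only_hash)
    original_id = store.receive(b"good history")
    try:
        store.receive(b"evil history")  # same length, so same 'hash'
    except HashCollisionError:
        print("collision rejected, kept:", store.read(original_id))
```

Note that the check only protects a store that already holds the genuine object; an empty store, like a fresh clone, has nothing to compare against.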

Given that, updating Git to use a stronger hash function sounds attractive, but like the attempts to sunset SHA-1 SSL certificates, it won’t happen soon. Peff outlined 8 or 9 steps that would have to be undertaken to get Git updated - starting off with the relatively easy step of unifying the code within Git itself that handles object hashes. A new format for extended Git ids would need to be agreed - would everyone just adopt SHA-256, or would the hash name (eg. “sha-256”) become a prefix of the new object id format (as used in Git LFS)? Could new ids sit alongside old? There are parts of the Git data model where object ids are represented as strings (ie within commit object headers, where you could easily add new optional ‘extended-id’ headers), and others (eg ‘tree’ objects) where they’re taken as raw binary data of precisely 160 bits. This means that ‘baking-in’ the new hash, so that it becomes a first-class citizen of Git, would inevitably break backward compatibility.
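The difference between the two representations can be shown with a rough sketch of how a tree entry and a commit's ‘tree’ header are encoded. The helper names `tree_entry` and `commit_tree_header` are hypothetical, and the length check simply stands in for the fixed-width assumption that existing Git clients make when parsing trees.

```python
import hashlib

SHA1_RAW_LENGTH = 20  # 160 bits


def tree_entry(mode: str, name: str, object_id_hex: str) -> bytes:
    """Encode one tree entry: mode, space, name, NUL byte, then the raw id."""
    raw_id = bytes.fromhex(object_id_hex)
    if len(raw_id) != SHA1_RAW_LENGTH:
        # There is no delimiter after the id, so a parser must assume a
        # fixed width; a longer id cannot be slipped in unnoticed.
        raise ValueError("tree entries hold exactly 160 bits of object id")
    return mode.encode("ascii") + b" " + name.encode("utf-8") + b"\0" + raw_id


def commit_tree_header(tree_id_hex: str) -> bytes:
    """Encode a commit's 'tree' header line, where the id is hex text."""
    return b"tree " + tree_id_hex.encode("ascii") + b"\n"


if __name__ == "__main__":
    sha1_id = hashlib.sha1(b"example").hexdigest()
    sha256_id = hashlib.sha256(b"example").hexdigest()

    print(len(tree_entry("100644", "README", sha1_id)))
    # A newline-terminated text header can carry a longer id without trouble...
    print(commit_tree_header(sha256_id))
    # ...but the fixed-width binary slot in a tree cannot.
    try:
        tree_entry("100644", "README", sha256_id)
    except ValueError as error:
        print("rejected:", error)
```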

Breaking backward compatibility is a problem given the massive installed user-base of Git clients & servers - old Git clients would report new repos as corrupt, and just blow up. As well as core-Git, significant Git libraries like libgit2 & JGit would need to be updated. Git hosting services would need to get behind the change - and they would obviously be reluctant to take on the resulting tech-support nightmare of a backward-incompatible change. See also this discussion from the Git mailing list archive.

Submodules are a little notorious in the Git world - but Stefan Beller of Google has been working to make them better! There was a fairly lively discussion of the pain-points of submodules, which gave Stefan plenty of welcome input.

submitGit and making it easier for people to contribute to Git was a topic following on directly from Git Merge 2015, where several Git developers expressed their dissatisfaction with Git’s current mailing-list based contribution process. The mailing list works well for the power-users, but it’s a method of contribution that is completely unfamiliar to the majority of today’s Git users - and these are people who could often have a good perspective on how to improve Git’s usability and documentation! There wasn’t anything like a consensus among Git core contributors to move away from the mailing list approach, but it was suggested that it might be possible to create a bridge that provided an alternative, friendlier way of sending patches to the list.

As a consequence, I developed submitGit, a one-way GitHub Pull Request -> Mailing-List tool, and announced it to the mailing list in May 2015, where it was appreciatively received. I haven’t been able to spend as much time working on submitGit as I’d like, and so it still lacks features, which limits its adoption, but it has still had a positive impact. There have been 44 contributors to Git over the past year, which makes the 23 users of submitGit a significant cohort. Pranit Bauva, a student working on Git as his Google Summer of Code project, has used submitGit for all his patch contributions, after discovering that his internet proxy blocked the email protocols necessary to use the standard Git mailing list process.

There was general agreement from those present at Git Merge 2016 that we could proceed with making submitGit a more ‘official’ tool for contribution - and I still need to complete the documentation updates to make that happen…

Day Two: Main conference

Greg Kroah-Hartman describes the Linux Kernel Development process

Greg Kroah-Hartman of the Linux Foundation opened the talks with a charismatic presentation on Linux Kernel Development and how it’s thriving with Git, with an ever-accelerating number of commits every day. As someone who was trying to free people from mailing-list based contribution, it was very interesting for me to hear his enthusiastic arguments in favour of it - their global society of experienced devs is well-served by the format and pace expectations of email (slower than IRC, giving non-English speakers the opportunity to take their time, run Google Translate, etc, when responding to messages).

Patrick Reynolds gave a very interesting talk on ‘Scaling at GitHub’ - with a large portion of the problem being the need to ensure that unusual or extreme repositories and their users didn’t take down GitHub for everyone else - certain patterns of behaviour can burn substantial CPU time! Many devs, loving GitHub, have sought to use it to store more than just code and make it into a CDN for artifacts too - this doesn’t always work out.

In the past the CocoaPods community has experienced very slow fetches and clones, caused by automatic rate limiting by GitHub to ensure stability of their service for other users. Photograph: Roberto Tyley/The Guardian

It was also great to hear more from Tim Pettersen about how Atlassian, GitHub and Microsoft have been collaborating on the open-source Git LFS project - things have come a long way since that surprising coincidence in Paris last year.

Finally, it was nice to see my project, the BFG, being mentioned in so many talks - several times for the ‘convert-to-git-lfs’ support added in v1.12.5.

Many thanks to the organisers for a great conference - and to the participants for helping to make Git even better, and more usable!
