8. Git Automation

In this chapter, we’ll introduce advanced techniques for automating Git. In the first section about Git attributes, we’ll show you how to tell Git to treat certain files separately, for example, to call an external diff command on graphics.

We continue with hooks — small scripts that are executed when various git commands are called, for example to notify all developers via email when new commits arrive in the repository.

Then we’ll give a basic introduction to scripting with Git and show you useful plumbing commands.

Finally, we will introduce the powerful filter-branch command, which you can use to rewrite the project history on a large scale, for example to remove a file with a password from all commits.

8.1. Git Attributes — Treating Files Separately

Git attributes allow you to assign specific properties to individual files or groups of files so that Git treats them specially; examples include enforcing line endings or marking certain files as binary.

You can write the attributes either in the file .gitattributes or in .git/info/attributes. The latter applies only to your local repository and is not versioned by Git. A .gitattributes file is usually checked in, so all developers use these attributes. You can also store additional attribute definitions in subdirectories.

One line in this file has the format:

<pattern> <attrib1> <attrib2> ...

An example:

*.eps   binary
*.tex   -text
*.c     filter=indent

An attribute can be set (e.g. binary), unset (-text), or set to a value (filter=indent). The man page gitattributes(5) describes in detail how Git interprets the attributes.

A project that is developed in parallel on Windows and Unix machines suffers from the fact that the developers use different conventions for line endings. This is due to the operating system: Windows systems use a carriage return followed by a line feed (CRLF), while Unix-like systems use only a line feed (LF).

Suitable Git attributes let you define a consistent policy; the relevant attributes here are text and eol. The text attribute causes line endings to be "normalized": whether a developer’s editor uses CRLF or just LF, Git stores only the LF version in the blob. If you set the attribute to auto (text=auto), Git performs this normalization only if the file also looks like text.

The eol attribute, on the other hand, determines what happens during a checkout. Regardless of the user’s core.eol setting, you can specify e.g. CRLF for some files (because the format requires it).

*.txt   text
*.csv   eol=crlf

With these attributes, .txt files are always stored internally with LF and checked out with CRLF if required (platform- or user-dependent). CSV files, on the other hand, are checked out with CRLF on all platforms. (Internally, Git stores all these blobs with plain LF line endings.)

8.1.1. Filter: Smudge and Clean

Git offers a filter to "smudge" files after a checkout and to "clean" files again before a git add.

The filters do not get any arguments, only the content of the blob on standard input. The output of the program is used as the new blob.

For each filter you have to define a Smudge and a Clean command. If one of the definitions is missing or if the filter is cat, the blob is taken over unchanged.

Which filter is used for which type of files is defined by the git attribute filter. For example, to automatically indent C files correctly before a commit, you can use the following filter definitions (instead of <indent>, any other name is possible):

$ git config filter.<indent>.clean indent
$ git config filter.<indent>.smudge cat
$ echo '*.c filter=<indent>' > .git/info/attributes

To "clean up" a C file, Git now automatically calls the indent program that should be installed on standard systems.⁠[106]

8.1.2. Keywords in Files

In principle, filters make it possible to implement the well-known keyword expansion, so that e.g. $Version$ becomes $Version: v1.5.4-rc2$.

You define the filters in your configuration and then equip corresponding files with this git attribute. This works like this, for example:

$ git config filter.version.smudge ~/bin/git-version.smudge
$ git config filter.version.clean ~/bin/git-version.clean
$ echo '* filter=version' > .git/info/attributes

A filter that replaces or cleans up the $Version$ keyword could be implemented as a Perl one-liner; first the Smudge filter:

#!/bin/sh
# Smudge: expand $Version$ (or an already expanded keyword) to the current version
version=`git describe --tags`
exec perl -pe 's/\$Version(:\s[^\$]+)?\$/\$Version: '"$version"'\$/g'

And the Clean-Filter:

#!/usr/bin/perl -p
# Clean: collapse an expanded keyword back to the bare $Version$
s/\$Version: [^\$]+\$/\$Version\$/g

It is important that repeated application of such a filter does not make uncontrolled changes in the file. A double call to Smudge should be fixed by a single call to Clean.
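
The idempotence requirement can be tried out without touching a repository. The following is a minimal sketch, not the Perl filters shown above: it re-implements both directions with sed. The function names smudge and clean and the version passed as an argument are assumptions made for illustration.

```shell
#!/bin/sh
# Sketch: keyword filters as shell functions, to illustrate idempotence.
# Assumption: the version string ($1) contains no "/", "&", "\" or "$".

# smudge: expand "$Version$" (or an already expanded keyword)
# to "$Version: <version>$"
smudge() {
    sed 's/\$Version\(: [^$]*\)\{0,1\}\$/$Version: '"$1"'$/g'
}

# clean: collapse any expanded keyword back to the bare "$Version$"
clean() {
    sed 's/\$Version: [^$]*\$/$Version$/g'
}
```

Piping a line containing the bare keyword through smudge twice and then through clean once should yield the original line again, which is the property demanded above.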

8.1.2.1. Restrictions

The concept of filters in Git is intentionally kept simple and will not be expanded in future versions. The filters receive no information about the context in which Git is currently located: Is a checkout happening? A merge? A diff? They only get the blob content. So the filters should only perform context-independent manipulations.

At the time Smudge is called, the HEAD may not yet be up to date (the above filter would write an incorrect version number to the file during a git checkout, because it is called before the HEAD is moved). So the filters are not very suitable for keyword expansion.

This may annoy users who have become accustomed to this feature in other version control systems. However, there are no good arguments for such an expansion within a version control system. The internal mechanisms Git uses to check whether files have been modified are undermined (since the content always has to go through the clean filter first). Also, because of the structure of Git repositories, you can "track" a blob through the commits or trees, so if necessary you can always tell from a file’s contents which commit it belongs to.

So keyword expansion is only useful outside of Git. This is not the responsibility of Git, but a Makefile target or script. For example, a make dist can replace all occurrences of VERSION with the output of git describe --tags. Git will display the files as "changed". Once the files are distributed (e.g. as a tarball), you can clean up with git reset --hard.

Alternatively, the export-subst attribute ensures that an expansion of the form $Format:<Pretty>$ is performed, where <Pretty> must be a format that is valid for git log --pretty=format:<Pretty>, e.g. %h for the abbreviated commit hash. Git will only expand these placeholders if the file is packaged via git archive (see Sec. 6.3.2, “Creating Releases”).

8.1.3. Own Diff Programs

Git’s internal diff mechanism is very well suited for all types of plaintext. But it fails with binaries - Git just tells you whether they differ or not. However, if you have a project where you need to manage binary data, such as PDFs, OpenOffice documents, or images, it’s a good idea to define a special program that creates meaningful diffs for these files.

For example, there are antiword and pdftotext to convert Word documents and PDFs to plaintext. There are analogous scripts for OpenOffice formats. For images you can use commands from the ImageMagick suite (see also the example below). If you manage statistical data, you can plot the changed recordsets side by side. Depending on the nature of the data, there are usually adequate ways to visualize changes.

Such conversion processes are, of course, lossy: You cannot use this diff output, for example to make meaningful changes to the files in a merge conflict. But to get a quick overview of who changed what, such techniques are sufficient.

8.1.3.1. API for External Diff Programs

Git provides a simple API for custom diff filters. A diff filter is always passed the following seven arguments:

  1. path (name of the file in the Git repository)

  2. old version of the file

  3. old SHA-1 ID of the blob

  4. old file mode (Unix permissions)

  5. new version of the file

  6. new SHA-1 ID of the blob

  7. new file mode (Unix permissions)

The arguments 2 and 5 may be temporary files, which will be deleted as soon as the diff program quits again, so you don’t have to care about cleaning up.

If one of the two files does not exist (newly added or deleted), then /dev/null is passed as the file name. The corresponding blob ID is then 00000…​; this is also the case if a file does not yet exist as an object in the object database (i.e. only in the working tree or index). The diff command must be able to handle these cases accordingly.
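
How a diff command might deal with the seven arguments and the /dev/null cases can be sketched as follows. The function name describe_change is hypothetical, and the one-line summary it prints is only a stand-in for a real, format-specific comparison.

```shell
#!/bin/sh
# Sketch of an external diff command; Git calls it with seven arguments:
#   path old-file old-sha1 old-mode new-file new-sha1 new-mode
# describe_change is a hypothetical helper name.
describe_change() {
    path=$1 old_file=$2 new_file=$5

    if [ "$old_file" = "/dev/null" ]; then
        echo "$path: added"
    elif [ "$new_file" = "/dev/null" ]; then
        echo "$path: deleted"
    elif cmp -s "$old_file" "$new_file"; then
        echo "$path: unchanged"
    else
        echo "$path: changed (mode $4 -> $7)"
    fi
}

# In a real diff command, the last line would be: describe_change "$@"
```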

8.1.3.2. Configuring External Diffs

There are two ways to call an external diff program. The first method is temporary: just set the environment variable GIT_EXTERNAL_DIFF to the path to your program before calling git diff:

$ GIT_EXTERNAL_DIFF=</path/to/diff-command> git diff HEAD^

The other option is persistent, but requires some configuration. First you define your own diff command <name>:

$ git config diff.<name>.command </path/to/diff-command>

The command needs to be able to handle the seven arguments mentioned above. Now you have to use the Git attribute diff to define which diff program is called. To do this, write e.g. the following lines in the .gitattributes file:

*.jpg diff=imgdiff
*.pdf diff=pdfdiff

When you check the file in, other users must also have set corresponding commands for imgdiff or pdfdiff, otherwise they will see the regular output. If you want to set this for one repository only, write this information to .git/info/attributes.

8.1.3.3. Comparing Pictures

A common use case is images: what has changed between two versions of an image? Visualizing this is not always easy. The compare tool from the ImageMagick suite marks the places that have changed in images of the same size. You can also display the two images alternately as an animation and recognize by the "flickering" where the image has changed.

Instead, we want a program that shows the two images side by side, with a kind of "difference" displayed between them: all areas where changes have occurred are copied from the new image onto a white background. So the diff shows which areas have been added.

To do this, we save the following script under $HOME/bin/imgdiff:⁠[107]

#!/bin/sh

OLD="$2"
NEW="$5"

# "xc:none" is "nothing", equivalent to a missing image
[ "$OLD" = "/dev/null" ] && OLD="xc:none"
[ "$NEW" = "/dev/null" ] && NEW="xc:none"

exec convert "$OLD" "$NEW" -alpha off \
    \( -clone 0-1 -compose difference -composite -threshold 0 \) \
    \( -clone 1-2 -compose copy_opacity -composite \
       -compose over -background white -flatten \) \
    -delete 2 -swap 1,2 +append \
    -background white -flatten x:

Finally, we need to configure the diff command and make sure it is used by an entry in the .git/info/attributes file.

$ git config diff.imgdiff.command ~/bin/imgdiff
$ echo "*.gif diff=imgdiff" > .git/info/attributes

As an example we use the original versions of the Tux.⁠[108] First we insert the black and white Tux:

$ wget http://www.isc.tamu.edu/~lewing/linux/sit3-bw-tran.1.gif \
  -Otux.gif
$ git add tux.gif && git commit -m "add tux"

It will be replaced by a colored version in the next commit:

$ wget http://www.isc.tamu.edu/~lewing/linux/sit3-bwo-tran.1.gif \
  -Otux.gif
$ git diff

The output of the git diff command is a window with the following content: On the left the old version, on the right the new version, and in the middle a mask of those parts of the new image that are different from the old.

Figure 46. The output of git diff with the custom diff program imgdiff

The Tux example, including instructions, can also be found in a repository at: https://github.com/gitbuch/tux-diff-demo.

8.2. Hooks

Hooks provide a mechanism to "hook" into important Git commands and perform your own actions. They are usually small shell scripts that perform automated tasks, such as sending emails as soon as new commits are uploaded, or checking for whitespace errors before a commit and issuing a warning if necessary.

For hooks to be executed by Git, they must be located in the hooks/ directory in the Git directory, i.e. under .git/hooks/ or under hooks/ at the top level for bare repositories. They must also be executable.

Git automatically installs sample hooks on a git init, but these are named <hook>.sample and are therefore not executed until you rename them.

You can activate a supplied hook e.g. like this:

$ mv .git/hooks/commit-msg.sample .git/hooks/commit-msg

Hooks come in two classes: those that are executed locally (checking commit messages or patches, performing actions after a merge or checkout, etc.), and those that are executed server-side when you publish changes via git push.⁠[109]

Hooks whose name begins with pre- can often be used to decide whether or not to perform an action. If a pre-hook does not end successfully (i.e. with a non-zero exit status), the action is aborted. Technical documentation on how this works can be found in the githooks(5) man page.

8.2.1. Commits

pre-commit

Is called before the commit message is queried. If the hook terminates with a non-zero value, the commit process is aborted. The hook installed by default checks whether a newly added file has non-ASCII characters in the file name and whether there are whitespace errors in the modified files. With the -n or --no-verify option, git commit skips this hook.

prepare-commit-msg

Will be executed right before the message is displayed in an editor. Gets up to three parameters, the first of which is the file where the commit message is stored so that it can be edited. For example, the hook can add lines automatically. A non-zero exit status cancels the commit process. However, this hook cannot be skipped and therefore should not duplicate or replace the functionality of pre-commit.

commit-msg

Will be executed after the commit message is entered. The only argument is the file where the message is stored, so that it can be modified (normalization). This hook can be skipped by -n or --no-verify; if it does not terminate successfully, the commit process is aborted.

post-commit

Called after a commit has been created.

These hooks act only locally and are used to enforce certain policies regarding commits or commit messages. The pre-commit hook is especially useful for this. For example, some editors do not adequately indicate when there are spaces at the end of a line or when blank lines contain spaces. This is annoying when other developers have to clean up whitespace in addition to their regular changes. This is where Git helps with the following command:

$ git diff --cached --check
hooks.tex:82: trailing whitespace.
+ auch noch Whitespace aufräumen müssen._

The --check option lets git diff check for such whitespace errors and will only exit successfully if the changes are error-free. If you write this command in your pre-commit hook, you will always be warned if you want to check in whitespace errors. If you are quite sure, you can simply suspend the hook temporarily with git commit -n.
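
Wrapped into a hook, the check might look like the following sketch. The function name check_whitespace and the wording of the message are assumptions; only git diff --cached --check is taken from the text above.

```shell
#!/bin/sh
# Sketch of a pre-commit hook body; check_whitespace is a hypothetical name.
check_whitespace() {
    # --check exits non-zero if the staged changes contain whitespace errors
    if ! git diff --cached --check; then
        echo "Whitespace errors found; fix them or commit with -n." >&2
        return 1
    fi
}

# In .git/hooks/pre-commit, the last line would be: check_whitespace
```

Because the function call is the last command of the hook, its return value becomes the exit status of the hook, and a non-zero value aborts the commit.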

Similarly, you can also store the "Check Syntax" command for a script language of your choice in this hook. For example, the following block for Perl scripts:

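git diff --diff-filter=MA --cached --name-only |
while read -r file; do
    # only check files whose first line is exactly the Perl shebang
    if [ -f "$file" ] && [ "$(head -n 1 "$file")" = "#!/usr/bin/perl" ]; then
        perl -c "$file" || exit 1
    fi
done || exit 1   # the loop may run in a subshell; pass its failure on
true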

The names of all files modified in the index (diff filter modified and added, see also Sec. 8.3.4, “Finding Changes”) are passed to a subshell that checks per file whether the first line is a Perl script. If so, the file is checked with perl -c. If there is a syntax error in the file, the command will issue an appropriate error message, and exit 1 will terminate the hook, so Git will abort the commit process before an editor is opened to enter the commit message.

The closing true is needed e.g. if a non-perl file was edited: Then the if construct fails, the shell returns the return value of the last command, and although there is nothing to complain about, Git will not execute the commit. With the line true the hook was successful if all passes of the while loop were successful.

The hook can of course be simplified by assuming that all Perl files are present as <name>.pl. Then the following code is sufficient:

git ls-files -z -- '*.pl' | xargs -0 -n 1 perl -c

Since you might want to check only the files managed by Git, a git ls-files is better than a simple ls, because that would also list untracked files ending in .pl.

Besides checking the syntax, you can of course also use Lint style programs that check the source code for "unsightly" or non portable constructs.

Such hooks are extremely useful to avoid accidentally checking in faulty code. If warnings are inappropriate, you can always skip the hook pre-commit by using the -n option when committing.
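
The commit-msg hook described above can enforce message conventions in the same spirit. The following sketch rejects empty or overlong subject lines; the function name check_subject and the limit of 72 characters are assumptions for illustration, not Git defaults.

```shell
#!/bin/sh
# Sketch of a commit-msg hook body. check_subject and the limit of 72
# characters are assumptions for illustration, not Git defaults.
check_subject() {
    # $1: file containing the commit message
    subject=$(head -n 1 "$1")
    if [ -z "$subject" ]; then
        echo "Empty subject line." >&2
        return 1
    fi
    if [ "${#subject}" -gt 72 ]; then
        echo "Subject line longer than 72 characters." >&2
        return 1
    fi
}

# In .git/hooks/commit-msg, the last line would be: check_subject "$1"
```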

8.2.2. Server Side

The following hooks are called on the receiver side of git receive-pack after the user enters git push in the local repository.

For a push operation, git send-pack creates one packfile on the local side (see also Sec. 2.2.3, “The Object Database”), which is received by git receive-pack on the recipient side. Such a packfile contains the new values of one or more references as well as the commits required by the recipient repository to completely map the version history. The two sides negotiate which commits these are in advance (similar to a merge base).

pre-receive

The hook is called once and receives a list of changed references on standard input (see below for format). If the hook does not complete successfully, git receive-pack refuses to accept it (the whole push operation fails).

update

Is called once per changed reference and gets three arguments: the name of the reference, its old state, and the proposed new one. If the hook does not end successfully, the update of this single reference is denied (in contrast to pre-receive, where the push can only be accepted or rejected as a whole).

post-receive

Similar to pre-receive, but is called only after the references have been changed (so it has no influence on whether the packfile is accepted or not).

post-update

After all references are changed, this hook is executed once and gets the names of all changed references as arguments. But the hook is not told which state the references were in before or are in now. (You can use post-receive for this.) A typical use case is a call to git update-server-info, which is necessary if you want to provide a repository via HTTP.

8.2.2.1. The Format of the Receive Hooks

The pre-receive and post-receive hooks receive the same input on standard input. The format is the following:

<old-sha1> <new-sha1> <ref-name>

This can look like this, for example:

0000000...0000000 ca0e8cf...12b14dc refs/heads/newbranch
ca0e8cf...12b14dc 0000000...0000000 refs/heads/oldbranch
6618257...93afb8d 62dec1c...ac5373b refs/heads/master

A SHA-1 sum of all zeros means "not present". So the first line describes a reference that was not present before, while the second line means the deletion of a reference. The third line represents a regular update.

You can easily read the references with the following loop:

while read old new ref; do
  # ...
done

The variables old and new then hold the SHA-1 sums, while ref contains the name of the reference. A git log $old..$new would list all new commits. The hook’s standard output is forwarded to git send-pack on the side where git push was entered, so you can pass any error messages or reports directly to the user.
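
Building on this loop, a hook can tell the three cases from the example input apart. The sketch below is one possible way to do so; the function names is_null_id and classify_refs are hypothetical.

```shell
#!/bin/sh
# Sketch for a pre-receive or post-receive hook; the function names are
# hypothetical. An ID consisting only of zeros means "not present".
is_null_id() {
    case $1 in
        *[!0]*) return 1 ;;
        *)      return 0 ;;
    esac
}

# Read "<old> <new> <ref>" lines from standard input and report
# whether each reference was created, deleted or updated.
classify_refs() {
    while read -r old new ref; do
        if is_null_id "$old"; then
            echo "created $ref"
        elif is_null_id "$new"; then
            echo "deleted $ref"
        else
            echo "updated $ref"
        fi
    done
}

# In the hook itself, the last line would be: classify_refs
```

Testing for "only zeros" instead of comparing against a fixed 40-character string is a design choice of this sketch; it avoids hard-coding the length of the object ID.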

8.2.2.2. Sending E-Mails

A practical use of the post-receive hook is to send out emails as soon as new commits are available in the repository. You can program this yourself, of course, but there is a ready-made script that comes with Git. You can find it in the Git source directory under contrib/hooks/post-receive-email, and some distributions, such as Debian, also install it along with Git to /usr/share/doc/git/contrib/hooks/post-receive-email.

Once you have copied the hook into the hooks/ subdirectory of your bare repository and made it executable, you can adjust the configuration accordingly:

$ less config
...
[hooks]
  mailinglist = "Autor Eins <autor1@example.com>, autor2@example.com"
  envelopesender = "git@example.com"
  emailprefix = "[project] "

This means that for each push operation per reference, a mail is sent with a summary of the new commits. The mail goes to all recipients defined in hooks.mailinglist and comes from hooks.envelopesender. The subject line is prefixed with the hooks.emailprefix, so that the mail can be sorted away more easily. More options are documented in the comments of the hooks.

8.2.2.3. The Update Hook

The update hook is called for each reference individually. It is therefore particularly well suited to implement a kind of "access control" to certain branches.

In fact, the update hook is used by Gitolite (see Sec. 7.2, “Gitolite: Simple Git Hosting”) to decide whether a branch may be modified or not. Gitolite implements the hook as a Perl script that checks whether the appropriate permission is present and terminates with a zero or non-zero return value accordingly.

8.2.2.4. Deployment via Hooks

Git is a version control system and knows nothing about deployment processes. However, you can use the update hook to implement a simple deployment procedure - e.g. for web applications.

The following update hook will, if the master branch has changed, replicate the changes to /var/www/www.example.com:

# update hook arguments: $1 = ref name, $2 = old SHA-1, $3 = new SHA-1
[ "$1" = "refs/heads/master" ] || exit 0
env GIT_WORK_TREE=/var/www/www.example.com git checkout -f

So as soon as you upload new commits via git push to the server’s master branch, this hook will automatically update the web presence.
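
As githooks(5) describes it, update is invoked just before the reference is changed, so a checkout started from that hook may still see the previous state of master. A possible variant is to deploy from post-receive instead, which runs after the references have been updated. The following is only a sketch under that assumption; the function names checkout_to and deploy_if_master are hypothetical.

```shell
#!/bin/sh
# Sketch of a post-receive variant; the function names are hypothetical.

# Check out the current master into the directory given as $1.
checkout_to() {
    GIT_WORK_TREE=$1 git checkout -f master
}

# Read "<old> <new> <ref>" lines and deploy if master was pushed.
deploy_if_master() {
    while read -r old new ref; do
        if [ "$ref" = "refs/heads/master" ]; then
            checkout_to "$1" || return 1
        fi
    done
}

# In hooks/post-receive, the last line would be:
#   deploy_if_master /var/www/www.example.com
```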

8.2.3. Applying Patches

The following hooks are each called by git am when one or more patches are applied.

applypatch-msg

Is called before a Patch is applied. The hook receives as its only parameter the file where the commit message of the patch is stored. The hook can change the message if necessary. A non-zero exit status causes git am not to accept the patch.

pre-applypatch

Called after a patch has been applied, but before the change is committed. A non-zero exit status causes git am not to accept the patch.

post-applypatch

Is called after a patch has been applied.

The hooks installed by default execute the corresponding commit hooks commit-msg and pre-commit, if enabled.

8.2.4. Other Hooks

pre-rebase

Is executed before a rebase process starts. Gets as arguments the references that are also passed to the rebase command (e.g. for the git rebase master topic command, the hook gets the arguments master and topic). Based on the exit status git rebase decides whether the rebase process is executed or not.

pre-push

Is executed before a push operation starts. Receives on standard input lines of the form <local-ref> <local-sha1> <remote-ref> <remote-sha1>. If the hook does not terminate successfully, the push process is aborted.

post-rewrite

Is called by commands that rewrite commits (currently only git commit --amend and git rebase). Receives a list in the format <old-sha1> <new-sha1> on standard input.

post-checkout

Is called after a checkout. The first two parameters are the old and new reference to which HEAD points. The third parameter is a flag that indicates whether a branch has been changed (1) or individual files have been checked out (0).

post-merge

Will be executed if a merge was successfully completed. The hook gets a 1 as argument if the merge was a so called squash-merge, i.e. a merge that did not create a commit but only processed the files in the working tree.

pre-auto-gc

Is called before git gc --auto is executed. Prevents execution of the automatic garbage collection if the return value is not zero.

You can use the pre-commit, post-checkout, and post-merge hooks to teach Git "real" file permissions. This is because Git does not record a file’s complete access rights along with its contents; it only distinguishes "executable" from "non-executable".⁠[110]

The script stored in the git source directory under contrib/hooks/setgitperms.perl provides a ready-made solution that you can integrate into the above hooks. The script stores the real access rights in a .gitmeta file. If you do the read-in (option -r) in the pre-commit hook and give the hooks post-checkout and post-merge the command to write permissions (option -w), the permissions of your files should now be persistent. See the comments in the file for the exact commands.

The access rights are of course only stable between checkouts - unless you check in the .gitmeta file and force the use of the hooks, clones of this repository will of course only get the "basic" access rights.
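
The pre-push hook described above reads its input in much the same way as the receive hooks. As a minimal sketch, the following function refuses any push that targets the master branch of the remote; the name check_push and the choice of refs/heads/master as a protected branch are assumptions for illustration.

```shell
#!/bin/sh
# Sketch of a pre-push hook body; check_push and the choice of
# refs/heads/master as protected branch are assumptions.
check_push() {
    # each input line: <local-ref> <local-sha1> <remote-ref> <remote-sha1>
    while read -r local_ref local_sha remote_ref remote_sha; do
        if [ "$remote_ref" = "refs/heads/master" ]; then
            echo "Refusing to push $local_ref to $remote_ref." >&2
            return 1
        fi
    done
}

# In .git/hooks/pre-push, the last line would be: check_push
```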

8.3. Writing Your Own Git Commands

Git follows the Unix philosophy of "one tool, one job" with its division into subcommands. It also divides the subcommands into two categories: Porcelain and Plumbing.

Porcelain refers to the "good porcelain" that is taken out of the cupboard for the end user: a tidy user interface and human-readable output. Plumbing commands, on the other hand, are mainly used for "plumbing work" in scripts and have a machine-readable output (usually line by line with unique separators).

In fact, a substantial part of the Porcelain commands is implemented as shell script. They use the various plumbing commands internally, but present a comprehensible interface to the outside. The commands rebase, am, bisect and stash are just a few examples.

It is therefore useful and easy to write your own shell scripts to automate frequently occurring tasks in your workflow. These could be scripts that control the release process of the software, create automatic changelogs or other operations tailored to the project.

Writing your own git command is very easy: You just have to place an executable file in a directory of your $PATH (e.g. in ~/bin) whose name starts with git-. If you type git <command> and <command> is neither an alias nor a known command, Git will simply try to run git-<command>.

Even if you can write scripts in any language you like, we recommend using shell scripts: Not only are they easier to understand for outsiders, but above all, the typical operations used to combine Git commands - calling programs, redirecting output - are "intuitively" possible with the shell and do not require any complicated constructs, such as qx() in Perl or os.popen() in Python.

When writing shell scripts, please pay attention to POSIX compatibility!⁠[111] This includes in particular not using "bashisms" like [[ …​ ]] (the POSIX equivalent is [ …​ ]). If your script does not run without problems with Dash⁠[112], you should explicitly specify the shell used in the shebang line, e.g. via #!/bin/bash.

All scripts presented in the following section can also be found online, in the script collection for this book.⁠[113]

8.3.1. Initialization

Typically, you want to ensure that your script is executed in a repository. For necessary initialization tasks, Git offers the git-sh-setup. You should include this shell script directly after the shebang line using . (known as source in interactive shells):

#!/bin/sh

. $(git --exec-path)/git-sh-setup

Unless Git can detect a repository, git-sh-setup will abort. Also, the script will abort if it is not running at the top level in a repository. Your script will not be executed and an error message will be displayed. You can work around this behavior by setting the NONGIT_OK or SUBDIRECTORY_OK variable before the call.

Beside this initialization mechanism there are some functions available, which do frequently occurring tasks. Below is an overview of the most important ones:

cd_to_toplevel

Switches to the top level of the Git repository.

say

Outputs the arguments, unless GIT_QUIET is set.

git_editor

Opens the editor set for Git on the specified files. It’s better to use this function than to call $EDITOR "blindly"; Git itself uses $EDITOR only as a fallback.

git_pager

Opens the pager defined for Git.

require_work_tree

The function terminates with an error message if there is no working tree to the repository — this is the case with bare repositories. So you should call this function for security reasons if you want to access files from the working tree.

8.3.2. Position in the Repository

In scripts you will often need to know from which directory the script was called. The Git command rev-parse offers some options for this. The following script, stored under ~/bin/git-whereami, illustrates how to "find your way" within a repository.

#!/bin/sh

SUBDIRECTORY_OK=Yes
. $(git --exec-path)/git-sh-setup

gitdir="$(git rev-parse --git-dir)"
absolute="$(git rev-parse --show-toplevel)"
relative="$(git rev-parse --show-cdup)"
prefix="$(git rev-parse --show-prefix)"

echo "gitdir    absolute    relative    prefix"
echo "$gitdir   $absolute   $relative   $prefix"

The output looks like this:

$ git whereami
gitdir          absolute    relative    prefix
.git            /tmp/repo
$ cd very/deep
$ git whereami
gitdir          absolute    relative    prefix
/tmp/repo/.git  /tmp/repo   ../../      very/deep/

Especially important is the prefix you get via --show-prefix. If your command accepts filenames and you want to find the blobs they correspond to in the object database, you must put this prefix in front of the filename. If you are in the very/deep directory and give the script the file name README, it will find the corresponding blob in the current tree via very/deep/README.
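
The prefix handling described above can be sketched as a small helper. The function name blob_for is hypothetical; it assumes that the file is looked up in the commit HEAD and that the file name is given relative to the current directory.

```shell
#!/bin/sh
# Sketch; blob_for is a hypothetical helper name.
# Print the ID of the blob that the file $1 (relative to the current
# directory) has in the commit HEAD.
blob_for() {
    prefix=$(git rev-parse --show-prefix) || return 1
    git rev-parse --verify "HEAD:$prefix$1"
}
```

Called as blob_for README from the very/deep directory, the helper would ask Git for HEAD:very/deep/README.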

8.3.3. List References: rev-list

The core of the plumbing commands is git rev-list (revision list). Its basic function is to resolve one or more references to the SHA-1 sum(s) to which they correspond.

With git log <ref1>..<ref2> you display the commit messages from <ref1> (exclusive) to <ref2> (inclusive). The git rev-list command resolves this range to the individual commits it contains and prints them line by line:

$ git rev-list master..topic
f4a6a973e38f9fac4b421181402be229786dbee9
bb8d8c12a4c9e769576f8ddeacb6eb4eedfa3751
c7c331668f544ac53de01bc2d5f5024dda7af283

So a script that operates on one or more commits can simply pass information to rev-list, as other Git commands understand it. Your script can even handle complicated expressions.

You can use the command, for example, to check whether fast forward from one branch to another is possible. Fast forward from <ref1> to <ref2> is possible if Git can reach the commit marked by <ref1> in the commit graph of <ref2>. In other words, there is no commit reachable from <ref1> that can’t also be reached from <ref2>.

#!/bin/sh

SUBDIRECTORY_OK=Yes
. $(git --exec-path)/git-sh-setup

[ $# -eq 2 ] || { echo "usage: $(basename $0) <ref1> <ref2>"; exit 1; }

for i in $1 $2
do
    if ! git rev-parse --verify $i >| /dev/null 2>&1 ; then
        echo "Ref '$i' does not exist!" && exit 1
    fi
done

one_two=$(git rev-list $1..$2)
two_one=$(git rev-list $2..$1)

[ $(git rev-parse $1) = $(git rev-parse $2) ] \
&& echo "$1 and $2 point to the same commit!" && exit 2

[ -n "$one_two" ] && [ -z "$two_one" ] \
&& echo "FF from $1 to $2 possible!" && exit 0
[ -n "$two_one" ] && [ -z "$one_two" ] \
&& echo "FF from $2 to $1 possible!" && exit 0

echo "FF not possible! $1 and $2 are diverged!" && exit 3

The calls to rev-parse in the for loop check that the arguments are references that Git can resolve to a commit (or other database object); if this fails, the script aborts with an error message.

The output of the script could look like this:

$ git check-ff topic master
FF from master to topic possible!

For simple scripts that expect only a limited number of options and arguments, a simple evaluation of these, as in the above script, is completely sufficient. However, if you are planning a more complex project, the --parseopt mode of git rev-parse is recommended. This mode parses command line options and offers functionality similar to the C library function getopt. For details see the git-rev-parse(1) man page, section "Parseopt".

8.3.4. Finding Changes

You can tell git diff and git log to display information about the files that a commit has changed by using the --name-status option:

$ git log -1 --name-status 8c8674fc9
commit 8c8674fc954d8c4bc46f303a141f510ecf264fcd
...
M       git-pull.sh
M       t/t5520-pull.sh

Each name is preceded by one of five flags⁠[114], which are shown in the list below:

A (added)

File was added

D (deleted)

File was deleted

M (modified)

File was changed

C (copied)

File was copied

R (renamed)

File was renamed

The flags C and R are followed by a three-digit number indicating the percentage that has remained the same. So if you duplicate a file, this corresponds to the output C100. A file that is renamed and slightly modified in the same commit via git mv might show up as R094 - a 94% renaming.

$ git log -1 --name-status 0ecace728f
...
M       Makefile
R094    merge-index.c   builtin-merge-index.c
M       builtin.h
M       git.c

You can use these flags to search for commits that have changed a specific file using diff filters. For example, if you want to find out who added a file when, use the following command:

$ git log --pretty=format:'added by %an %ar' --diff-filter=A -- cache.h
added by Linus Torvalds 6 years ago

You can specify several flags to a diff filter directly after each other. The question "Who did most of the work on this file?" can often be answered by whose commits modified this file the most. This can be found out, for example, by doing the following:

$ git log --pretty=format:%an --diff-filter=M -- cache.h | \
  sort | uniq -c | sort -rn | head -n 5
    187 Junio C Hamano
    100 Linus Torvalds
     27 Johannes Schindelin
     26 Shawn O. Pearce
     24 Jeff King
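
To illustrate a diff filter with several flags, here is a small sketch in a throwaway repository. The file name notes.txt and the commit messages are made up for this example; --diff-filter=AD selects the commits that added or deleted the file and skips the one that merely modified it.

```shell
#!/bin/sh
repo=$(mktemp -d) && cd "$repo" || exit 1
git init -q .
git config user.name "Example Author"
git config user.email "author@example.com"

echo "first draft" > notes.txt
git add notes.txt && git commit -q -m "add notes"
echo "second draft" > notes.txt
git commit -q -am "edit notes"
git rm -q notes.txt && git commit -q -m "delete notes"

# A and D combined: commits that added or deleted the file, newest first
added_or_deleted=$(git log --pretty=format:%s --diff-filter=AD -- notes.txt)
echo "$added_or_deleted"

cd / && rm -rf "$repo"
```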

8.3.5. The Object Database and rev-parse

The Git command rev-parse (revision parse) is an extremely flexible tool whose task is, among other things, to translate expressions describing commits or other objects of the object database into their complete SHA-1 sum. For example, the command converts abbreviated SHA-1 sums into the unique 40-character variant:

$ git rev-parse --verify be1ca37e5
be1ca37e540973bb1bc9b7cf5507f9f8d6bce415

The --verify option is passed to make Git print an appropriate error message if the passed reference is not a valid one.

However, the command can also abbreviate a SHA-1 sum with the --short option. The default is seven characters:

$ git rev-parse --verify --short be1ca37e540973bb1bc9b7cf5507f9f8d6bce415
be1ca37

If you want to find out the name of the branch that is currently checked out (as opposed to the commit ID), use git rev-parse --symbolic-full-name HEAD.

But rev-parse (and thus all other Git commands that accept references as arguments) supports even more ways to reference objects.

<sha1>^{<type>}

Follows the reference <sha1> and resolves it to an object of type <type>. This way you can find the corresponding tree for a commit <commit> by specifying <commit>^{tree}. If you don’t specify an explicit type, the reference is resolved until Git finds an object that isn’t a tag (which is especially handy when you want to find the object a tag points to).

Many git commands do not work on a commit, but on the trees that are referenced (e.g. the git diff command, which compares files, i.e. tree entries). In the man page, these arguments are called tree-ish. Git expects arbitrary references, which can be resolved to a tree, with which the command then continues to work.

<tree-ish>:<path>

Resolves the path <path> to the corresponding referenced tree or blob (corresponds to a directory or file). The referenced object is extracted from <tree-ish>, which can be a tag, a commit or a tree.

The following example illustrates how this special syntax works: The first command extracts the SHA-1 ID of the tree referenced by HEAD. The second command extracts the SHA-1 ID of the blob corresponding to the README file at the top level of the git repository. The third command then verifies that this really is a blob.

$ git rev-parse 'HEAD^{tree}'
89f156b00f35fe5c92ac75c9ccf51f043fe65dd9
$ git rev-parse 89f156b00f:README
67cfeb2016b24df1cb406c18145efd399f6a1792
$ git cat-file -t 67cfeb2016b
blob

A git show 67cfeb2016b would now show the actual contents of the blob. By redirecting with > you can extract the blob as a file to the file system.
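
The same steps can be tried out in a throwaway repository. This is only a sketch: the file contents and the name README.old are assumptions for the illustration, and HEAD^:README stands for the README as it was in the previous commit.

```shell
#!/bin/sh
repo=$(mktemp -d) && cd "$repo" || exit 1
git init -q .
git config user.name "Example Author"
git config user.email "author@example.com"

echo "hello" > README
git add README && git commit -q -m "add README"
echo "hello again" > README
git commit -q -am "update README"

# <tree-ish>:<path> names the blob; HEAD^ is the commit before the update
blob=$(git rev-parse 'HEAD^:README')
blob_type=$(git cat-file -t "$blob")

# Extract the old version next to the current one
git show "$blob" > README.old
old_content=$(cat README.old)
new_content=$(cat README)
echo "$blob_type: '$old_content' -> '$new_content'"

cd / && rm -rf "$repo"
```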

The following script first finds the ID of the commit that last modified a particular file (the file is passed as the first argument, $1). Then the script extracts the file (with prefix, see above) from the predecessor of that commit ($ref^) and saves it in a temporary file.

Finally, Vim is called in diff mode on the file and then the file is deleted.

#!/bin/sh

SUBDIRECTORY_OK=Yes
. $(git --exec-path)/git-sh-setup

[ -z "$1" ] && echo "usage: $(basename $0) <file>" && exit 1
ref="$(git log --pretty=format:%H --diff-filter=M -1 -- "$1")"
git rev-parse --verify "$ref" >/dev/null || exit 1

prefix="$(git rev-parse --show-prefix)"
temp="$(mktemp ".diff.$ref.XXXXXX")"
git show "$ref^:$prefix$1" > "$temp"

vim -f -d "$temp" "$1"
rm "$temp"

To resolve a lot of references with rev-parse, you should do this in one program call: rev-parse will print one line for each reference. With dozens or even hundreds of references, the single call is resource-saving and therefore faster.
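
A short sketch of the difference, again in a throwaway repository with made-up commits: the first call resolves three expressions in a single process, the second variant starts one process per expression and yields the same lines.

```shell
#!/bin/sh
repo=$(mktemp -d) && cd "$repo" || exit 1
git init -q .
git config user.name "Example Author"
git config user.email "author@example.com"
git commit -q --allow-empty -m "first"
git commit -q --allow-empty -m "second"

# One process resolves all three expressions, one result per line
ids=$(git rev-parse HEAD 'HEAD^' 'HEAD^{tree}')
id_count=$(printf '%s\n' "$ids" | grep -c .)

# The same result, fetched one by one (three processes instead of one)
single=$(git rev-parse HEAD; git rev-parse 'HEAD^'; git rev-parse 'HEAD^{tree}')
[ "$ids" = "$single" ] && same=yes || same=no
echo "$id_count ids, identical: $same"

cd / && rm -rf "$repo"
```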

8.3.6. Iterating References: for-each-ref

A common task is to iterate over references. For this, Git provides the general-purpose command for-each-ref. The general syntax is git for-each-ref --format=<format> <pattern>. You can use the pattern to restrict the references to be iterated, e.g. refs/heads or refs/tags. With the format expression you specify which properties of the reference should be output. It consists of fields of the form %(fieldname), which are expanded to the corresponding values in the output.

refname

Name of the reference, e.g. refs/heads/master. The suffix :short shows the short form, i.e. master.

objecttype

Type of object (blob, tree, commit or tag)

objectsize

Object size in bytes

objectname

Commit ID or SHA-1 sum

upstream

Remote-tracking branch of the upstream branch

Here is a simple example of how to display the SHA-1 sums of all release candidates of version 1.7.1:

$ git for-each-ref --format='%(objectname)--%(objecttype)--%(refname:short)' \
  refs/tags/v1.7.1-rc*
bdf533f9b47dc58ac452a4cc92c81dc0b2f5304f--tag--v1.7.1-rc0
d34cb027c31d8a80c5dbbf74272ecd07001952e6--tag--v1.7.1-rc1
03c5bd5315930d8d88d0c6b521e998041a13bb26--tag--v1.7.1-rc2

Note that the separators "--" are copied to the output unchanged, so you can use arbitrary additional characters for formatting.

Depending on the object type, other field names are also available. For a tag, for example, there is the tagger field, which contains the tag author, their e-mail address and the date. In addition, the fields taggername, taggeremail and taggerdate are available, each containing only the name, the e-mail address or the date.

For example, if you want to know for a project who ever created a tag:

$ git for-each-ref --format='%(taggername)' refs/tags | sort -u
Junio C Hamano
Linus Torvalds
Pat Thoyts
Shawn O. Pearce

As a further interface for scripting languages, the options --shell, --python, --perl and --tcl are available. With them, the fields are formatted as string literals of the respective language, so that the output can be evaluated via eval and turned into variables:

$ git for-each-ref --shell --format='ref=%(refname)' refs/tags/v1.7.1.*
ref='refs/tags/v1.7.1.1'
ref='refs/tags/v1.7.1.2'
ref='refs/tags/v1.7.1.3'
ref='refs/tags/v1.7.1.4'

This can be used to write the following script, which prints a summary of all branches that have an upstream branch - including SHA-1 sum of the most recent commit, its author, and tracking status. The output is very similar to git branch -vv, but a bit more readable. The authorname field contains the name of the commit author, similar to taggername. The core is the eval "$data" statement, which translates the line-by-line output of for-each-ref into the variables used later.

#!/bin/sh
SUBDIRECTORY_OK=Yes
. $(git --exec-path)/git-sh-setup

git for-each-ref --shell --format=\
"refname=%(refname:short) "\
"author=%(authorname) "\
"sha1=%(objectname) "\
"upstream=%(upstream:short)" \
refs/heads | while read data
do
    eval "$data"
    if [ -n "$upstream" ] ; then
        ahead=$(git rev-list $upstream..$refname | wc -l)
        behind=$(git rev-list $refname..$upstream | wc -l)
        echo $refname
        echo --------------------
        echo     "    Upstream:    "$upstream
        echo     "    Last author: "$author
        echo     "    Commit-ID    "$(git rev-parse --short $sha1)
        echo -n  "    Status:      "
        [ $ahead  -gt 0 ] && echo -n "ahead:"$ahead" "
        [ $behind -gt 0 ] && echo -n "behind:"$behind" "
        [ $behind -eq 0 ] && [ $ahead -eq 0 ] && echo -n "in sync!"
        echo
    fi
done

The output will look like this:

$ git tstatus
maint
--------------------
    Upstream:    origin/maint
    Last author: João Britto
    Commit-ID    4c007ae
    Status:      in sync!
master
--------------------
    Upstream:    origin/master
    Last author: Junio C Hamano
    Commit-ID    4e3aa87
    Status:      in sync!
next
--------------------
    Upstream:    origin/next
    Last author: Junio C Hamano
    Commit-ID    711ff78
    Status:      behind:22
pu
--------------------
    Upstream:    origin/pu
    Last author: Junio C Hamano
    Commit-ID    dba0393
    Status:      ahead:43 behind:126

The other field names as well as examples can be found in the git-for-each-ref(1) man page.

8.3.7. Rewrite References: git update-ref

If you use for-each-ref, you usually want to edit references as well, so the update-ref command deserves a mention here. With it you can create references and safely change or delete them. Basically git update-ref works with two or three arguments:

git update-ref <ref> <new-value> [<oldvalue>]

Here is an example that moves the master to HEAD^ if it points to HEAD:

$ git update-ref refs/heads/master HEAD^ HEAD

Or to create a new reference topic at ea0ccd3:

$ git update-ref refs/heads/topic ea0ccd3

To delete references there is the option -d:

git update-ref -d <ref> [<oldvalue>]

For example to delete the reference topic again:

$ git update-ref -d refs/heads/topic ea0ccd3

Of course, you could also manipulate the references with commands like echo <sha> > .git/refs/heads/<ref>, but update-ref brings various safeguards and helps to minimize possible damage. The addition <oldvalue> is optional, but helps to avoid programming errors. It also takes care of special cases (symlinks whose target is inside or outside the repository, references pointing to other references, etc.). An additional advantage is that git update-ref automatically makes entries in the reflog, which makes troubleshooting much easier.
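
The safeguard provided by <oldvalue> can be seen in a small experiment. This is a sketch in a throwaway repository; the branch name topic and the two empty commits are assumptions for the illustration. The first update names the wrong old value and is refused, the second names the correct one and goes through.

```shell
#!/bin/sh
repo=$(mktemp -d) && cd "$repo" || exit 1
git init -q .
git config user.name "Example Author"
git config user.email "author@example.com"
git commit -q --allow-empty -m "first"
git commit -q --allow-empty -m "second"
first=$(git rev-parse 'HEAD^')
second=$(git rev-parse HEAD)

# Create refs/heads/topic pointing at the second commit
git update-ref refs/heads/topic "$second"

# Wrong <oldvalue>: topic points to $second, not $first, so Git refuses
if git update-ref refs/heads/topic "$first" "$first" 2>/dev/null; then
    guarded=moved
else
    guarded=refused
fi

# Correct <oldvalue>: the update goes through
git update-ref refs/heads/topic "$first" "$second"
topic_now=$(git rev-parse refs/heads/topic)
echo "guarded update: $guarded"

cd / && rm -rf "$repo"
```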

8.3.8. Extended Aliases

For a mere one-liner it is usually not worthwhile to create a separate script. Git aliases were developed for this use case. For example, you can call external programs by prefixing them with an exclamation mark, for instance to call gitk --all simply as git k:

$ git config --global alias.k '!gitk --all'

Another example, which deletes all branches already merged and uses a concatenation of commands for this is:

prune-local = !git branch --merged | grep -v ^* | xargs git branch -d

With certain constructs, you may want to rearrange the arguments passed to the alias or use them within a command chain. The following trick is suitable for this, where a shell function is built into the alias:

$ git config --global alias.demo '!f(){ echo $2 $1 ; }; f'
$ git demo foo bar
bar foo

This allows even more complex one-liners to be defined elegantly as aliases. The following construction determines, for a given file, which authors made how many commits that changed it. If you send patches to the Git project’s mailing list, you are asked to CC the main authors of the files you changed. Use this alias to find out who they are.

who-signed = "!f(){ git log -- $1 | \
    grep Signed-off-by | sort | uniq --count | \
    sort --human-numeric-sort --reverse |\
    sed 's/Signed-off-by: / /' | head ; } ; f "

There are some things to consider here: A shell alias is always executed from the top-level directory of the repository, so the argument must contain the path relative to the repository root. The alias also relies on everyone involved having signed off their commits with a Signed-off-by line, because these lines are used to generate the statistics. Since the alias is spread over several lines, it must be enclosed in quotes, otherwise Git cannot interpret it correctly. The final call to head limits the output to the top ten entries:

$ git who-signed Documentation/git-svn.txt
     46      Junio C Hamano <gitster@pobox.com>
     30      Eric Wong <normalperson@yhbt.net>
     27      Junio C Hamano <junkio@cox.net>
      5      Jonathan Nieder <jrnieder@uchicago.edu>
      4      Yann Dirson <ydirson@altern.org>
      4      Shawn O. Pearce <spearce@spearce.org>
      3      Wesley J. Landaker <wjl@icecavern.net>
      3      Valentin Haenel <valentin.haenel@gmx.de>
      3      Ben Jackson <ben@ben.com>
      3      Adam Roben <aroben@apple.com>

Further interesting ideas and suggestions can be found in the Git-Wiki on the page about aliases.⁠[115]

8.4. Rewriting Version History

The previously introduced git rebase command and its interactive mode allow developers to edit commits at will. Code that is still in development can be "cleaned up" before it is integrated (e.g. via merge) and thus permanently merged with the software.

But what if all commits are to be changed afterwards, or at least a large part of them? Such requirements arise, for example, when a previously private project is to be published, but sensitive data (keys, certificates, passwords) are included in the commits.

Git offers the filter-branch command to automate this task. Basically, it works like this: You specify a set of references that Git should rewrite. You also define commands that are responsible for modifying the commit message, tree contents, commits, etc. Git goes through each commit and applies each filter to the part it is responsible for. The filters are executed via eval in the shell, so they can be complete commands or names of scripts. The following list describes the filters that Git offers:

--env-filter

Can be used to adjust the environment variables under which the commit is rewritten. Especially the variables GIT_{AUTHOR,COMMITTER}_{NAME,EMAIL,DATE} can be exported with new values if needed.

--tree-filter

Creates a checkout for each commit to be rewritten, changes to that directory and executes the filter. Afterwards, new files are automatically added, missing ones are removed, and all changes are recorded in the rewritten commit.

--index-filter

Manipulates the index. Behaves similarly to the tree filter, except that Git doesn’t create a checkout, which makes the index filter faster.

--msg-filter

Receives the commit message on standard input and prints the new message on standard output.

--commit-filter

Is called instead of git commit-tree and can thus in principle make several commits from one. See the man page for details.

--tag-name-filter

Is called for all tag names that point to a commit that has been rewritten. If you use cat as the filter, the tags keep their names and are moved to the rewritten commits.

--subdirectory-filter

Only considers the commits that modify the specified directory. The rewritten history will contain only this directory, as the topmost directory in the repository.

The general syntax of the command is: git filter-branch <filter> -- <references>. Here <references> is passed on to rev-list, so it can be one or more branch names, an expression of the form <ref1>..<ref2> or simply --all for all references. Note the double dash --, which separates the arguments for filter-branch from those for rev-list!

As soon as one of the filters does not end with the return value zero on a commit, the whole rewrite process will abort. So be careful to catch possible error messages or ignore them by appending || true.

The original references are stored under refs/original/, so when you rewrite the master branch, refs/original/refs/heads/master still points to the original, unrewritten commit (and thus to its predecessors). If this backup reference already exists, the filter-branch command will refuse to rewrite the reference unless you specify the -f option (force).

You should always do your filter-branch experiments in a fresh clone. The risk of causing damage through an unfortunate typo is not negligible. If you like the result, you can easily make the new repository the main repository and keep the old one as a backup.

The following examples deal with some typical use cases of the filter-branch command.

8.4.1. Removing Sensitive Information Afterwards

Ideally, sensitive data such as keys, certificates or passwords are not part of a repository. Large binary files or other junk data also inflate the size of the repository unnecessarily.

Open source software, the use of which is permitted, but the distribution of which is prohibited by license terms ('no distribution'), may of course not appear in a repository that you make available to the public.

In all these cases you can rewrite the project history so that nobody can find out that the corresponding data ever appeared in the version history of the project.

If you are working with git tags, it is always a good idea to pass the --tag-name-filter cat argument as well, so that tags pointing to commits to be rewritten will also point to the new version.

To delete only some files or subdirectories from the entire project history, use a simple index filter. All you have to do is tell Git to remove the corresponding entries from the index:

$ git filter-branch --index-filter \
  'git rm --cached --ignore-unmatch <file>' \
  --prune-empty -- --all

The --cached and --ignore-unmatch arguments tell git rm to remove only the index entry, and not to abort with an error if the corresponding entry does not exist (e.g. because the file was not added until a particular commit). If you want to delete directories, you must also specify -r.

The argument --prune-empty makes sure that commits which do not change the tree after applying the filter are omitted. So if you have added a certificate with a commit, and this commit becomes an "empty" commit by removing the certificate, Git will omit it altogether.
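
The effect of the index filter together with --prune-empty can be checked in a throwaway repository. The following is a sketch under assumptions: the file names README and secret.key and the three commits are invented for the illustration, and only master is rewritten. The environment variable FILTER_BRANCH_SQUELCH_WARNING merely suppresses the warning that newer Git versions print before filter-branch starts.

```shell
#!/bin/sh
repo=$(mktemp -d) && cd "$repo" || exit 1
git init -q .
git symbolic-ref HEAD refs/heads/master   # fix the branch name regardless of defaults
git config user.name "Example Author"
git config user.email "author@example.com"

echo "docs" > README
git add README && git commit -q -m "add README"
echo "s3cr3t" > secret.key
git add secret.key && git commit -q -m "add key"
echo "more docs" >> README
git commit -q -am "extend README"

# Remove secret.key from the index of every commit; the commit that only
# added the key becomes empty and is dropped by --prune-empty.
FILTER_BRANCH_SQUELCH_WARNING=1 git filter-branch --index-filter \
    'git rm --cached --ignore-unmatch secret.key' \
    --prune-empty -- master >/dev/null 2>&1

commits_left=$(git rev-list master | grep -c .)
key_history=$(git log --pretty=format:%s master -- secret.key)
echo "commits on master: $commits_left"
echo "commits touching secret.key: ${key_history:-none}"

cd / && rm -rf "$repo"
```

Note that the check looks at master only: the backup reference created by filter-branch still points to the old commits, so a log over all references would continue to find the file.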

Similar to the command above, you can also move files or directories with git mv. If the operations are a bit more complex, you should consider designing several simple filters and calling them one after the other.

It is possible that a file you want to delete had a different name in the past. To check this, use the command git log --name-status --follow -- <file> to detect possible renames.

8.4.1.1. Removing Strings from Files

If you don’t want to change whole files, but only certain lines in all commits, a filter at index level is not sufficient. You must use a tree filter.

For each commit, Git will check out the relevant tree, change to the appropriate directory, and then run the filter. Any changes you make will be applied (without you having to use git add etc.).

To erase the password v3rYs3cr1T from all files and commits, the following commands are required:

$ git filter-branch --tree-filter 'git ls-files -z | \
  xargs -0 -n 1 sed -i "s/v3rYs3cr1T/PASSWORD/g" \
  2>/dev/null || true' -- master
Rewrite cbddbd3505086b79dc3b6bd92ac9f811c8a6f4d1 (142/142)
Ref 'refs/heads/master' was rewritten

The command performs an in-place replacement with sed on every file in the repository. Error messages are suppressed (2>/dev/null), and a failing sed call does not abort the filter-branch run (|| true).

After the references have been rewritten, you can use the pickaxe tool (-G<expression>, see Sec. 2.1.6, “Examining the Project History”) to verify that no commit really introduces the string v3rYs3cr1T anymore:

$ git log -p -G"v3rYs3cr1T"
# should not produce any output

Tree filters must check out the appropriate tree for each commit. This creates a considerable overhead for many commits and many files, so a filter-branch call can take a long time.

By specifying -d <path> you can instruct the command to check out the tree to <path> instead of .git-rewrite/. If you use a tmpfs here (especially /dev/shm or /tmp), the files are only held in memory, which can speed up the command call by several orders of magnitude.

8.4.1.2. Renaming a Developer

If you want to rename a developer, you can do this by changing the variable GIT_AUTHOR_NAME in an environment filter. For example like this:

$ git filter-branch -f --env-filter \
  'if [ "$GIT_AUTHOR_NAME" = "Julius Plenz" ];
  then export GIT_AUTHOR_NAME="Julius Foobar"; fi' -- master

8.4.2. Extracting a Subdirectory

The subdirectory filter allows you to rewrite the commits so that a subdirectory of the current repository becomes the new top-level directory. All other directories and the former top-level directory are dropped. Commits that have not changed anything in the new subdirectory are also dropped.

In this way, you can, for example, extract the version history of a library from a larger project. The exchange between the extracted project and the base project can work via submodules or subtree-merges (see Sec. 5.11, “Managing Subprojects”).

To split the directory t/ (containing the test suite) from the git source repository, the following command is sufficient:

$ git filter-branch --subdirectory-filter t -- master
Rewrite 2071fb015bc673d2514142d7614b56a37b3faaf2 (5252/5252)
Ref 'refs/heads/master' was rewritten

Attention: This command runs for several minutes.

8.4.3. Grafts: Subsequent Merges

Git provides a way to simulate merges via so-called graft points or grafts (to graft: to join a shoot onto another plant). Such grafts are stored line by line in the file .git/info/grafts and have the following format:

commit [parent1 [parent2 ...]]

In addition to the information that Git gets from the commit metadata, you can thus specify one or more parents for any commit.⁠[116]

Make sure to still treat the repository as a DAG and not to close any cycles: Do not define HEAD as the predecessor of the root commit! The grafts file is not part of the repository, so a git clone does not copy this information; it only helps Git find a merge base. However, when filter-branch is called, this graft information is hard-coded into the commits.

This is especially useful in two cases: if you import an old version history from a tool that cannot handle merges correctly (e.g. earlier Subversion versions), or if you want to "glue" two version histories together.

Let’s assume the development was switched to Git. But nobody has taken care of converting the old version history. So the new repository was started with an initial commit that reflected the state of the project at that time.

Meanwhile, you’ve successfully converted the old version history to Git, and now you want to append it before the initial commit (or instead). To do this, proceed as follows:

$ cd <new-repository>
$ git fetch <old-repository> master:old-master
... importing converted commits ...

You now have a multi-root repository. You then need to find the initial commit of the new repository ($old_root) and define the latest commit of the old, converted repository ($old_tip) as its predecessor:

$ old_root=`git rev-list --reverse master | head -n 1`
$ old_tip=`git rev-parse old-master`
$ echo $old_root $old_tip > .git/info/grafts

Look at the result with Gitk or a similar program. If you are satisfied, you can make the grafts permanent (all commits starting at $old_tip are rewritten). To do this, call git filter-branch without specifying any filters:

$ git filter-branch -- $old_tip..
Rewrite 1591ed7dbb3a683b9bf1d880d7a6ef5d252fc0a0 (1532/1532)
Ref 'refs/heads/master' was rewritten
$ rm .git/info/grafts

Of course you also have to delete the remaining backup references (see below).

8.4.4. Deleting Old Commits

After you have removed any sensitive data from all commits, you still need to make sure that these old commits do not reappear. In the repository you rewrote, this is done in three steps:

  1. Delete the backup references under refs/original/. You can do this with the following command:

    $ git for-each-ref --format='%(refname)' -- 'refs/original/' | \
      xargs -n 1 git update-ref -d

     If you have not yet rewritten or deleted old tags or other branches, you must of course do this first.

  2. Delete the reflog:

    $ git reflog expire --verbose --expire=now --all

  3. Delete the (orphaned) commits that are no longer reachable. The best way to do this is to use the gc option --prune, which sets how long a commit must have been unreachable before it is deleted; here: now.

    $ git gc --prune=now
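
To convince yourself that the last two steps really remove unreachable commits, you can try them in a throwaway repository. This is only a sketch: the branch scratch, the file junk.txt and the commit messages are assumptions for the illustration. The commit is first made unreachable by deleting its branch, and git cat-file -e then tests whether the object still exists.

```shell
#!/bin/sh
repo=$(mktemp -d) && cd "$repo" || exit 1
git init -q .
git symbolic-ref HEAD refs/heads/master   # fix the branch name regardless of defaults
git config user.name "Example Author"
git config user.email "author@example.com"
git commit -q --allow-empty -m "keep me"

# Create a commit on a side branch, then drop the branch
git checkout -q -b scratch
echo "to be forgotten" > junk.txt
git add junk.txt && git commit -q -m "throwaway work"
orphan=$(git rev-parse HEAD)
git checkout -q master
git update-ref -d refs/heads/scratch      # only the reflog still knows the commit

git cat-file -e "$orphan" && before=present || before=gone

git reflog expire --expire=now --all
git gc --quiet --prune=now

git cat-file -e "$orphan" 2>/dev/null && after=present || after=gone
echo "before: $before, after: $after"

cd / && rm -rf "$repo"
```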

If other developers are working with an outdated version of the repository, they must now "migrate". It is essential that they do not use their development branches to pull old commits back into the cleaned up repository.

The best way to do this is to clone the new repository, fetch important branches from the old repository using git fetch, and rebase directly on the new commits. You can then dispose of the old commits using git gc --prune=now.


106. You can download the program indent from the GNU project from https://www.gnu.org/software/indent/.
107. The convert command is part of the ImageMagick suite. If you replace -clone 1-2 with -clone 0.2, the different areas are copied from the old image.
108. The graphics were created for the release of Kernel 2.0 by Larry Ewing and can be found at https://www.isc.tamu.edu/~lewing/linux/.
109. “Server-side” here only means that they are not executed in the local repository, but on the “opposite side”.
110. If Git were to include full permissions, then a file with the same contents would not be the same blob for two different developers using different umask(2) settings. To prevent this from happening, Git uses a simplified permission management system.
111. For example, you can have your shell scripts automatically checked at https://www.shellcheck.net/.
112. The Debian Almquist Shell (dash), a fork of the Almquist Shell, is a very small, fast shell which is POSIX compatible. It serves as the default shell /bin/sh on many modern Debian systems as well as on Ubuntu.
113. https://github.com/gitbuch/buch-scripte
114. There are other flags (U, T and B), but in practice they usually play no role.
115. https://git.wiki.kernel.org/index.php/Aliases
116. You can also specify no predecessor at all; the corresponding commit then becomes a root commit.