Git Book — 2. The Basics

2. The Basics

In this chapter, we’ll introduce you to the most important Git commands that you can use to manage your project files in Git. Understanding the Git object model is essential for advanced usage; we’ll cover this important concept in the second section of the chapter. While these explanations may seem overly theoretical at first, we encourage you to read them carefully. All further actions will be much easier for you with the knowledge of this background.

2.1. Git Commands

The commands you learned to get started (especially add and commit) work on the index. In the following, we will take a closer look at the index and the extended use of these commands.

2.1.1. Index

The content of files for Git resides on three levels: the working tree, the index, and the Git repository. The working tree corresponds to the files as they reside on your workstation’s file system — so if you edit files with an editor, search in them with grep, etc., you always operate on the working tree.

The repository is the repository for commits, that is, changes, with author, date, and description. The commits together make up the version history.

Unlike many other version control systems, Git now introduces a new feature, the index. It’s a somewhat elusive intermediate level between the working tree and the repository. Its purpose is to prepare commits. This means that you don’t always have to check in all the changes you have made to a file as commits.

The Git commands add and reset act (in their basic form) on the index, making changes to the index and deleting them again; only the commit command transfers the file to the repository as it is held in the index (Figure 1, “Commands add, reset and commit”).

Figure 1. Commands add, reset and commit

In the initial state, i.e. when git status outputs the message nothing to commit, the working tree and index are synchronized with HEAD. The index is therefore not “empty”, but contains the files in the same state as they are in the working tree.

Usually, the workflow is then as follows: First, you make a change to the working tree using an editor. This change is transferred to the index by add and finally saved in the repository by commit.

You can display the differences between these three levels using the diff command. A simple git diff shows the differences between the working tree and the index — the differences between the (actual) files on your working system and the files as they would be checked in if you called git commit.

The git diff --staged command, on the other hand, shows the differences between the index (also called the staging area) and the repository, that is, the differences that a commit would commit to the repository. In the initial state, when the working tree and index are in sync with HEAD, neither git diff nor git diff --staged produces output.

If you want to apply all changes to all files, there are two shortcuts: First, the -u or --update option of git add. This transfers all changes to the index, but does not yet create a commit. You can further abbreviate it with the -a or --all option of git commit. This is a combination of git add -u and git commit, which puts all changes to all files into one commit, bypassing the index. Avoid getting into the habit of using these options — they may be handy as shortcuts on occasion, but they reduce flexibility.

2.1.1.1. Word-Based Diff

An alternative output format for git diff is the so-called Word-Diff, which is available via the --word-diff option. Instead of the removed and added lines, the output of git diff shows the added (green) and removed (red) words with an appropriate syntax and color-coded.⁠^[12] This is useful when you are only changing single words in a file, for example when correcting AsciiDoc or LaTeX documents, because a diff is difficult to read if added and removed lines differ by only one word:

$ git diff
...
-   die Option `--color-words` zur Verfgung steht. Statt der entfernten
+   die Option `--color-words` zur Verfügung steht. Statt der entfernten
...

However, if you use the --word-diff option, only words that have been changed will be displayed marked accordingly; in addition, line breaks are ignored, which is also very practical because a reorientation of the words is not included as a change in the diff output:

$ git diff --word-diff
...
--color-words zur [-Verfgung-]{Verfügung} steht.
...

If you work a lot with continuous text, it is a good idea to set up an alias to abbreviate this command, so that you only have to type git dw, for example:

$ git config --global alias.dw "diff --word-diff"

2.1.2. Creating Commits Step by Step

But why create commits step-by-step — don’t you always want to check in all changes?

Yes, of course, you usually want to commit your changes completely. However, it can be useful to check them in step by step, for example, to better reflect the development history.

An example: You have worked intensively on your software project for the past three hours, but because it was so exciting, you forgot to pack the four new features into handy commits. In addition, the features are scattered over various files.

At best, you want to be selective, that is, you don’t want to commit all changes from one file, but only certain lines (functions, definitions, tests, …), and from different files.

Git’s index provides the flexibility you need for this. You collect some changes in the index and pack them into a commit — but all other changes are still preserved in the files.

We’ll illustrate this using the “Hello World!” example from the previous chapter. As a reminder, the contents of the hello.pl file

# Hello World! in Perl
print "Hello World!\n";

Now we prepare the file so that it has several independent changes that we don’t want to combine into a single commit. First, we add a shebang line at the beginning.⁠^[13] We also add a line naming the author, and the Perl statement use strict, which tells the Perl interpreter to be as strict as possible in its syntax analysis. It is important for our example that the file has been changed in several places:

#!/usr/bin/perl
# Hello World! in Perl
# Author: Valentin Haenel
use strict;
print "Hello World!\n";

With a simple git add hello.pl all new lines would be added to the index — so the state of the file in the index would be the same as in the working tree. Instead, we use the --patch option or short -p.⁠^[14] This has the effect that we are interactively asked which changes we want to add to the index. Git offers us each change one by one, and we can decide on a case-by-case basis how we want to handle them:

$ git add -p
diff --git a/hello.pl b/hello.pl
index c6f28d5..908e967 100644
--- a/hello.pl
+++ b/hello.pl
@@ -1,2 +1,5 @@
+#!/usr/bin/perl
 # Hello World! in Perl
+# Author: Valentin Haenel
+use strict;
 print "Hello World!\n";
Stage this hunk [y,n,q,a,d,/,s,e,?]?

This is where Git shows all changes, since they’re very close together in the code. If the changes are far apart or spread across different files, they’re offered separately. The term hunk refers to loosely connected lines in the source code. Some of the options we have at this point include the following:

Stage this hunk[y,n,q,a,d,/,s,e,?]?

The options are each only one letter long and difficult to remember. A small reminder is always given by [?]. We have summarized the most important options below.

`y` (yes)	Transfer the current hunk to the index.
`n` (no)	Don’t pick up the current hunk.
`q` (quit)	Do not pick up the current hunk or any of the following ones.
`a` (all)	Pick up the current hunk and all those that follow (in the current file).
`s` (split)	Try to split the current hunk.
`e` (edit)	Edit the current hunk.⁠^[15]

In the example we split the current hunk and enter s for split.

Stage this hunk [y,n,q,a,d,/,s,e,?]? [s]
Split into 2 hunks.
@@ -1 +1,2 @@
+#!/usr/bin/perl
 # Hello World! in Perl

Git confirms that the hunk was successfully split, and now offers us a diff that contains only the shebang line.⁠^[16] We specify y for yes and q for quit on the next hunk. To check if everything worked, we use git diff with the --staged option, which shows the difference between index and HEAD (the latest commit):

$ git diff --staged
diff --git a/hello.pl b/hello.pl
index c6f28d5..d2cc6dc 100644
--- a/hello.pl
+++ b/hello.pl
@@ -1,2 +1,3 @@
+#!/usr/bin/perl
 # Hello World! in Perl
 print "Hello World!\n";

To see which changes are not yet in the index, a simple call to git diff is enough to show us that — as expected — there are still two lines in the working tree:

$ git diff
diff --git a/hello.pl b/hello.pl
index d2cc6dc..908e967 100644
--- a/hello.pl
+++ b/hello.pl
@@ -1,3 +1,5 @@
 #!/usr/bin/perl
 # Hello World! in Perl
+# Author: Valentin Haenel
+use strict;
 print "Hello World!\n";

At this point we could create a commit, but for demonstration purposes we want to start from scratch. So we use git reset HEAD to reset the index.

$ git reset HEAD
Unstaged changes after reset:
M   hello.pl

Git confirms and names the files that have changes in them; in this case, it’s just the one.

The git reset command is in a sense the counterpart of git add: Instead of transferring differences from the working tree to the index, reset transfers differences from the repository to the index. Committing changes to the working tree is potentially destructive, as your changes may be lost. Therefore, this is only possible with the --hard option, which we discuss in Sec. 3.2.3, “Reset and the Index”.

If you frequently use git add -p, it is only a matter of time before you accidentally select a hunk you didn’t want. If the index was empty, this is not a problem since you can reset it to start over. It only becomes a problem if you have already recorded many changes in the index and don’t want to lose them, i.e. you remove a particular hunk from the index without wanting to touch the other hunks.

Analogous to git add -p there is the command git reset -p, which removes single hunks from the index. To demonstrate this, let’s first apply all changes with git add hello.pl and then run git reset -p.

$ git reset -p
diff --git a/hello.pl b/hello.pl
index c6f28d5..908e967 100644
--- a/hello.pl
+++ b/hello.pl
@@ -1,2 +1,5 @@
+#!/usr/bin/perl
 # Hello World! in Perl
+# Author: Valentin Haenel
+use strict;
 print "Hello World!\n";
Unstage this hunk [y,n,q,a,d,/,s,e,?]?

As in the example with git add -p, Git offers hunks one by one, but this time all the hunks in the index. Accordingly, the question is: Unstage this hunk [y,n,q,a,d,/,s,e,?]?, i.e. whether we want to remove the hunk from the index again. As before, by entering the question mark we get an extended description of the available options. At this point we press s once for split, n once for no and y once for yes. Now only the shebang line should be in the index:

$ git diff --staged
diff --git a/hello.pl b/hello.pl
index c6f28d5..d2cc6dc 100644
--- a/hello.pl
+++ b/hello.pl
@@ -1,2 +1,3 @@
+#!/usr/bin/perl
 # Hello World! in Perl
 print "Hello World!\n";

In the interactive modes of git add and git reset, you must press the Enter key after entering an option. The following configuration setting will save you this extra keystroke.

$ git config --global interactive.singlekey true

A word of warning: A git add -p may tempt you to check in versions of a file that are not executable or syntactically correct (e.g. because you forgot an essential line). So don’t rely on your commit being correct just because make — which works on working tree files! -- runs successfully. Even if a later commit fixes the problem, it will still be a problem, among other things, with automated debugging via bisect (see Sec. 4.8, “Finding Regressions — Git Bisect”).

2.1.3. Creating Commits

You now know how to exchange changes between working tree, index, and repository. Let’s turn to the git commit command, which you use to “commit” changes to the repository.

A commit keeps track of the state of all the files in your project at any given time, and also contains meta-information:⁠^[17]

Name of the authors and e-mail address
Name of the committer and e-mail address
Creation date
Commit date

In fact, the name of the author does not have to be the name of the committer (who commits). Often, commits are integrated or edited by maintainers (for example, by rebase, which also adjusts the committer information, see Sec. 4.1, “Moving commits — Rebase”). The committer information is usually of secondary importance, though — most programs only show the author and the date the commit was made.

When you create a commit, Git uses the user.name and user.email settings configured in the previous section to identify the commit.

If you call git commit without any additional arguments, Git will combine all changes in the index into one commit, and open an editor to create a commit message. However, the message will always contain instructions commented out with hash marks (#), or information about which files are changed by the commit. If you call git commit -v, you will still get a diff of the changes you will check in, below the instructions. This is especially useful for keeping track of the changes, and for using the auto-complete feature of your editor.

Once you exit the editor, Git creates the commit. If you don’t specify a commit message or delete the entire contents of the file, Git will abort and not create a commit.

If you only want to write one line, you can use the --message option, or short -m, which allows you to specify the message directly on the command line, thus bypassing the editor:

$ git commit -m "Dies ist die Commit-Nachricht"

2.1.3.1. Improving a Commit

If you rashly entered git commit, but want to make the commit slightly better, the --amend (“correct”) option helps. The option causes git to “add” the changes in the index to the commit you just made.⁠^[18] You can also customize the commit message. Note that the SHA-1 sum of the commit will change in any case.

The git commit --amend call only changes the current commit on a branch. Sec. 4.1.9, “Improving a Commit” describes how to improve past commits.

Calling git commit --amend automatically starts an editor, so you can edit the commit message as well. Often, however, you will only want to make a small correction to a file without adjusting the message. For authors, an alias fixup is useful in this situation:

$ git config --global alias.fixup "commit --amend --no-edit"

2.1.3.2. Good Commit Messages

What should a commit message look like? Not much can be changed in the outer form: The commit message must be at least one line long, but preferably no longer than 50 characters. This makes lists of commits easier to read. If you want to add a more detailed description (which is highly recommended!), separate it from the first line with a blank line. No line should be longer than 76 characters, as is usual for email.

Commit messages often follow the habits or specifics of a project. There may be conventions, such as references to the bug tracking or issue system, or a link to the appropriate API documentation.

Note the following points when writing a commit description:

Never create empty commit messages. Commit messages such as Update, Fix, Improvement, etc. are just as meaningful as an empty message — you might as well leave it at that.

Very important: Describe why something was changed and what the implications are. What has been changed is always obvious from the diff!

Be critical and note if you think there is room for improvement or the commit may introduce bugs elsewhere.

The first line should not be longer than 50 characters, so the output of the version history always remains well formatted and readable.

If the message becomes longer, a short summary (with the important keywords) should be in the first line. After a blank line follows an extensive description.

We can’t stress enough how important a good commit description is. When committing, a developer remembers the changes well, but after a few days, the motivation behind them is often forgotten. Your colleagues or project members will thank you, too, because they can commit changes much faster.

Writing a good commit message also helps to briefly reflect on what has been done and what is still to come. You may find that you’ve forgotten one important detail as you write it.

You can also argue about a timeline: The time it takes you to write a good commit message is a minute or two. But how much less time will the bug-finding process take if each commit is well documented? How much time will you save others (and yourself) if you provide a good description of a diff, which may be hard to understand? Also, the blame tool, which annotates each line of a file with the commit that last changed it, will become an indispensable tool for detailed commit descriptions (see Sec. 4.3, “Who Made These Changes? — Git Blame”).

If you are not used to writing detailed commit messages, start today. Practice makes perfect, and once you get used to it, the work will go quickly — you and others will benefit.

The Git repository is a prime example of good commit messaging. Without knowing the details of Git, you’ll quickly know who changed what and why. You can also see how many hands a commit goes through before it’s integrated.

Unfortunately, the commit messages in most projects are still very spartan, so don’t be disappointed if your peers are lazy about writing, but rather set a good example and provide detailed descriptions.

2.1.4. Moving and Deleting Files

If you want to delete or move files managed by Git, use git rm or git mv. They act like the regular Unix commands, but they also modify the index so that the action is included in the next commit.⁠^[19]

Like the standard Unix commands, git rm also accepts the -r and -f options to recursively delete or force deletion. git mv also offers an option -f (force) if the new filename already exists and should be overwritten. Both commands accept the option -n or --dry-run, which simulates the process and does not modify files.

To delete a file from the index only, use git rm --cached. It then remains in the working tree.

You will often forget to move a file via git mv or delete it via git rm, and use the standard Unix commands instead. In this case, simply mark the file (already deleted by rm) as deleted in the index, too, using git rm <file>.

To rename the file, proceed as follows: First mark the old file name as deleted using git rm <old-name>. Then add the new file: git add <new-name>. Then check via git status whether the file is marked as “renamed”.

Internally, it doesn’t matter to Git whether you move a file regularly via mv, then run git add <new-name> and git rm <old-name>. In any case, only the reference to a blob object is changed (seeSec. 2.2, “The Object Model”).

However, Git comes with a so-called Rename Detection: If a blob is the same and is only referenced by a different file name, Git interprets this as a rename. If you want to examine the history of a file and follow it if it is renamed, use the following command:

$ git log --follow -- <file>

2.1.5. Using Grep on a Repository

If you want to search for an expression in all files of your project, you can usually use grep -R <expression> ..

However, Git offers its own grep command, which you can call up using git grep <expression>. This command usually searches for the expression in all files managed by Git. If you want to examine only some of the files instead, you can specify the pattern explicitly. With the following command you can find all occurrences of border-color in all CSS files:

$ git grep border-color -- '*.css'

The grep implementation of Git supports all common flags that are also present in GNU Grep. However, calling git grep is usually an order of magnitude faster, since Git has significant performance advantages due to the object database and the multithreaded design of the command.

The popular grep alternative ack is characterized mainly by the fact that it combines the lines of a file matching the search pattern under a corresponding “heading”, and uses striking colors. You can emulate the output of ack with git grep by using the following alias:

$ git config alias.ack '!git -c color.grep.filename="green bold" \
  -c color.grep.match="black yellow" -c color.grep.linenumber="yellow bold" \
  grep -n --break --heading --color=always --untracked'

2.1.6. Examining the Project History

Use git log to examine the project’s version history. The options of this command (most of which also work for git show) are very extensive, and we will introduce the most important ones below.

Without any arguments, git log will output the author, date, commit ID, and the full commit message for each commit. This is handy when you need a quick overview of who did what and when. However, the list is a bit cumbersome when you’re looking at a lot of commits.

If you only want to look at recently created commits, limit git log’s output to n commits with the -<n> option. For example, the last four commits are shown with:

$ git log -4

To display a single commit, enter:

$ git log -1 <commit>

The <commit> argument is a legal name for a single commit, such as the commit ID or SHA-1 sum. However, if you do not specify anything, Git automatically uses HEAD. Apart from single commits, the command also understands so-called commit ranges (series of commits), see Sec. 2.1.7, “Commit-Ranges”.

The -p (--patch) option appends the full patch in Unified-Diff format below the description. Thus, a git show <commit> from the output is equivalent to git log -1 -p <commit>.

If you want to display the commits in compressed form, we recommend the --oneline option: It summarizes each commit with its abbreviated SHA-1 sum and the first line of the commit message. It is therefore important that you include as much useful information as possible in this line! For example, this would look like this:⁠^[20]

$ git log --oneline
25f3af3 Correctly report corrupted objects
786dabe tests: compress the setup tests
91c031d tests: cosmetic improvements to the repo-setup test
b312b41 exec_cmd: remove unused extern

The --oneline option is only an alias for --pretty=oneline. There are other ways to customize the output of git log. The possible values for the --pretty option are:

`oneline`	Commit-ID and first line of the description.
`short`	Commit ID, first line of the description and author of the commit; output is four lines.
`medium`	Default; output of commit ID, author, date and complete description.
`full`	Commit ID, author’s name, name of the committer and full description — no date.
`fuller`	Like `medium`, but additionally date and name of the committer.
`email`	Formats the information from `medium` so that it looks like an e-mail.
`format:⁠<string>`	Any format can be adapted by placeholders; for details see the man page `git-log(1)`, section “Pretty Formats”.

Independently of this, you can display more information about the changes made by the commit below the commit message. Consider the following examples, which clearly show which files were changed in how many places:

$ git log -1 --oneline 4868b2ea
4868b2e setup: officially support --work-tree without --git-dir

$ git log -1 --oneline --name-status 4868b2ea
4868b2e setup: officially support --work-tree without --git-dir
M       setup.c
M       t/t1510-repo-setup.sh

$ git log -1 --oneline --stat 4868b2ea
4868b2e setup: officially support --work-tree without --git-dir
 setup.c               |   19
 t/t1510-repo-setup.sh |  210 +++++++++++++++++------------------
 2 files changed, 134 insertions(), 95 deletions(-)

$ git log -1 --oneline --shortstat 4868b2ea
4868b2e setup: officially support --work-tree without --git-dir
 2 files changed, 134 insertions(+), 95 deletions(-)

2.1.6.1. Time Constraints

You can restrict the time of the commits to be displayed using the --after or --since and --until or --before options. The options are all synonymous, so they give the same results.

You can specify absolute dates in any common format, or relative dates, here are some examples:

$ git log --after='Tue Feb 1st, 2011'
$ git log --since='2011-01-01'
$ git log --since='two weeks ago' --before='one week ago'
$ git log --since='yesterday'

2.1.6.2. File-Level Restrictions

If you specify one or more file or directory names after a git log call, Git will only display the commits that affect at least one of the specified files. Provided a project is well structured, the output of commits can be severely limited and a particular change can be found quickly.

Since filenames may collide with branches or tags, you should be sure to specify the filenames after a -- which means that only file arguments follow.

$ git log -- main.c
$ git log -- *.h
$ git log -- Documentation/

These calls only output the commits in which changes were made to the main.c file, an .h file, or a file under Documentation/.

2.1.6.3. Grep for Commits

You can also search for commits in the style of grep, where the --author, --committer, and --grep options are available.

The first two options filter commits by author or committer name or address, as expected. For example, list all commits that Linus Torvalds has made since early 2010:

$ git log --since='2010-01-01' --author='Linus Torvalds'

You can also enter only part of the name or e-mail address here, so searching for 'Linus' would produce the same result.

For example, you can use --grep to search for keywords or phrases in the commit message, such as all commits that contain the word “fix” (not case-sensitive):

$ git log -i --grep=fix

The -i (or --regexp-ignore-case) option causes git log to ignore the pattern case (also works with --author and --committer).

All three options treat the values as regular expressions, just like grep (see the regex(7) man page). The -E and -F options change the behaviour of the options in the same way as egrep and fgrep: to use extended regular expressions or to search for the literal search term (whose special characters lose their meaning).

To search for changes, use the so-called Pickaxe tool. This will help you find commits whose diffs contain a certain regular expression (“grep for diffs”):

$ git log -p -G<regex>

The <regex> must be specified directly, i.e. without spaces, after the -G pickaxe option. The --pickaxe-all option causes all changes to the commit to be listed, not just those containing the change you are looking for.

Note that in earlier versions of Git, this operation was performed by the -S option, but it differs from -G in that it only finds the commits that change the number of times the pattern occurs — especially code shifts, i.e., removals and additions elsewhere in a file, are not found.

Equipped with these tools, you can now tame masses of commits yourself. Just specify as many criteria as you need to reduce the number of commits.

2.1.7. Commit-Ranges

So far, we’ve only looked at commands that require only a single commit as an argument, explicitly identified by its commit ID, or implicitly by the symbolic name HEAD, which references the most recent commit.

The git show command displays information about a commit, while the git log command starts at a commit, and then goes back in the version history until the beginning of the repository (called the root commit) is reached.

An important tool for specifying a series of commits is the so-called commit ranges in the form <commit1>..<commit2>. Since we have not yet worked with multiple branches, this is simply a range of commits in a repository, from <commit1> exclusive to <commit2> inclusive. If you omit one of the two boundaries, Git will take the value HEAD.

2.1.8. Differences between Commits

The command git show or git log -p has been used to show only the difference from the previous commit. If you want to see the differences between several commits, the command git diff.

The diff command performs several tasks. As already seen, you can examine the differences between the working tree and the index without specifying any commits, or the differences between index and HEAD with the --staged option.

However, if you pass two commits or a commit range to the command, the difference between these commits is displayed instead.

2.2. The Object Model

Git is based on a simple but extremely powerful object model. It is used to map the typical elements of a repository (files, directories, commits) and the development over time. Understanding this model is very important, and it helps to abstract from typical Git steps to better understand them.

In the following, we will again use a “Hello World!” program as an example, this time in the Python programming language.⁠^[21]

Figure 2. “Hello World!” Program in Python

The project consists of the file hello.py as well as a README file and a directory test. If you run the program with the command python hello.py, you will get the output: Hello World!. In the directory test is a simple shell script, test.sh, which displays an error message if the Python program does not output the string Hello World! as expected.

The repository for this project consists of the following four commits:

$ git log --oneline
e2c67eb Kommentar fehlte
8e2f5f9 Test Datei
308aea1 README Datei
b0400b0 Erste Version

2.2.1. SHA-1 — The Secure Hash Algorithm

SHA-1 is a secure hash algorithm that calculates a checksum of digital information: the SHA-1 sum. The algorithm was introduced in 1995 by the American National Institute of Standards and Technology (NIST) and the National Security Agency (NSA). SHA-1 was developed for cryptographic purposes and is used for checking the integrity of messages and as a basis for digital signatures. Figure 3, “SHA-1 Algorithm” shows how it works, where we calculate the checksum of hello.py.

The algorithm is a mathematical one-way function that maps a bit sequence of maximum length 2⁶⁴-1 bits (about 2 exbibytes) to a checksum of length 160 bits (20 bytes). The checksum is usually represented as a hexadecimal character string of length 40. The algorithm results in 2¹⁶⁰ (approx. 1.5 · 10⁴⁹) different combinations for this length of checksum, and therefore it is very, very unlikely that two bit sequences have the same checksum. This property is called collision safety.

Figure 3. SHA-1 Algorithm

Despite all efforts of cryptologists, several years ago various theoretical attacks on SHA-1 became known, which are supposed to make the generation of collisions possible with a considerable computing effort.⁠^[22] For this reason, NIST today recommends the use of the successors of SHA-1: SHA-256, SHA-384 and SHA-512, which have longer checksums and thus make the generation of collisions more difficult. On the Git mailing list there was a debate about switching to one of these alternatives, but this step was not considered necessary.⁠^[23]

This is because, although there is a theoretical attack vector on the SHA-1 algorithm, this does not compromise the security of Git. In fact, the integrity of a repository is not primarily protected by the collision resistance of an algorithm, but by the fact that many developers have identical copies of the repository.

The SHA-1 algorithm plays a central role in Git because it is used to build checksums of the data stored in the Git repository, the Git objects. This makes them easy to reference as SHA-1 sums of their contents. In your daily work with Git, you will usually only use SHA-1 sums of commits, known as commit IDs. This reference can be passed to many Git commands, such as git show and git diff. Depending on the repository, you often only need to specify the first few characters of an SHA-1 sum, since in practice a prefix is sufficient to uniquely identify a commit.

2.2.2. The Git Objects

All data stored in a Git repository is available as Git objects. There are four types:⁠^[24]

Table 1. Git Objects
Object	Saves…	References other objects	Correspondence
Blob	File content	No	File
Tree	Blobs and Trees	Yes	Directory
Commit	Project state	Yes, a tree and further commits	Snapshot/Archive at a time
Tag	Tag information	Yes, an object	Naming important snapshots or blobs

Figure 4, “Git Objects” shows three objects from the example project — a blob, a tree, and a commit.⁠^[25] The representation of each object includes the object type, the size in bytes, the SHA-1 sum, and the contents. The blob contains the content of the file hello.py (but not the file name). The tree contains references to one blob for each file in the project, i.e. one for hello.py and one for README, plus one tree per subdirectory, i.e. in this case only one for test. The files in the subdirectories are referenced separately in the respective trees that map these subdirectories.

Figure 4. Git Objects

So the commit object contains exactly one reference to a tree, and that reference is to the tree of the project content — this is a snapshot of the state of the project. The commit object also contains a reference to its direct ancestors, along with the metadata “author” and “committer” and the commit message.

Many Git commands expect a tree as an argument. However, because a commit, for example, references a tree, this is called a tree-ish argument. This refers to any object that can last be resolved to a tree. This category also includes tags (see Sec. 3.1.3, “Tags — Marking Important Versions”). Similarly, commit-ish is an argument that can be resolved to a commit.

File contents are always stored in blobs. Trees only contain references to blobs and other trees in the form of the SHA-1 sums of these objects. A commit in turn references a tree.

2.2.3. The Object Database

All Git objects are stored in the object database and are identifiable by their unique SHA-1 sum, i.e. you can find an object in the database by its SHA-1 sum once it has been stored. Thus, the object database basically functions like a large hash table, where the SHA-1 sums serve as keys for the stored contents:⁠^[26]

e2c67eb ⟶ commit
8e2f5f9 ⟶ commit
308aea1 ⟶ commit
b0400b0 ⟶ commit
a26b00a ⟶ tree
6cf9be8 ⟶ blob  (README)
52ea6d6 ⟶ blob  (hello.py)
c37fd6f ⟶ tree  (test)
e92bf15 ⟶ blob  (test/test.sh)
5b4b58b ⟶ tree
dcc027b ⟶ blob  (hello.py)
e4dc644 ⟶ tree
a347f5e ⟶ tree

You will first see the four commits that make up the Git repository, including the e2c67eb commit shown in Figure 4, “Git Objects”. This is followed by trees and blobs, each with file or directory correspondence. So-called top-level trees have no directory name: They refer to the top level of a project. A commit always references a top-level tree, so there are four of them.

The hierarchical relationship of the objects listed above is shown in Figure 5, “Hierarchical Relationship of Git Objects”. On the left-hand side, you can see the four commits that are already in the repository, and on the right-hand side, the referenced contents of the most recent commit (C4). As described above, each commit contains a reference to its direct predecessor (the resulting graph of commits is discussed below). This relationship is illustrated by the arrows pointing from one commit to the next.

Figure 5. Hierarchical Relationship of Git Objects

Each commit references the top-level tree — including the C4 commit in the example. The top-level tree in turn references the files hello.py and README in the form of blobs, and the subdirectory test in the form of another tree. Because of this hierarchical structure and the relationship of the individual objects to one another, Git is able to map the contents of a hierarchical file system as Git objects and store them in the object database.

2.2.4. Examining the Object Database

In a short digression we will go into how to examine the object database of Git. To do this, Git provides so-called plumbing commands, a group of low-level tools for Git, as opposed to the porcelain commands you usually work with. These commands are therefore not important for Git beginners, but are simply intended to give you a different approach to the concept of the object database. For more information, see Sec. 8.3, “Writing Your Own Git Commands”.

Let’s first look at the current commit. We’ll use the git show command with the --format=raw option, so let’s output the commit in raw format, so that everything this commit contains is displayed.

$ git show --format=raw e2c67eb
commit e2c67ebb6d2db2aab831f477306baa44036af635
tree a26b00aaef1492c697fd2f5a0593663ce07006bf
parent 8e2f5f996373b900bd4e54c3aefc08ae44d0aac2
author Valentin Haenel <valentin.haenel@gmx.de> 1294515058 +0100
committer Valentin Haenel <valentin.haenel@gmx.de> 1294516312 +0100

    Kommentar fehlte
...

As you can see, all the information in Figure 4, “Git Objects” is output: the SHA-1 sums of the commit, tree, and direct ancestor, plus the author and committer (including the date as a Unix timestamp), and the commit description. The command also provides the diff output for the previous commit — but this is not part of the commit, strictly speaking, and is therefore omitted here.

Next, let’s take a look at the tree referenced by this commit, using git ls-tree, a plumbing command to list the contents stored in a tree. It’s similar to ls -l, except that it is in the object database. With --abbrev=7 we shorten the output SHA-1 sums to seven characters.

$ git ls-tree --abbrev=7 a26b00a
100644 blob 6cf9be8  README
100644 blob 52ea6d6  hello.py
040000 tree c37fd6f  test

As in Figure 4, “Git Objects” the tree referenced by the commit contains one blob for each of the two files, and one tree (also: subtree) for the test directory. We can look at its contents again with ls-tree, since we now know the SHA-1 sum of the tree. As expected, you can see that the test tree references exactly one blob, the blob for the file test.sh.

$ git ls-tree --abbrev=7 c37fd6f
100755 blob e92bf15  test.sh

Finally, we make sure that the blob for hello.py really contains our “Hello World!” program and that the SHA-1 sum is correct. The command git show shows any objects. If we pass the SHA-1 sum of a blob, its contents are output. To check the SHA-1 sum we use the plumbing command git hash-object.

$ git show 52ea6d6
#! /usr/bin/env python

""" Hello World! """

print 'Hello World!'
$ git hash-object hello.py
52ea6d6f53b2990f5d6167553f43c98dc8788e81

A note for curious readers: git hash-object hello.py does not produce the same output as the Unix command sha1sum hello.py. This is because not only the file content is stored in a blob. Instead, the object type, in this case blob, and the size, in this case 67 bytes, are stored in a header at the beginning of the blob. The hash-object command therefore does not calculate the checksum of the file content, but of the blob object.

2.2.5. Deduplication

The four commits that make up the sample repository are shown again in Figure 6, “Repository Content”, but in a different way: The dashed bordered tree and blob objects indicate unchanged objects, all others were added or changed in the corresponding commit. The reading direction here is from bottom to top: at the bottom is C1, which contains only the file hello.py.

Since trees only contain references to blobs and other trees, each commit stores the status of all files, but not their contents. Normally, only a few files change during a commit. New blob objects (and therefore new tree objects) are now created for the new files or those to which changes have been made. However, the references to the unchanged files remain the same.

Figure 6. Repository Content

Even more: A file that exists twice only exists once in the object database. The contents of this file are stored as a blob in the object database and are referenced by a tree in two places. This effect is known as deduplication: Duplicates are not only prevented, but not made possible in the first place. Deduplication is an essential feature of Content-Addressable File Systems, i.e. file systems that know files only by their contents (such as Git, for example, by giving an object the SHA-1 sum of itself as “name”).

Consequently, a repository in which the same 1 MB file exists 1000 times takes up only slightly more than 1 MB. Git essentially has to manage the blob, plus a commit and a tree with 1000 blob entries (20 bytes each plus the length of the filename). A checkout of this repository, on the other hand, consumes about 1 GB of space on the filesystem because Git resolves deduplication.⁠^[27]

The git checkout and git reset commands restore a previous state (see also Sec. 3.2, “Restoring Versions”): You specify the reference of the corresponding commit, and Git searches for it in the object database. The reference is then used to find the tree object of this commit from the object database. Finally, Git uses the references contained in the tree object to find all other tree and blob objects in the object database and replicates them as directories and files on the file system. This allows you to restore exactly the project state that was saved with the commit at the time.

2.2.6. The Graph Structure

Because each commit stores its direct ancestors, a graph structure is created. More precisely, the arrangement of the commits creates a Directed Acyclic Graph (DAG). A graph consists of two core elements: the nodes and the edges connecting these nodes. In a directed graph, the edges are also characterized by a direction, which means that when you run the graph, you can only use the edges that point in the appropriate direction to move from one node to the next. The acyclic property rules out that you can find your way back to a node by any route through the graph. So you cannot move in a circle.⁠^[28]

Most Git commands are used to manipulate the graph: to add/remove nodes or to change the relation of the nodes to each other. You’ll know you’ve reached an advanced level of Git competency when you’ve internalized this rather abstract concept, and when you’re working with branches on a daily basis, you always think of the graph behind them. Understanding Git at this level is the first and only real hurdle to mastering Git safely in everyday life.

The graph structure is derived from the object model, because each commit knows its direct ancestor (possibly several in the case of a merge commit). The commits form the nodes of this graph — the references to ancestors form the edges.

An example graph is shown in Figure 7, “A Commit Graph”. It consists of several commits, which are colored to make it easier to distinguish between their affiliations to different development branches. First, the commits A, B, C, and D were made. They form the main development branch. Commits E and F contain feature development, which was transferred to the main development branch with commit H. Commit G is a single commit that has not yet been integrated into the main development branch.

Figure 7. A Commit Graph

One result of the graph structure is the cryptographically secured integrity of a repository. Git uses the SHA-1 sum of a commit to reference not only the contents of the project files at a given point in time, but also all commits executed up to that point, and their relationship to each other, i.e. the complete version history.

The object model makes this possible: each commit stores a reference to its ancestors. These references are then used to calculate the SHA-1 sum of the commit itself. So you get a different commit if you reference another ancestor.

Since the predecessor in turn references predecessors, and its SHA-1 sum depends on the predecessors, and so on, this means that the complete version history is implicitly encoded in the commit ID. Implicit here means: If even one bit of a commit changes anywhere in the version history, then the SHA-1 sum of subsequent commits, especially the topmost one, is no longer the same. The SHA-1 sum doesn’t say anything detailed about the version history, though; it’s just a checksum of it.

2.2.6.1. References: Branches and Tags

However, there is not much you can do with a pure commit graph. To reference (i.e., work with) a node, you need to know its name, which is the SHA-1 sum of the commit. In everyday use, however, you rarely use the SHA-1 sum of a commit directly, but instead use symbolic names, called references, which Git can resolve to the SHA-1 sum.

Git basically offers two types of references, branches and tags. These are pointers to a commit graph, which are used to mark specific nodes. Branches have a “moving” character, meaning that they move up as new commits are added to the branch. Tags, on the other hand, are static in nature, and mark important points in the commit graph, such as releases.

Figure 8, “Example of a Commit Graph with Branches and Tags” shows the same commit graph with the master, HEAD, feature, and bugfix branches. And the v0.1 and v0.2 tags.

Figure 8. Example of a Commit Graph with Branches and Tags

12. By default, words are separated by one or more spaces, but you can specify another regular expression to determine what a word is: git diff --word-diff-regex=. See also the git-diff(1) man page.

13. This is an instruction for the Kernel, telling it which program to use to interpret the script. Typical shebang lines include #!/bin/sh or #!/usr/bin/perl.

14. Strictly speaking, the -p option leads directly to the patch mode of git add⁠’s interactive mode. However, the interactive mode is rarely used in practice — in contrast to the patch mode — and is therefore not described further here. The documentation for this can be found in the git add(1) man page in the “Interactive Mode” section.

15. Git then opens the hunk in an editor; below is a guide to editing the hunk: To delete deleted lines (prefixed with -) — i.e. not add them to the index, but keep them in the working tree! — replace the minus sign with a space (the line becomes “context”). To delete + lines, simply remove them from the hunk.

16. However, you can usually not split hunks arbitrarily. At least one line of context, i.e. a line without prefix + or -, must be in between. If you still want to split the hunk, you have to use e for edit.

17. You can see this information in gitk or with the command git log --pretty=fuller.

18. In fact, Git creates a new commit whose changes are a combination of the changes made to the old commit and the index. The new commit then replaces the old one.

19. git rm deletes a file with the next commit, but it remains in the commit history. For information on how to delete a file completely, including from the version history, see Sec. 8.4.1, “Removing Sensitive Information Afterwards”.

20. This and the following examples are from the Git repository.

21. You can download the Git repository, which is examined in detail on the following pages, with the command:
git clone git://github.com/gitbuch/objektmodell-beispiel.git

22. https://en.wikipedia.org/wiki/SHA-1, “Attacks”.

23. https://web.archive.org/web/20120701221412/http://kerneltrap.org/mailarchive/git/2006/8/27/211001

24. The technical documentation is provided in the man page gittutorial-2(7).

25. The tag object is not shown here because it is not necessary for understanding the object structure. Instead, you will find it in Figure 12, “The Tag Object”.

26. Git stores all objects under .git/objects. A distinction is made between loose objects and packfiles. “Loose” objects store the content in a file whose name corresponds to the SHA-1 sum of the content (Git stores one file per object). In contrast, packfiles are compressed archives of many objects. This is done for performance reasons: Not only is the transfer or storage of these archives more efficient, but the file system is also relieved.

27. Internally, of course, Git has mechanisms to recognize blobs as deltas of other blobs and to tie them together to packfiles to save space.

28. These two properties, directional and acyclic, are the only necessary constraint to be placed on a graph that represents changes over time: Neither can future changes be referenced (the direction of the edges always points to the past), nor can you arrive at a point from which the path is already marked (circular reasoning).