Appendix B: Repository Structure

Git stores the object database, the associated references, etc. in the so-called Git directory, often referred to as $GIT_DIR. By default, this is .git/. It exists only once for each Git repository, i.e. no additional .git/ directories are created in subdirectories.⁠[154] Among other things, it contains the following entries:

HEAD

The HEAD, see Sec. 3.1.1, “HEAD and Other Symbolic References”. Besides HEAD, other important symbolic references may be stored on the top level, e.g. ORIG_HEAD or FETCH_HEAD.

config

The repository configuration file, see Sec. 1.3, “Configuring Git”.

hooks/

Contains the hooks set for this repository, see Sec. 8.2, “Hooks”.

index

The index or stage, see Sec. 2.1.1, “Index”.

info/

Additional repository information, such as patterns to be ignored (see Sec. 4.4, “Ignoring Files”) and also grafts (see Sec. 8.4.3, “Grafts: Subsequent Merges”). You can put your own information there if other tools can handle it (see e.g. the section on caching of CGit, Sec. 7.5.4, “Exploiting Caching”).

logs/

Log of changes to references; accessible via Reflog, see Sec. 3.7, “Reflog”. Contains a log file for each reference under refs/ and HEAD.

objects/

The object database, see Sec. 2.2.3, “The Object Database”. For performance reasons, the objects are sorted into subdirectories that correspond to a two-character prefix of their SHA-1 sum (the commit 0a7ba55…​ is stored below 0a/7ba55…​). In the subdirectory pack/ you will find the packfiles and associated indices, which are created by the garbage collection (see below). In the info/ subdirectory, Git will store a list of existing pack files if required.

refs/

All references, including branches in refs/heads/, see Sec. 3.1.1, “HEAD and Other Symbolic References”, tags in refs/tags/, see Sec. 3.1.3, “Tags — Marking Important Versions”, and remote tracking branches under refs/remotes/, see Sec. 5.2.2, “Remote-Tracking-Branches”.

A detailed technical description can be found in the man page gitrepository-layout(5).

Figure 64. The most important entries in .git/
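The layout described above can be explored in a scratch repository. The following is a minimal sketch, assuming a POSIX shell and an installed git; the variable names and the temporary directory are chosen for illustration only. It locates the Git directory, resolves HEAD and derives the path of a freshly written blob from the two-character prefix of its hash.

```shell
set -e
repo=$(mktemp -d)            # scratch directory, hypothetical location
cd "$repo"
git init -q .

# Where is the Git directory? For a normal (non-bare) repository this is .git
git_dir=$(git rev-parse --git-dir)
echo "GIT_DIR: $git_dir"

# Entries present right after "git init"
ls -F "$git_dir"

# HEAD is a symbolic reference to a branch below refs/heads/
head_target=$(git symbolic-ref HEAD)
echo "HEAD -> $head_target"

# Write a blob, then split its hash into the two-character directory
# prefix and the remaining file name below objects/
hash=$(echo 'hello' | git hash-object -w --stdin)
prefix=$(printf '%s' "$hash" | cut -c1-2)
rest=$(printf '%s' "$hash" | cut -c3-)
object_path="$git_dir/objects/$prefix/$rest"
ls -l "$object_path"
```

Entries such as index and logs/ typically appear only once you have staged and committed something.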

B.1. Cleaning Up

As mentioned in Sec. 3.1.2, “Managing Branches”, for example, commits that are no longer referenced (whether by branches or by other commits) are no longer accessible. This is usually the case when you have deliberately discarded a commit or rebuilt commits with rebase. Git does not delete such commits from the object database immediately, but leaves them there for two weeks by default, even though they are no longer accessible.

For cleaning up, Git internally uses the commands prune, prune-packed, fsck, repack, etc. You rarely need to call them yourself, however: the garbage collection, git gc, runs these tools automatically with appropriate options. It performs the following tasks:

  • Delete Dangling and Unreachable Objects. These occur during various operations and can usually be deleted after some time to save space (default: after two weeks).

  • Re-pack Loose Objects. Git uses so-called packfiles to pack several Git objects together. There is then no longer one file under .git/objects/ per blob, tree and commit; instead, these are combined into one large, zlib-compressed file.

  • Search existing packfiles for old (unreachable) objects and “thin out” the packfiles accordingly. If necessary, several small packfiles are combined to large ones.

  • Delete old Reflog entries. By default this happens after 90 days.
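The effect of the grace periods can be observed in a scratch repository. The sketch below assumes a POSIX shell and an installed git; the identity, file names and variable names are made up for the demonstration. It makes a commit unreachable by amending it, shows that a plain git gc keeps it, and then removes it by expiring the reflog and pruning without a grace period. Do not run the last two commands on a repository whose history you still need.

```shell
set -e
repo=$(mktemp -d)                          # scratch repository
cd "$repo"
git init -q .
git config user.name "Example User"        # hypothetical identity
git config user.email "user@example.invalid"

echo one > file.txt
git add file.txt
git commit -q -m "first version"
old=$(git rev-parse HEAD)

# Amending replaces the commit; the branch no longer references the old one
echo two > file.txt
git commit -q -a --amend -m "first version, amended"

# A plain gc keeps the old commit: it is recent and still listed in the reflog
git gc -q
if git cat-file -e "$old" 2>/dev/null; then kept_by_default=yes; else kept_by_default=no; fi

# Expire all reflog entries and prune without any grace period
git reflog expire --expire=now --all
git gc -q --prune=now
if git cat-file -e "$old" 2>/dev/null; then still_there=yes; else still_there=no; fi

echo "kept by plain gc: $kept_by_default, present after pruning: $still_there"
```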

The garbage collection has three modes: automatic, normal and aggressive. You invoke the automatic mode with git gc --auto; it checks whether the repository is in a blatantly poor state, i.e. whether too many loose objects or packfiles have accumulated. What “blatant” means is configurable. The following settings determine (globally or per repository) above how many loose objects or packs the automatic mode steps in and combines them into larger packfiles.

gc.auto (Default: 6700 objects)

If the repository contains roughly more than this many loose objects, combine them into a packfile.

gc.autopacklimit (Default: 50 packs)

If the repository contains more than this many packs, combine them into one large packfile.
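Both thresholds are set like any other configuration value. A minimal sketch, assuming a scratch repository; the values 1000 and 10 are arbitrary examples, not recommendations:

```shell
set -e
repo=$(mktemp -d)                 # scratch repository
cd "$repo"
git init -q .

# Unless configured somewhere, the key is unset and the built-in default applies
git config --get gc.auto || echo "gc.auto is not set, the built-in default applies"

# Lower both thresholds for this repository only
git config gc.auto 1000
git config gc.autoPackLimit 10

auto=$(git config --get gc.auto)
limit=$(git config --get gc.autoPackLimit)
echo "gc.auto=$auto gc.autoPackLimit=$limit"

# To apply a value to all your repositories, add --global, e.g.:
#   git config --global gc.auto 1000
```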

The automatic mode is invoked frequently, for example by receive-pack and by interactive rebase. In most cases it does nothing, because the defaults are very conservative. If it does run, the output looks like this:

$ git gc --auto
Auto packing the repository for optimum performance. You may also
run "git gc" manually. See "git help gc" for more information.
...

B.2. Performance

You should either significantly lower the thresholds above which the automatic garbage collection takes effect, or call git gc from time to time. This has one obvious advantage, namely that disk space is saved:

$ du -sh .git
20M     .git
$ git gc
Counting objects: 3726, done.
Compressing objects: 100% (1639/1639), done.
Writing objects: 100% (3726/3726), done.
Total 3726 (delta 1961), reused 2341 (delta 1279)
Removing duplicate objects: 100% (256/256), done.
$ du -sh .git
6.3M    .git

Individual objects under .git/objects/ have been combined into a packfile:

$ ls -lh .git/objects/pack/pack-a97624dd23<...>.pack
-r-------- 1 feh feh 4.6M Jun  1 10:20 .git/objects/pack/pack-a97624dd23<...>.pack
$ file .git/objects/pack/pack-a97624dd23<...>.pack
.git/objects/pack/pack-a97624dd23<...>.pack: Git pack, version 2, 3726 objects

You can use git count-objects to output how many files the object database consists of. Here is the output before (left) and after (right) the packing process above, side by side:

$ git count-objects -v
count: 1905                             count: 58
size: 12700                             size: 456
in-pack: 3550                           in-pack: 3726
packs: 7                                packs: 1
size-pack: 4842                         size-pack: 4716
prune-packable: 97                      prune-packable: 0
garbage: 0                              garbage: 0
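The same comparison can be reproduced with a script. The following sketch assumes a POSIX shell with awk and an installed git; the helper function field and the 20 sample commits are invented for the demonstration. It reads the count and packs fields before and after git gc.

```shell
set -e
repo=$(mktemp -d)                          # scratch repository
cd "$repo"
git init -q .
git config user.name "Example User"        # hypothetical identity
git config user.email "user@example.invalid"

# Create some history so that loose objects accumulate
i=1
while [ "$i" -le 20 ]; do
    echo "revision $i" > file.txt
    git add file.txt
    git commit -q -m "revision $i"
    i=$((i + 1))
done

# field NAME: print the value of one line of "git count-objects -v"
field() {
    git count-objects -v | awk -v key="$1:" '$1 == key { print $2 }'
}

loose_before=$(field count)
packs_before=$(field packs)
git gc -q
loose_after=$(field count)
packs_after=$(field packs)

echo "loose objects: $loose_before -> $loose_after"
echo "packs:         $packs_before -> $packs_after"
```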

Nowadays disk space is cheap, so shrinking a repository to about 30% of its size is not a big gain. But the performance gain is not to be scoffed at. Reading one object (e.g. a commit) usually entails reading further objects (trees, blobs). If Git has to open one file per object (i.e. at least n blob objects for n managed files), this means at least n read operations on the file system.

Packfiles have two major advantages. First, Git creates an index for each packfile that records at which offset of the file each object is found. In addition, the packing routine uses a heuristic to optimize object placement within the file (so that, for example, a tree object and the blob objects it references are stored “close” to each other). This allows Git to simply map the packfile into memory (keyword: “sliding mmap”). The “find object X” operation is then nothing more than a lookup in the pack index followed by a read at the corresponding position in the packfile, i.e. in memory. This relieves the file system and the operating system considerably.
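The contents of a pack index can be listed with git verify-pack -v. The sketch below assumes a POSIX shell with awk and a scratch repository that contains a single pack after git gc; identity and file names are made up. For each object the command prints its name, type, size, size in the packfile and offset in the packfile, so the offset of the HEAD commit can be read from the fifth column.

```shell
set -e
repo=$(mktemp -d)                          # scratch repository
cd "$repo"
git init -q .
git config user.name "Example User"        # hypothetical identity
git config user.email "user@example.invalid"

echo content > file.txt
git add file.txt
git commit -q -m "initial commit"
git gc -q                                  # packs everything into one packfile

idx=$(ls .git/objects/pack/pack-*.idx)     # exactly one index in this scratch repo
head=$(git rev-parse HEAD)

# Columns: object name, type, size, size in packfile, offset in packfile
entry=$(git verify-pack -v "$idx" | awk -v h="$head" '$1 == h')
type=$(echo "$entry" | awk '{ print $2 }')
offset=$(echo "$entry" | awk '{ print $5 }')
echo "$type $head starts at byte offset $offset of the packfile"
```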

The second advantage of packfiles is delta compression: where possible, objects are stored as deltas (changes) relative to other objects.[155] This saves storage space, but on the other hand also enables commands like git blame to detect copies of code pieces between files “inexpensively”, i.e. without much computing effort.
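Whether an object was stored as a delta is also visible in the output of git verify-pack -v: deltified objects carry two additional columns, the depth of the delta chain and the name of the base object. The following sketch assumes a POSIX shell with awk; the file numbers.txt and its contents are invented so that successive versions differ only slightly. Which objects Git actually deltifies is up to its heuristics, so the count may vary.

```shell
set -e
repo=$(mktemp -d)                          # scratch repository
cd "$repo"
git init -q .
git config user.name "Example User"        # hypothetical identity
git config user.email "user@example.invalid"

# A file of 500 lines, then five revisions that each append one line
awk 'BEGIN { for (i = 1; i <= 500; i++) print "line " i }' > numbers.txt
git add numbers.txt
git commit -q -m "base version"
i=1
while [ "$i" -le 5 ]; do
    echo "appended line $i" >> numbers.txt
    git commit -q -a -m "append line $i"
    i=$((i + 1))
done

git gc -q
idx=$(ls .git/objects/pack/pack-*.idx)

# Deltified objects have seven columns: the usual five plus depth and base object
deltas=$(git verify-pack -v "$idx" |
    awk '$2 == "blob" && NF == 7 { n++ } END { print n + 0 }')
echo "blobs stored as deltas: $deltas"
```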

The aggressive mode should only be used in justified exceptional cases.⁠[156]

Run git gc on your publicly accessible repositories on a regular basis, e.g. via cron. The git protocol always transmits commits as packfiles, which are generated on demand, i.e. at the time of retrieval. If the entire repository is already available as one large packfile, parts of it can be extracted more quickly, and a complete clone does not require any additional computation (no huge packfile has to be built first). Regular garbage collection therefore reduces the load on your server and also speeds up cloning for users.
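A cron job for this purpose could call a small helper like the one sketched below. The function name gc_all_repos, the convention that bare repositories are directories named *.git and the paths in the crontab comment are assumptions made for illustration; adapt them to your server layout.

```shell
# gc_all_repos DIR
# Run "git gc" in every bare repository (directory named *.git) directly below DIR.
gc_all_repos() {
    base_dir=$1
    for repo_dir in "$base_dir"/*.git; do
        [ -d "$repo_dir" ] || continue      # the glob matched nothing
        if git --git-dir="$repo_dir" gc --quiet; then
            echo "collected: $repo_dir"
        else
            echo "gc failed: $repo_dir" >&2
        fi
    done
}

# Demonstration on a scratch directory containing one empty bare repository
base=$(mktemp -d)
git init -q --bare "$base/project.git"
result=$(gc_all_repos "$base")
echo "$result"

# Hypothetical crontab entry, assuming the function is wrapped in a script
# /usr/local/bin/gc-all-repos and the repositories live below /srv/git:
#   30 3 * * 0  /usr/local/bin/gc-all-repos /srv/git
```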

If the repository is particularly large, it can take the server a long time to count all objects during a git clone. You can speed this up by regularly calling git repack -A -d -b from a cron job: Git then creates a bitmap file in addition to the packfiles, which speeds up this process by one or two orders of magnitude.
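The effect of the -b option can be checked in a scratch repository. This sketch assumes a POSIX shell and a git version with bitmap support; identity and file names are made up. After the repack, a file with the suffix .bitmap should sit next to the packfile and its index.

```shell
set -e
repo=$(mktemp -d)                          # scratch repository
cd "$repo"
git init -q .
git config user.name "Example User"        # hypothetical identity
git config user.email "user@example.invalid"

echo content > file.txt
git add file.txt
git commit -q -m "initial commit"

# Repack everything into one pack, delete redundant packs, write a bitmap index
git repack -A -d -b -q

ls -l .git/objects/pack/
bitmaps=$(ls .git/objects/pack/ | grep -c '\.bitmap$' || true)
echo "bitmap files: $bitmaps"
```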


154. Since a bare repository (see Sec. 7.1.3, “Bare Repositories: Repositories Without Working Tree”) does not have a working tree, the contents normally located in .git form the top level in the directory structure, and there is no additional .git directory.
155. This is not to be confused with version control systems that store incremental versions of a file. Within packfiles, objects are packed independently of their semantic context, i.e. especially their temporal sequence.
156. A detailed discussion of the topic can be found at https://metalinguist.wordpress.com/2007/12/06/the-woes-of-git-gc-aggressive-and-how-git-deltas-work/