Appendix B: Repository Structure
Git stores the object database, the associated references, etc. in the so-called Git directory, often referred to as $GIT_DIR. By default, this is .git/. It exists only once for each Git repository, i.e. no additional .git/ directories are created in subdirectories.[154]
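The following is a minimal sketch of how to ask Git where the Git directory of a repository lives. The temporary repository and the subdirectory name src/deep are made up for illustration, and the option --absolute-git-dir assumes a reasonably recent Git version.

```shell
# Create a throwaway repository (the temporary path is only for illustration).
repo=$(mktemp -d)
git init -q "$repo"
mkdir -p "$repo/src/deep"

# From the top level, the Git directory is reported relative to the
# current working directory.
top_git_dir=$(cd "$repo" && git rev-parse --git-dir)
echo "$top_git_dir"

# From a subdirectory, Git walks upwards and finds the same single directory;
# no .git/ is created below the top level.
sub_git_dir=$(cd "$repo/src/deep" && git rev-parse --absolute-git-dir)
echo "$sub_git_dir"
```

If the environment variable GIT_DIR is set, Git uses that location instead of searching for .git/.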
Among other things, it contains the following entries:
HEAD | The […]
config | The repository configuration file, see Sec. 1.3, “Configuring Git”.
hooks/ | Contains the hooks set for this repository, see Sec. 8.2, “Hooks”.
index | The index or stage, see Sec. 2.1.1, “Index”.
info/ | Additional repository information, such as patterns to be ignored (see Sec. 4.4, “Ignoring Files”) and also grafts (see Sec. 8.4.3, “Grafts: Subsequent Merges”). You can put your own information there if other tools can handle it (see e.g. the section on caching of CGit, Sec. 7.5.4, “Exploiting Caching”).
logs/ | Log of changes to references; accessible via Reflog, see Sec. 3.7, “Reflog”. Contains a log file for each reference under […]
objects/ | The object database, see Sec. 2.2.3, “The Object Database”. For performance reasons, the objects are sorted into subdirectories that correspond to a two-character prefix of their SHA-1 sum (the commit […]
refs/ | All references, including branches in […]
A detailed technical description can be found in the man page gitrepository-layout(5).
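To make the layout tangible, the sketch below lists the Git directory of a freshly created throwaway repository and derives where a single loose object ends up under objects/. The repository path and the blob content are invented for the example. Which entries exist right after git init depends on the Git version and the installed templates; index and logs/, for instance, typically appear only after the first git add or commit.

```shell
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"

# A fresh repository already contains several of the entries listed above.
ls -F .git

# Write one blob into the object database and derive its on-disk location:
# the first two hex characters name the subdirectory, the rest the file.
oid=$(echo 'hello' | git hash-object -w --stdin)
loose_path=".git/objects/$(echo "$oid" | cut -c1-2)/$(echo "$oid" | cut -c3-)"
ls -l "$loose_path"
```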
B.1. Cleaning Up
As mentioned in Sec. 3.1.2, “Managing Branches”, commits that are no longer referenced (whether by branches or by other commits) are no longer accessible. This is usually the case if you wanted to delete a commit or have rebuilt commits with Rebase. Git does not delete such commits from the object database immediately, but keeps them there for two weeks by default, even though they are no longer accessible.
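This behavior can be observed in a small experiment. The sketch below uses a throwaway repository with invented file names and commit messages, moves a branch back by one commit, and then checks that the dropped commit is still present in the object database. The helper function commit is only a convenience defined for this example.

```shell
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"
# Identity passed inline so the sketch does not depend on global configuration.
commit() { git -c user.name=Example -c user.email=example@example.invalid commit -q "$@"; }

echo one > file && git add file && commit -m 'first'
echo two > file && commit -am 'second'
dropped=$(git rev-parse HEAD)

# Move the branch back: 'second' is no longer referenced by any branch.
git reset -q --hard HEAD~1

# The object is still present in the object database ...
still_there=$(git cat-file -t "$dropped")
echo "$still_there"

# ... and fsck lists objects that are unreachable once reflogs are ignored.
git fsck --unreachable --no-reflogs
```

As long as the object has not been pruned, the commit can be recovered, for example by creating a branch that points to its ID.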
Internally, Git uses the commands prune, prune-packed, fsck, repack, etc. for this cleanup.
You rarely need to call these tools yourself, however, because the garbage collection runs them automatically with appropriate options: git gc.
The tool performs the following tasks:
- Delete Dangling and Unreachable Objects. These occur during various operations and can usually be deleted after some time to save space (default: after two weeks).
- Re-pack Loose Objects. Git uses so-called packfiles to pack several Git objects together. (There is then no longer one file under .git/objects/ per blob, tree and commit; instead, these are combined into one large, zlib-compressed file.)
- Search existing packfiles for old (unreachable) objects and “thin out” the packfiles accordingly. If necessary, several small packfiles are combined into larger ones.
- Packing references. This results in so-called Packed Refs, see also Sec. 3.1, “References: Branches and Tags”.
- Delete old Reflog entries. By default this happens after 90 days.
- Discard old conflict resolutions (see Rerere, Sec. 3.4.2, “Rerere: Reuse Recorded Resolution”). The hold time is 15 days for unresolved and 60 days for resolved conflicts.
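The re-packing step in the list above can be made visible with git count-objects. The following sketch builds a throwaway repository with three invented commits and compares the number of loose objects before and after git gc; it assumes the default configuration.

```shell
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"
for i in 1 2 3; do
    echo "revision $i" > file
    git add file
    git -c user.name=Example -c user.email=example@example.invalid commit -q -m "commit $i"
done

# Loose objects before: one file per blob, tree and commit.
loose_before=$(git count-objects -v | awk '/^count:/ {print $2}')

git gc -q

# Afterwards the reachable objects live in a packfile.
loose_after=$(git count-objects -v | awk '/^count:/ {print $2}')
packs_after=$(git count-objects -v | awk '/^packs:/ {print $2}')
echo "loose: $loose_before -> $loose_after, packs: $packs_after"
```

The hold times mentioned in the list are, to my knowledge, controlled by the settings gc.pruneExpire, gc.reflogExpire, gc.rerereResolved and gc.rerereUnresolved; see the man page of git gc for details.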
The garbage collection has three modes: automatic, normal and aggressive.
You call the automatic mode via git gc --auto.
This mode checks whether there are really blatant flaws in the Git repository.
What “blatant” means is configurable.
The following configuration settings allow you to determine (globally or per repository) the number of “small” files above which the automatic mode cleans up, i.e. groups them into large archives.
gc.auto (Default: 6700 objects)
    Combine objects into a packfile.
gc.autopacklimit (Default: 50 packs)
    Combine packs into one large pack file.
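As a sketch, the two thresholds could be lowered for a single repository as follows. The values 1000 and 20 are arbitrary examples, not recommendations, and the repository is a throwaway one created only for this illustration.

```shell
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"

# Per repository: pack as soon as roughly 1000 loose objects accumulate
# (arbitrary example value).
git config gc.auto 1000
# Per repository: consolidate once more than 20 packs exist (also arbitrary).
git config gc.autoPackLimit 20

auto_threshold=$(git config --get gc.auto)
pack_threshold=$(git config --get gc.autoPackLimit)
echo "gc.auto=$auto_threshold gc.autoPackLimit=$pack_threshold"

# For a setting that applies to all repositories, add --global instead:
#   git config --global gc.auto 1000
```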
The automatic mode is called frequently, among others by receive-pack and by rebase (interactive).
In most cases the automatic mode does nothing, because the defaults are very conservative.
If it does, it looks like this:
    $ git gc --auto
    Auto packing the repository for optimum performance. You may also
    run "git gc" manually. See "git help gc" for more information.
    ...
B.2. Performance
You should either significantly lower the thresholds above which the automatic garbage collection takes effect, or call git gc from time to time.
This has one obvious advantage, namely that disk space is saved:
    $ du -sh .git
    20M .git
    $ git gc
    Counting objects: 3726, done.
    Compressing objects: 100% (1639/1639), done.
    Writing objects: 100% (3726/3726), done.
    Total 3726 (delta 1961), reused 2341 (delta 1279)
    Removing duplicate objects: 100% (256/256), done.
    $ du -sh .git
    6.3M .git
Individual objects under .git/objects/ have been combined into a packfile:
    $ ls -lh .git/objects/pack/pack-a97624dd23<...>.pack
    -r-------- 1 feh feh 4.6M Jun 1 10:20 .git/objects/pack/pack-a97624dd23<...>.pack
    $ file .git/objects/pack/pack-a97624dd23<...>.pack
    .git/objects/pack/pack-a97624dd23<...>.pack: Git pack, version 2, 3726 objects
You can use git count-objects to output how many files the object database consists of.
Here is the output side by side, before (left) and after (right) the above packing process:
    $ git count-objects -v
    count: 1905           count: 58
    size: 12700           size: 456
    in-pack: 3550         in-pack: 3726
    packs: 7              packs: 1
    size-pack: 4842       size-pack: 4716
    prune-packable: 97    prune-packable: 0
    garbage: 0            garbage: 0
Nowadays disk space is cheap, so a repository compressed to 30% of its size is not a big gain. But the performance gain is not to be scoffed at. Usually, reading one object (e.g. a commit) leads Git to further objects (blobs, trees). So if Git has to open one file per object (i.e. at least n blob objects for n managed files), this means at least n read operations on the file system.
Packfiles have two major advantages. First, Git creates an index for each packfile, which records at which offset of the file each object is found. In addition, the packing routine uses a heuristic to optimize object placement within the file (so that, for example, a tree object and the blob objects it references are stored “close” to each other). This allows Git to simply map the packfile into memory (keyword: “sliding mmap”). The “search object X” operation is then nothing more than a lookup in the pack index followed by a read at the corresponding location in the packfile, i.e. in memory. This relieves the file system and the operating system considerably.
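The pack index can be inspected directly. The sketch below packs a throwaway repository with a single invented commit and prints the index with git show-index, which reads an .idx file from standard input. As far as I know, each output line consists of the offset within the packfile, the object name and a checksum.

```shell
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"
echo 'content' > file
git add file
git -c user.name=Example -c user.email=example@example.invalid commit -q -m 'initial'
git gc -q

# Each packfile comes with an .idx file next to it.
idx=$(ls .git/objects/pack/pack-*.idx)

# One line per packed object: where it starts in the packfile and its name.
git show-index < "$idx"
indexed=$(git show-index < "$idx" | wc -l)
echo "objects in index: $indexed"
```

Looking up an object then amounts to finding its name in this index and reading the packfile at the listed offset.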
The second advantage of packfiles is delta compression: where possible, objects are stored as deltas (changes) relative to other objects.[155]
This saves storage space, and it also enables commands like git blame to detect copies of code pieces between files “inexpensively”, i.e. without much computing effort.
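Whether an object was stored as a delta can be checked with git verify-pack -v. The sketch below commits two nearly identical revisions of an invented file, numbers.txt, packs the repository and counts the delta entries. It relies on the assumption that delta lines in the verbose output carry two extra columns (delta depth and base object), i.e. seven fields in total; whether Git actually chooses a delta for a given object is up to its packing heuristics.

```shell
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"
# Helper defined only for this example; passes an identity inline.
commit() { git -c user.name=Example -c user.email=example@example.invalid commit -q "$@"; }

# Two revisions of a larger file that differ only in one appended line.
seq 1 2000 > numbers.txt
git add numbers.txt && commit -m 'first revision'
echo 'one more line' >> numbers.txt
commit -am 'second revision'
git gc -q

idx=$(ls .git/objects/pack/pack-*.idx)

# Verbose listing of all packed objects, followed by delta statistics.
git verify-pack -v "$idx"

# Deltified objects have seven fields; full objects have five.
deltas=$(git verify-pack -v "$idx" | awk 'NF == 7' | wc -l)
echo "objects stored as deltas: $deltas"
```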
The aggressive mode should only be used in justified exceptional cases.[156]
Run a If the repository is particularly large, it can take a long time for the server to count all objects in a |