Learning Objectives

By the end of this module, you will be able to:

  1. Explain how Git functions as a content-addressable filesystem and database
  2. Describe the four object types (blobs, trees, commits, tags) and how they relate to each other
  3. Use plumbing commands to inspect and create objects in Git's database directly
  4. Navigate the .git directory and understand the purpose of each subdirectory
  5. Build a commit from scratch using only low-level plumbing commands

1. Git Is a Database

Most people think of Git as a version control tool — something that tracks changes to files. That's true at a high level, but underneath, Git is better understood as a content-addressable database.

What does that mean?

  • You give Git some content (a file, a directory listing, a commit)
  • Git computes a cryptographic hash of that content
  • Git stores the content in its database, using the hash as the key
  • To retrieve the content later, you look it up by its hash

Every object in Git's database is identified by a SHA-1 hash — a 40-character hexadecimal string like 5a32e7b9c1d4f8a2b6e3c0d9f7a1b4e8c2d5f6a3. The hash is computed from the content itself, which gives Git three powerful properties:

  1. Identical content always produces the same hash — if two files have the same content, they're stored once
  2. Any change produces a completely different hash — even changing one byte changes the hash entirely
  3. Corruption is detectable — if the stored data doesn't match its hash, something is wrong

Analogy: Think of a library where every book's shelf location is determined by a fingerprint of its contents. You don't need a catalog — the content itself tells you where to find it. And if someone tampers with a book, the fingerprint won't match the shelf label.
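The three properties can be observed directly with git hash-object. The following is a minimal sketch in a throwaway repository; the variable names and sample strings are illustrative and not part of the module.

```shell
# Work in a scratch repository so nothing touches a real project.
repo=$(mktemp -d) && cd "$repo" && git init -q

# Identical content always maps to the same object ID.
id_a=$(printf 'hello\n' | git hash-object --stdin)
id_b=$(printf 'hello\n' | git hash-object --stdin)

# Changing a single byte (o -> p) yields an unrelated ID.
id_c=$(printf 'hellp\n' | git hash-object --stdin)

printf '%s\n%s\n%s\n' "$id_a" "$id_b" "$id_c"
```

The first two lines of output match; the third shares no obvious pattern with them.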

Abbreviating Hashes

You rarely need the full 40 characters. Git accepts abbreviated hashes as long as they're unambiguous in your repository:

git show 5a32e7b9c1d4f8a2b6e3c0d9f7a1b4e8c2d5f6a3   # full hash
git show 5a32e7b                                        # abbreviated — usually enough
git show 5a32                                           # minimum 4 characters

The --oneline flag on git log shows abbreviated hashes by default:

git log --oneline
# a1b2c3d Add user authentication
# e4f5a6b Fix login page layout
# 7c8d9e0 Initial commit
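To see abbreviation and expansion side by side, git rev-parse can convert in both directions. This is a small sketch in a scratch repository; the identity values and the empty-tree commit are placeholders chosen only so there is an object to look up.

```shell
repo=$(mktemp -d) && cd "$repo" && git init -q
export GIT_AUTHOR_NAME='Jane Doe' GIT_AUTHOR_EMAIL='jane@example.com'
export GIT_COMMITTER_NAME='Jane Doe' GIT_COMMITTER_EMAIL='jane@example.com'

# Create one commit on an empty tree so there is something to abbreviate.
commit=$(git commit-tree "$(git write-tree)" -m 'Initial commit')

# Ask Git for an abbreviation that is unambiguous in this repository...
short=$(git rev-parse --short "$commit")
# ...then expand the abbreviation back to the full ID.
full=$(git rev-parse "$short")

printf 'short: %s\nfull:  %s\n' "$short" "$full"
```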

2. The Four Object Types

Git stores exactly four types of objects in its database. Everything Git does is built on top of these four primitives.

Object Hierarchy

  ┌──────────┐
  │   Tag    │──────► annotated tag metadata + pointer to commit
  └──────────┘
       │
       ▼
  ┌──────────┐
  │  Commit  │──────► author, date, message + pointer to tree
  └──────────┘
       │
       ▼
  ┌──────────┐        ┌──────────┐
  │   Tree   │──────► │   Tree   │  (subdirectory)
  └──────────┘        └──────────┘
       │                   │
       ▼                   ▼
  ┌──────────┐        ┌──────────┐
  │   Blob   │        │   Blob   │  (files)
  └──────────┘        └──────────┘

Blobs — File Contents

A blob (binary large object) stores the contents of a single file. Not the filename — just the raw content.

Key properties:

  • Two files with identical content (even with different names) share the same blob
  • A blob has no knowledge of its own filename, permissions, or location
  • Blobs are the "leaves" of Git's tree structure
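The first property is easy to check: write two differently named files with the same bytes and store both. The sketch below assumes a scratch repository; the file names are made up for illustration.

```shell
repo=$(mktemp -d) && cd "$repo" && git init -q

# Two files, different names, identical bytes.
printf 'same bytes\n' > notes.txt
printf 'same bytes\n' > copy-of-notes.txt

# -w stores each file's content as a blob and prints its ID.
id_1=$(git hash-object -w notes.txt)
id_2=$(git hash-object -w copy-of-notes.txt)

# Count the object files actually present in the database.
stored=$(find .git/objects -type f | wc -l)

printf '%s\n%s\nobjects on disk: %s\n' "$id_1" "$id_2" "$stored"
```

Both IDs are the same, and only one object file exists, because the filename never enters the hash.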

Trees — Directory Listings

A tree represents a directory. It contains a list of entries, where each entry maps a name and file mode to either a blob (file) or another tree (subdirectory).

tree 8a3b2c1
├── 100644 blob a1b2c3d   README.md
├── 100644 blob e4f5a6b   app.py
├── 100755 blob 7c8d9e0   run.sh
└── 040000 tree f1e2d3c   src/

The file modes:

  • 100644 — regular file
  • 100755 — executable file
  • 040000 — subdirectory (tree)
  • 120000 — symbolic link
  • 160000 — submodule (gitlink)

Commits — Snapshots with Metadata

A commit is a pointer to a tree (the project's root directory at that point in time) plus metadata:

commit 5a32e7b

tree     8a3b2c1                          ← root tree (snapshot of project)
parent   e4f5a6b                          ← previous commit (missing for first commit)
author   Jane Doe <jane@example.com>      ← who wrote the change
         1705334400 +0000                 ← when
committer Jane Doe <jane@example.com>     ← who created the commit object
          1705334400 +0000                ← when

Add user authentication                   ← commit message

Important details:

  • parent — a reference to the previous commit. The first commit has no parent. A merge commit has two (or more) parents.
  • author vs. committer — usually the same person. They differ when someone applies another person's patch (e.g., git am, git cherry-pick).
  • The commit points to a tree, not to individual files. This is how Git stores snapshots, not deltas.
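Because the metadata is part of the commit object, it is also part of the commit's hash. The sketch below, run in a scratch repository, commits the same (empty) tree three times; make_commit is a hypothetical helper, and the timestamps are arbitrary.

```shell
repo=$(mktemp -d) && cd "$repo" && git init -q
export GIT_AUTHOR_NAME='Jane Doe' GIT_AUTHOR_EMAIL='jane@example.com'
export GIT_COMMITTER_NAME='Jane Doe' GIT_COMMITTER_EMAIL='jane@example.com'

# An empty index still yields a valid (empty) tree to commit.
tree=$(git write-tree)

# Hypothetical helper: commit $tree with a fixed timestamp ($1).
make_commit() {
  GIT_AUTHOR_DATE="$1" GIT_COMMITTER_DATE="$1" \
    git commit-tree "$tree" -m 'Same message'
}

c1=$(make_commit '1705334400 +0000')
c2=$(make_commit '1705334400 +0000')   # identical metadata
c3=$(make_commit '1705334401 +0000')   # one second later

printf '%s\n%s\n%s\n' "$c1" "$c2" "$c3"
```

The first two IDs are identical because every byte of the commit object matches; the third differs even though the tree and message are unchanged.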

Tags — Named References to Commits

There are two kinds:

  • Lightweight tags — just a named pointer to a commit (not stored as an object)
  • Annotated tags — a full object with a tagger name, date, message, and a pointer to a commit

tag v1.0.0

object   5a32e7b                          ← the commit being tagged
type     commit
tagger   Jane Doe <jane@example.com>
         1705334400 +0000

Release version 1.0.0                     ← tag message

Annotated tags are preferred for releases because they carry metadata and can be signed with GPG/SSH.
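The difference between the two kinds shows up when you ask Git what a tag name resolves to. This is a sketch in a scratch repository; the tag name v1.0.0-light and the empty-tree commit are assumptions made for the example.

```shell
repo=$(mktemp -d) && cd "$repo" && git init -q
export GIT_AUTHOR_NAME='Jane Doe' GIT_AUTHOR_EMAIL='jane@example.com'
export GIT_COMMITTER_NAME='Jane Doe' GIT_COMMITTER_EMAIL='jane@example.com'

commit=$(git commit-tree "$(git write-tree)" -m 'Initial commit')

# Lightweight tag: only a ref, no new object.
git tag v1.0.0-light "$commit"
# Annotated tag: creates a tag object that points at the commit.
git tag -a v1.0.0 -m 'Release version 1.0.0' "$commit"

light_type=$(git cat-file -t v1.0.0-light)
annot_type=$(git cat-file -t v1.0.0)
# ^{commit} peels the tag object back to the commit it wraps.
peeled=$(git rev-parse 'v1.0.0^{commit}')

printf 'lightweight -> %s\nannotated   -> %s\n' "$light_type" "$annot_type"
```

The lightweight tag resolves straight to a commit, while the annotated tag resolves to a tag object; git cat-file -p v1.0.0 prints its tagger and message.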


3. Snapshots, Not Deltas

This is a fundamental design difference between Git and most earlier VCS tools.

Delta-Based Storage (SVN, CVS)

File A:  v1 ──Δ1──► v2 ──Δ2──► v3 ──Δ3──► v4
File B:  v1 ──Δ1──► v2 ──Δ2──► v3
File C:  v1 ──────────────────────────Δ1──► v2

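To reconstruct File A at version 4, a simple forward-delta system starts with v1 and applies three deltas. Real systems vary in the details (RCS and CVS, for example, keep the newest revision in full and apply reverse deltas), but the trade-off is the same: versions far from the stored base are slow to rebuild, and a corrupted delta breaks every version that depends on it.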

Snapshot-Based Storage (Git)

Commit 1:  [A₁] [B₁] [C₁]
Commit 2:  [A₂] [B₁] [C₁]     ← B and C unchanged: pointers to same blobs
Commit 3:  [A₂] [B₂] [C₁]     ← A unchanged: pointer to same blob as Commit 2
Commit 4:  [A₃] [B₂] [C₂]

Every commit records the complete state of the entire project. When a file hasn't changed, Git doesn't store a duplicate; the tree simply points to the same blob as before. Immutable objects that share structure like this form what is known as a persistent data structure.

Why Snapshots Win

  • Checkout is fast — switching to any commit means reading one tree, not replaying hundreds of deltas
  • Comparison is fast — comparing two commits means comparing two trees, not computing cumulative deltas
  • Corruption is isolated — a corrupted blob affects only that version of that file, not every version after it
  • Branching is trivial — a branch is just a pointer to a commit, which already has a complete snapshot

Note on packfiles: For network transfer and disk efficiency, Git does eventually compress objects into packfiles that use delta compression. But this is a storage optimization, not a conceptual model. Git always thinks in snapshots — the delta packing is purely an implementation detail handled by git gc.


4. Exploring the .git Directory

Every Git repository has a .git directory at its root. This is the repository. Everything outside .git is your working copy — a checkout of one particular commit.

ls -la .git/
.git/
├── HEAD                 ← pointer to the current branch
├── config               ← repository-level configuration
├── description          ← used by GitWeb (rarely relevant)
├── hooks/               ← client-side hook scripts
│   ├── pre-commit.sample
│   ├── commit-msg.sample
│   └── ...
├── info/
│   └── exclude          ← repo-level ignores (like .gitignore but not committed)
├── objects/             ← THE DATABASE — all blobs, trees, commits, tags
│   ├── pack/            ← packed objects (compressed)
│   ├── info/
│   ├── a1/              ← loose objects stored by first 2 chars of hash
│   │   └── b2c3d4e5...
│   └── ...
├── refs/                ← branch and tag pointers
│   ├── heads/           ← local branches (e.g., refs/heads/main)
│   ├── tags/            ← tags (e.g., refs/tags/v1.0.0)
│   └── remotes/         ← remote-tracking branches
│       └── origin/
│           ├── main
│           └── ...
├── index                ← the staging area (binary file)
├── logs/                ← reflog entries
└── COMMIT_EDITMSG       ← last commit message (for convenience)

The Critical Pieces

objects/ — This is the database. Every blob, tree, commit, and annotated tag is stored here. Objects are initially stored "loose" (one file per object, named by SHA-1, in a subdirectory named by the first two hex characters). Over time, git gc packs them into compressed packfiles in objects/pack/.

refs/ — Branch and tag pointers. Each file contains a single SHA-1 hash. For example:

cat .git/refs/heads/main
# 5a32e7b9c1d4f8a2b6e3c0d9f7a1b4e8c2d5f6a3

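That's it. A branch is a text file holding a 40-character hash and a newline. (After git gc or git pack-refs, a ref may instead be stored as a line in .git/packed-refs, so git rev-parse is a more reliable way to read a branch than cat.)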

HEAD — Points to the current branch (or directly to a commit if in "detached HEAD" state):

cat .git/HEAD
# ref: refs/heads/main         ← normal: points to a branch
# 5a32e7b9c1d4f8a2b6e3c0d9f7a1b4e8c2d5f6a3    ← detached HEAD: points to a commit

index — The staging area, stored as a binary file. This is the "snapshot-in-progress" that will become the next commit's tree.
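The HEAD and refs/ files can be read with cat, but plumbing commands resolve them without assuming how they are stored. The sketch below is one way to watch HEAD in both states; it assumes a scratch repository using the default loose-file ref storage, and the commit on an empty tree is only a placeholder.

```shell
repo=$(mktemp -d) && cd "$repo" && git init -q
export GIT_AUTHOR_NAME='Jane Doe' GIT_AUTHOR_EMAIL='jane@example.com'
export GIT_COMMITTER_NAME='Jane Doe' GIT_COMMITTER_EMAIL='jane@example.com'

commit=$(git commit-tree "$(git write-tree)" -m 'Initial commit')

# HEAD is a symbolic ref: it names a branch rather than a commit.
branch_ref=$(git symbolic-ref HEAD)      # e.g. refs/heads/main
git update-ref "$branch_ref" "$commit"   # create that branch at our commit

head_attached=$(cat .git/HEAD)           # "ref: <branch_ref>"
resolved=$(git rev-parse HEAD)           # follows HEAD -> branch -> commit

# Detach HEAD by writing the commit ID into HEAD itself.
git update-ref --no-deref HEAD "$commit"
head_detached=$(cat .git/HEAD)           # now a raw commit ID

printf 'attached: %s\ndetached: %s\n' "$head_attached" "$head_detached"
```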


5. Plumbing Commands

Git has two layers of commands:

  • Porcelain — the high-level, user-friendly commands you use daily (git add, git commit, git log, git push)
  • Plumbing — the low-level commands that manipulate the database directly

Plumbing commands let you see exactly what Git is doing behind the scenes.

git cat-file — Inspect Any Object

# Show the type of an object
git cat-file -t <hash>
# commit, tree, blob, or tag
 
# Show the size of an object
git cat-file -s <hash>
 
# Pretty-print the content of an object
git cat-file -p <hash>

Examples:

# Inspect a commit
$ git cat-file -p a1b2c3d
tree 8a3b2c1e4f5a6b7c8d9e0f1a2b3c4d5e6f7a8b9c
parent e4f5a6b7c8d9e0f1a2b3c4d5e6f7a8b9c0d1e2f3
author Jane Doe <jane@example.com> 1705334400 +0000
committer Jane Doe <jane@example.com> 1705334400 +0000
 
Add user authentication
 
# Inspect a tree
$ git cat-file -p 8a3b2c1
100644 blob d1e2f3a4b5c6d7e8f9a0b1c2d3e4f5a6b7c8d9e0    README.md
100644 blob a0b1c2d3e4f5a6b7c8d9e0f1a2b3c4d5e6f7a8b9    app.py
040000 tree c4d5e6f7a8b9c0d1e2f3a4b5c6d7e8f9a0b1c2d3    src
 
# Inspect a blob
$ git cat-file -p d1e2f3a
Hello, World!
This is the README.
 
# Check the type
$ git cat-file -t a1b2c3d
commit
$ git cat-file -t 8a3b2c1
tree
$ git cat-file -t d1e2f3a
blob

git ls-tree — List a Tree's Contents

git ls-tree <tree-or-commit-hash>

When given a commit hash, Git automatically looks at the commit's root tree:

$ git ls-tree HEAD
100644 blob d1e2f3a4...   README.md
100644 blob a0b1c2d3...   app.py
040000 tree c4d5e6f7...   src
 
# Recursive listing (show files in subdirectories too)
$ git ls-tree -r HEAD
100644 blob d1e2f3a4...   README.md
100644 blob a0b1c2d3...   app.py
100644 blob b2c3d4e5...   src/main.py
100644 blob e5f6a7b8...   src/utils.py

git hash-object — Compute (and Optionally Store) a Hash

# Compute the hash of content read from stdin, without storing it
echo "Hello, World!" | git hash-object --stdin
# 8ab686eafeb1f44702738c8b0f24f2567c36da6d
 
# Compute and store it in the database
echo "Hello, World!" | git hash-object --stdin -w
# 8ab686eafeb1f44702738c8b0f24f2567c36da6d

The -w flag writes the blob to .git/objects/. Without it, Git only computes the hash.

git write-tree — Snapshot the Staging Area as a Tree

git write-tree
# Returns the hash of the newly created tree object

This reads the current staging area (index) and creates a tree object from it.

git commit-tree — Create a Commit from a Tree

echo "My commit message" | git commit-tree <tree-hash> -p <parent-hash>
# Returns the hash of the newly created commit object

This creates a commit pointing to the given tree, with the given parent.

git update-ref — Move a Branch Pointer

git update-ref refs/heads/main <commit-hash>

This moves the main branch to point to a different commit. This is what happens behind the scenes when you commit, merge, or reset.


6. How Git Stores Objects on Disk

When Git stores an object, it:

  1. Prepends a header: "<type> <size>\0" (e.g., "blob 14\0")
  2. Concatenates the header with the content
  3. Computes the SHA-1 hash of the combined result
  4. Compresses the result with zlib
  5. Stores it at .git/objects/<first-2-chars>/<remaining-38-chars>

For example, a blob with hash 8ab686eafeb1f44702738c8b0f24f2567c36da6d is stored at:

.git/objects/8a/b686eafeb1f44702738c8b0f24f2567c36da6d

This two-level directory structure avoids having too many files in a single directory (which is slow on some filesystems).
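The five steps can be reproduced by hand for the "Hello, World!" blob. This sketch assumes GNU coreutils' sha1sum is installed (on macOS, shasum is the usual substitute) and runs in a scratch repository; the variable names are illustrative.

```shell
repo=$(mktemp -d) && cd "$repo" && git init -q

# Steps 1-3 by hand: header "blob 14", a NUL byte, then the 14 bytes of content.
manual=$(printf 'blob 14\0Hello, World!\n' | sha1sum | cut -d' ' -f1)

# Let Git do all five steps and store the object.
stored=$(printf 'Hello, World!\n' | git hash-object --stdin -w)

# Step 5: first two hex characters name the directory, the rest the file.
dir=$(printf '%s' "$stored" | cut -c1-2)
file=$(printf '%s' "$stored" | cut -c3-)
path=".git/objects/$dir/$file"

printf 'manual: %s\nstored: %s\npath:   %s\n' "$manual" "$stored" "$path"
```

The two hashes agree, which confirms that the object ID covers the header as well as the content. The file at that path is zlib-compressed, so read it with git cat-file -p rather than cat.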

Packfiles

Over time, loose objects accumulate. When you run git gc (garbage collection), or when Git triggers it automatically, objects are packed into packfiles:

.git/objects/pack/
├── pack-abc123.idx    ← index: maps hashes to offsets in the pack
└── pack-abc123.pack   ← data: all objects, delta-compressed

Inside a packfile, Git does use delta compression — storing only the differences between similar objects. But this is purely a storage and transfer optimization. Conceptually, Git always treats every object as a complete snapshot.

# Trigger garbage collection manually
git gc
 
# See pack statistics
git count-objects -v
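To watch loose objects move into a packfile, compare the counters before and after a manual git gc. This is a sketch in a scratch repository; loose_count and packed_count are hypothetical helpers that pick fields out of git count-objects -v.

```shell
repo=$(mktemp -d) && cd "$repo" && git init -q
export GIT_AUTHOR_NAME='Jane Doe' GIT_AUTHOR_EMAIL='jane@example.com'
export GIT_COMMITTER_NAME='Jane Doe' GIT_COMMITTER_EMAIL='jane@example.com'

# One commit of one file creates three loose objects: blob, tree, commit.
printf 'one\n' > a.txt
git add a.txt
git commit -q -m 'Add a.txt'

# Hypothetical helpers that extract a single counter.
loose_count()  { git count-objects -v | awk '/^count:/ {print $2}'; }
packed_count() { git count-objects -v | awk '/^in-pack:/ {print $2}'; }

loose_before=$(loose_count)
git gc -q
loose_after=$(loose_count)
packed_after=$(packed_count)

printf 'loose before gc: %s\nloose after gc:  %s\nin packs:        %s\n' \
  "$loose_before" "$loose_after" "$packed_after"
```

All three objects are reachable, so they end up in the pack and the loose count drops to zero.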

7. SHA-1 and the Move to SHA-256

Git has used SHA-1 since its creation in 2005. SHA-1 was still widely trusted at the time, but in 2017 researchers at CWI Amsterdam and Google demonstrated a practical collision, the "SHAttered" attack: two different inputs that produce the same hash.

Is This a Problem for Git?

In practice, not yet. The SHAttered attack required enormous computational resources, and Git has added mitigations to detect the specific attack pattern. However, the writing is on the wall — SHA-1's long-term security is compromised.

The SHA-256 Transition

Git has been adding SHA-256 support since Git 2.29 (2020). You can already create SHA-256 repositories:

git init --object-format=sha256

SHA-256 hashes are 64 characters long instead of 40. The transition is gradual: most repositories, and hosting platforms such as GitHub, still use SHA-1. The Git project plans to support both formats and provide migration tools.

For now, SHA-1 is the default and works fine. Just know that the transition is coming.
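If your Git is version 2.29 or newer, you can see the longer IDs for yourself. The sketch below assumes such a version and uses a scratch repository; nothing in it affects existing SHA-1 repositories.

```shell
repo=$(mktemp -d) && cd "$repo"

# Requires Git 2.29+; older versions reject --object-format.
git init -q --object-format=sha256

# The same content now gets a 64-character object ID.
id=$(printf 'Hello, World!\n' | git hash-object --stdin)

printf '%s (%s characters)\n' "$id" "${#id}"
```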


8. Immutability and Persistent Data Structures

Every object in Git's database is immutable — once written, it is never modified. New commits don't overwrite old ones; they create new objects that point back to the old ones.

When immutable data structures share parts of their structure, they're called persistent data structures. In Git:

Commit 1:                    Commit 2:
tree ─► README.md (blob A)   tree ─► README.md (blob A)    ← SAME blob, shared
        app.py    (blob B)           app.py    (blob C)    ← different blob, new version
        src/      (tree X)           src/      (tree X)    ← SAME tree, shared

Commit 2 only creates new objects for what changed — app.py gets a new blob, and the root tree gets a new tree object (because it has a different entry for app.py). Everything else is shared by reference.

This is why Git is so space-efficient despite storing "complete snapshots." Most of each snapshot is shared with adjacent commits.
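The sharing can be measured. The following sketch recreates the two-commit layout from the diagram in a scratch repository, using porcelain commands for brevity; the file contents are placeholders.

```shell
repo=$(mktemp -d) && cd "$repo" && git init -q
export GIT_AUTHOR_NAME='Jane Doe' GIT_AUTHOR_EMAIL='jane@example.com'
export GIT_COMMITTER_NAME='Jane Doe' GIT_COMMITTER_EMAIL='jane@example.com'

# Commit 1: three files, one of them in a subdirectory.
mkdir src
printf 'readme\n' > README.md
printf 'v1\n'     > app.py
printf 'util\n'   > src/utils.py
git add . && git commit -q -m 'Commit 1'
objects_before=$(find .git/objects -type f | wc -l)

# Commit 2: only app.py changes.
printf 'v2\n' > app.py
git add app.py && git commit -q -m 'Commit 2'
objects_after=$(find .git/objects -type f | wc -l)

# <commit>:<path> resolves to the blob or tree stored at that path.
src_old=$(git rev-parse 'HEAD~1:src');       src_new=$(git rev-parse 'HEAD:src')
readme_old=$(git rev-parse 'HEAD~1:README.md'); readme_new=$(git rev-parse 'HEAD:README.md')
app_old=$(git rev-parse 'HEAD~1:app.py');    app_new=$(git rev-parse 'HEAD:app.py')

printf 'objects: %s -> %s\n' "$objects_before" "$objects_after"
```

The second commit adds exactly three objects (a blob for app.py, a new root tree, and the commit itself), while the src tree and the README.md blob keep their IDs.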


Command Reference

Command                                      Description
git cat-file -t <hash>                       Show the type of an object (blob, tree, commit, tag)
git cat-file -s <hash>                       Show the size of an object in bytes
git cat-file -p <hash>                       Pretty-print the content of an object
git ls-tree <hash>                           List the contents of a tree object
git ls-tree -r <hash>                        Recursively list all files in a tree
git hash-object --stdin                      Compute the SHA-1 hash of stdin
git hash-object --stdin -w                   Compute the hash and write blob to database
git hash-object <file>                       Compute the hash of a file
git write-tree                               Create a tree object from the current staging area
git commit-tree <tree> -p <parent>           Create a commit object manually
git update-ref refs/heads/<branch> <hash>    Move a branch pointer to a commit
git show <hash>                              Display an object (commit diff, tree listing, blob content)
git log --oneline                            Show commit history with abbreviated hashes
git count-objects -v                         Show database statistics (loose objects, packfile size)
git gc                                       Run garbage collection (pack loose objects, prune unreachable objects)
git fsck                                     Verify the integrity of the object database
git rev-parse HEAD                           Show the full hash of HEAD

Hands-On Lab: Build a Commit from Scratch

This lab uses only plumbing commands to construct a commit. No git add, no git commit — just raw database operations. By the end, you'll understand exactly what happens when you create a commit.

Setup

mkdir ~/git-internals-lab
cd ~/git-internals-lab
git init -b main    # -b needs Git 2.28+; the later steps assume the branch is named main

Checkpoint:

ls -la .git/

You should see the .git directory with objects/, refs/, HEAD, etc.

find .git/objects -type f

No output — the object database is empty.

Step 1: Create a Blob Manually

Write a file's content directly into the database:

echo "Hello from the Git internals lab!" | git hash-object --stdin -w

You'll get a hash like af5626b4a114abcb82d63db7c8082c3c4756e51b. Save this — it's your blob.

Checkpoint:

find .git/objects -type f

You should see one file, e.g., .git/objects/af/5626b4a114abcb82d63db7c8082c3c4756e51b.

git cat-file -t af5626b
# blob
 
git cat-file -p af5626b
# Hello from the Git internals lab!

Step 2: Create a Second Blob

echo "print('Hello, Git!')" | git hash-object --stdin -w

Save this hash too (e.g., b3d1f2e...).

Checkpoint:

git cat-file -t <second-hash>
# blob
 
find .git/objects -type f | wc -l
# 2

Step 3: Build a Tree Manually

A tree maps filenames to blobs. We need to stage our blobs with filenames using git update-index, then write the tree:

# Stage the first blob as "README.md"
git update-index --add --cacheinfo 100644,<first-blob-hash>,README.md
 
# Stage the second blob as "hello.py"
git update-index --add --cacheinfo 100644,<second-blob-hash>,hello.py

Now create the tree object:

git write-tree

This returns a tree hash (e.g., c3d4e5f...).

Checkpoint:

git cat-file -t <tree-hash>
# tree
 
git ls-tree <tree-hash>
# 100644 blob af5626b...   README.md
# 100644 blob b3d1f2e...   hello.py

Step 4: Create a Commit Manually

echo "My first hand-crafted commit" | git commit-tree <tree-hash>

This returns a commit hash (e.g., d4e5f6a...). Since we didn't specify -p, this commit has no parent — it's a root commit.

Checkpoint:

git cat-file -t <commit-hash>
# commit
 
git cat-file -p <commit-hash>
# tree c3d4e5f...
# author Your Name <you@example.com> 1705334400 +0000
# committer Your Name <you@example.com> 1705334400 +0000
#
# My first hand-crafted commit

Step 5: Point a Branch to Your Commit

Right now, no branch points to your commit, so nothing in the repository can reach it by name. Let's fix that:

git update-ref refs/heads/main <commit-hash>

Checkpoint:

cat .git/refs/heads/main
# d4e5f6a...  (your commit hash)
 
git log --oneline
# d4e5f6a My first hand-crafted commit

You've built a fully valid commit using nothing but plumbing commands.

Step 6: Verify It's Real

git status

Git now recognizes your manually created tree and will show your working directory state relative to it. The files don't actually exist in the working directory yet, so Git will show them as "deleted."

Let's check them out:

git checkout -- .
ls
# README.md  hello.py
 
cat README.md
# Hello from the Git internals lab!
 
cat hello.py
# print('Hello, Git!')

Step 7: Create a Second Commit with a Parent

Make a change:

echo "Updated README content" | git hash-object --stdin -w
# Returns a new blob hash, e.g., f6a7b8c...

Update the index and create a new tree:

git update-index --cacheinfo 100644,<new-blob-hash>,README.md
git write-tree
# Returns a new tree hash, e.g., a8b9c0d...

Create a commit with the first commit as parent:

echo "Update README" | git commit-tree <new-tree-hash> -p <first-commit-hash>
# Returns a new commit hash, e.g., e9f0a1b...

Update the branch:

git update-ref refs/heads/main <new-commit-hash>

Checkpoint:

git log --oneline
# e9f0a1b Update README
# d4e5f6a My first hand-crafted commit
 
git cat-file -p e9f0a1b
# tree a8b9c0d...
# parent d4e5f6a...                    ← linked to the first commit!
# ...

You've built a two-commit history from scratch using only plumbing commands.
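Steps 1 through 7 can be collected into a single script, which is handy for re-running the lab from a clean slate. This is a sketch, not a replacement for working through the steps: it uses a temporary directory instead of ~/git-internals-lab, sets a placeholder identity through environment variables, skips the Step 6 checkout, and the variable names are illustrative.

```shell
repo=$(mktemp -d) && cd "$repo" && git init -q
export GIT_AUTHOR_NAME='Jane Doe' GIT_AUTHOR_EMAIL='jane@example.com'
export GIT_COMMITTER_NAME='Jane Doe' GIT_COMMITTER_EMAIL='jane@example.com'

# Steps 1-2: write two blobs.
blob1=$(printf 'Hello from the Git internals lab!\n' | git hash-object --stdin -w)
blob2=$(printf "print('Hello, Git!')\n" | git hash-object --stdin -w)

# Step 3: stage the blobs under filenames, then snapshot the index as a tree.
git update-index --add --cacheinfo "100644,$blob1,README.md"
git update-index --add --cacheinfo "100644,$blob2,hello.py"
tree1=$(git write-tree)

# Step 4: root commit (no -p, so no parent).
commit1=$(echo 'My first hand-crafted commit' | git commit-tree "$tree1")

# Step 7: new README blob, new tree, second commit with the first as parent.
blob3=$(printf 'Updated README content\n' | git hash-object --stdin -w)
git update-index --cacheinfo "100644,$blob3,README.md"
tree2=$(git write-tree)
commit2=$(echo 'Update README' | git commit-tree "$tree2" -p "$commit1")

# Step 5: make HEAD name main, then point main at the newest commit.
git symbolic-ref HEAD refs/heads/main
git update-ref refs/heads/main "$commit2"

git log --oneline
```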

Step 8: Explore Object Reuse

# hello.py hasn't changed — is the blob shared?
git ls-tree HEAD
git ls-tree HEAD~1
 
# Compare the blob hashes for hello.py — they should be identical

This demonstrates persistent data structures in action: unchanged files share the same blob across commits.

Challenge

  1. Create a subdirectory tree: add a src/ directory with a main.py file using plumbing commands. Hint: you'll need to create a tree for src/ first, then include it as an entry in the root tree.

  2. Create a third commit that includes the src/ directory. Verify the directory structure with git ls-tree -r HEAD.

  3. Run git fsck to verify the integrity of your hand-built repository.

Cleanup

cd ~
rm -rf ~/git-internals-lab

Common Pitfalls & Troubleshooting

Pitfall: "Git tracks files, not directories"
Explanation: This one is true, and it surprises people. The index records only file paths, so a directory with no files never appears in any tree. Empty directories are invisible to Git. The convention is to place a .gitkeep or .keep file inside empty directories you want to track.

Pitfall: Confusing the hash with the content
Explanation: The hash identifies content; it is not the content itself. Two different hashes always mean different content. The same hash means the same content, barring a deliberately engineered collision (see section 7).

Pitfall: "My commit has a different hash than yours"
Explanation: Commit hashes cover the author, committer, timestamps, message, and parent, so two people committing identical file changes will almost certainly get different commit hashes. Their blob and tree hashes, which depend only on file contents, names, and modes, will match.

Pitfall: Thinking git gc deletes your data
Explanation: Garbage collection packs loose objects and removes unreachable objects (ones no branch, tag, or reflog points to). It doesn't delete reachable data. Unreachable objects are kept for at least 2 weeks by default (configurable via gc.pruneExpire).

Pitfall: Editing files in .git/ directly
Explanation: Don't. Use plumbing commands instead. Manually editing object files will corrupt them (they're zlib-compressed and have headers). Manually editing refs/heads/main is technically safe but git update-ref is safer.

Pro Tips

  1. Use git cat-file -p to debug anything. When something seems wrong with a commit, tree, or branch, inspect the raw objects. Understanding the database gives you X-ray vision into every Git problem.

  2. git fsck is your integrity checker. Run it if you suspect corruption (e.g., after a disk failure or interrupted operation). It walks the entire object graph and reports dangling, missing, or corrupt objects.

  3. Everything is a hash. Branch names, tag names, HEAD, HEAD~3, main^2 — these are all just ways to resolve to a commit hash. Run git rev-parse <ref> to see the hash that any reference resolves to:

    git rev-parse HEAD
    git rev-parse main
    git rev-parse v1.0.0

  4. The staging area is a tree draft. The index file (.git/index) is a flattened tree structure: a draft of the tree that will be written by your next commit. git add updates the index. git write-tree converts it to a tree object.

  5. Blobs are content-addressed, not path-addressed. If you rename a file without changing its content, Git doesn't store a new blob. The new tree just maps a different filename to the same blob hash. Trees never record a rename as such; Git infers an exact rename at diff time by noticing that the same blob disappeared from one path and appeared at another.

  6. git count-objects -v reveals database health.

    git count-objects -v
    # count: 15          ← loose objects
    # size: 60           ← size in KB
    # in-pack: 1234      ← objects in packfiles
    # size-pack: 450     ← packfile size in KB
    # prune-packable: 0  ← loose objects already in packs (safe to prune)
    # garbage: 0         ← corrupt files
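Tip 2 can be tried safely in a throwaway repository by damaging an object on purpose. The sketch below overwrites a loose object file with junk and lets git fsck find it; the blob content and variable names are made up for the demonstration, and this should never be done in a repository you care about.

```shell
repo=$(mktemp -d) && cd "$repo" && git init -q

# Store one blob and work out where its loose object file lives.
id=$(printf 'precious data\n' | git hash-object --stdin -w)
path=".git/objects/$(printf '%s' "$id" | cut -c1-2)/$(printf '%s' "$id" | cut -c3-)"

# Loose objects are written read-only; make the file writable, then clobber it.
chmod u+w "$path"
printf 'garbage' > "$path"

# fsck recomputes and checks every object, so the damage is reported.
if git fsck >/dev/null 2>&1; then
  verdict='clean'
else
  verdict='corruption detected'
fi

printf '%s\n' "$verdict"
```

Run git fsck without the redirection to read the actual error messages, which name the damaged object.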

Quiz / Self-Assessment

1. What are the four types of objects Git stores in its database?

Answer
Blobs (file contents), trees (directory listings), commits (snapshot metadata + pointer to root tree), and tags (annotated tag metadata + pointer to commit).

2. How does Git identify objects in its database?

Answer
By the SHA-1 cryptographic hash of their content (including a type+size header). The hash is a 40-character hexadecimal string that serves as the object's unique key.

3. Does Git store deltas (differences) or snapshots?

Answer
Conceptually, Git stores snapshots. Every commit points to a tree that represents the complete state of the project. Unchanged files are not duplicated — the tree points to the existing blob. Internally, Git uses delta compression in packfiles for storage efficiency, but this is transparent to the user.

4. What does a tree object contain?

Answer
A list of entries, each mapping a filename and file mode (permissions) to either a blob (for files) or another tree (for subdirectories).

5. Where does Git physically store a blob with hash af5626b4a114abcb82d63db7c8082c3c4756e51b?

Answer
At .git/objects/af/5626b4a114abcb82d63db7c8082c3c4756e51b. The first 2 characters of the hash form the subdirectory name, the remaining 38 characters form the filename. The file is zlib-compressed.

6. What is the difference between a blob and a tree?

Answer
A blob stores raw file content with no knowledge of its filename or location. A tree stores a directory listing — it maps filenames and permissions to blobs (files) or other trees (subdirectories).

7. What does git cat-file -t <hash> do?

Answer
It prints the type of the object — one of blob, tree, commit, or tag.

8. Why can Git detect data corruption?

Answer
Every object is stored under a key that is the SHA-1 hash of its content. If the stored data is modified (corrupted), recalculating the hash will produce a different value than the key, revealing the corruption. git fsck performs this check across the entire database.

9. What is the purpose of git gc (garbage collection)?

Answer
It packs loose objects into packfiles (delta-compressed for storage efficiency), and removes unreachable objects that are older than the prune expiry (default: 2 weeks). This saves disk space and improves performance.

10. If you rename a file without changing its content, does Git create a new blob?

Answer
No. Blobs are content-addressed — the same content always produces the same hash. Renaming a file only changes the tree object (which maps filenames to blobs). The blob itself is reused. This is how Git detects renames: the same blob hash appears under a different path.