Learning Objectives
By the end of this module, you will be able to:
- Explain how Git functions as a content-addressable filesystem and database
- Describe the four object types (blobs, trees, commits, tags) and how they relate to each other
- Use plumbing commands to inspect and create objects in Git's database directly
- Navigate the .git directory and understand the purpose of each subdirectory
- Build a commit from scratch using only low-level plumbing commands
1. Git Is a Database
Most people think of Git as a version control tool — something that tracks changes to files. That's true at a high level, but underneath, Git is better understood as a content-addressable database.
What does that mean?
- You give Git some content (a file, a directory listing, a commit)
- Git computes a cryptographic hash of that content
- Git stores the content in its database, using the hash as the key
- To retrieve the content later, you look it up by its hash
Every object in Git's database is identified by a SHA-1 hash — a 40-character hexadecimal string like 5a32e7b9c1d4f8a2b6e3c0d9f7a1b4e8c2d5f6a3. The hash is computed from the content itself, which gives Git three powerful properties:
- Identical content always produces the same hash — if two files have the same content, they're stored once
- Any change produces a completely different hash — even changing one byte changes the hash entirely
- Corruption is detectable — if the stored data doesn't match its hash, something is wrong
Analogy: Think of a library where every book's shelf location is determined by a fingerprint of its contents. You don't need a catalog — the content itself tells you where to find it. And if someone tampers with a book, the fingerprint won't match the shelf label.
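To make the idea concrete, here is a minimal Python sketch of a content-addressable store. It is a toy, not Git's implementation: the ContentStore class and its method names are invented for illustration, everything lives in an in-memory dict, and the raw bytes are hashed directly, whereas Git hashes a short header plus the content (see section 6).

```python
import hashlib


class ContentStore:
    """Toy content-addressable store: the key is the SHA-1 of the value."""

    def __init__(self):
        self._objects = {}

    def put(self, content: bytes) -> str:
        key = hashlib.sha1(content).hexdigest()
        self._objects[key] = content  # storing identical content again changes nothing
        return key

    def get(self, key: str) -> bytes:
        content = self._objects[key]
        # Re-hash on read: a mismatch means the stored bytes were altered.
        if hashlib.sha1(content).hexdigest() != key:
            raise ValueError(f"object {key} is corrupt")
        return content

    def __len__(self) -> int:
        return len(self._objects)


store = ContentStore()
k1 = store.put(b"hello\n")
k2 = store.put(b"hello\n")  # same content, same key
k3 = store.put(b"hellp\n")  # one byte changed, unrelated key
print(k1 == k2, k1 == k3, len(store))  # True False 2
```

The three properties from the list above show up directly: the duplicate is stored once, the one-byte change gets its own key, and get() refuses to return bytes that no longer match their key.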
Abbreviating Hashes
You rarely need the full 40 characters. Git accepts abbreviated hashes as long as they're unambiguous in your repository:
git show 5a32e7b9c1d4f8a2b6e3c0d9f7a1b4e8c2d5f6a3 # full hash
git show 5a32e7b # abbreviated — usually enough
git show 5a32 # minimum 4 characters
The --oneline flag on git log shows abbreviated hashes by default:
git log --oneline
# a1b2c3d Add user authentication
# e4f5a6b Fix login page layout
# 7c8d9e0 Initial commit
2. The Four Object Types
Git stores exactly four types of objects in its database. Everything Git does is built on top of these four primitives.
Object Hierarchy
┌──────────┐
│ Tag │──────► annotated tag metadata + pointer to commit
└──────────┘
│
▼
┌──────────┐
│ Commit │──────► author, date, message + pointer to tree
└──────────┘
│
▼
┌──────────┐ ┌──────────┐
│ Tree │──────► │ Tree │ (subdirectory)
└──────────┘ └──────────┘
│ │
▼ ▼
┌──────────┐ ┌──────────┐
│ Blob │ │ Blob │ (files)
└──────────┘ └──────────┘
Blobs — File Contents
A blob (binary large object) stores the contents of a single file. Not the filename — just the raw content.
Key properties:
- Two files with identical content (even with different names) share the same blob
- A blob has no knowledge of its own filename, permissions, or location
- Blobs are the "leaves" of Git's tree structure
Trees — Directory Listings
A tree represents a directory. It contains a list of entries, where each entry maps a name and file mode to either a blob (file) or another tree (subdirectory).
tree 8a3b2c1
├── 100644 blob a1b2c3d README.md
├── 100644 blob e4f5a6b app.py
├── 100755 blob 7c8d9e0 run.sh
└── 040000 tree f1e2d3c src/
The file modes:
- 100644 — regular file
- 100755 — executable file
- 040000 — subdirectory (tree)
- 120000 — symbolic link
- 160000 — submodule (gitlink)
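As a rough illustration of how a tree ties names, modes, and object IDs together, the sketch below serializes a list of entries in the layout Git is understood to use for trees (mode, space, name, NUL byte, then the 20 raw bytes of the SHA-1) and hashes the result. The function name tree_object_id is hypothetical, and the sketch assumes a SHA-1 repository and UTF-8 file names. One detail worth knowing: the stored mode of a subdirectory has no leading zero (40000), even though listings display it as 040000.

```python
import hashlib


def tree_object_id(entries):
    """Return the SHA-1 object ID of a tree.

    entries: iterable of (mode, name, hex_object_id) tuples,
    e.g. ("100644", "README.md", "<40 hex characters>").
    """
    def sort_key(entry):
        mode, name, _ = entry
        raw = name.encode()
        # Git orders subdirectories as if their name ended in "/".
        return raw + b"/" if mode.lstrip("0") == "40000" else raw

    body = b""
    for mode, name, hex_id in sorted(entries, key=sort_key):
        body += mode.lstrip("0").encode() + b" " + name.encode() + b"\0"
        body += bytes.fromhex(hex_id)  # 20 raw bytes, not 40 hex characters
    header = b"tree " + str(len(body)).encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()


# A tree with no entries hashes to Git's well-known "empty tree" ID.
print(tree_object_id([]))  # 4b825dc642cb6eb9a060e54bf8d69288fbee4904
```

Because entries are sorted before hashing, the same directory contents always produce the same tree ID, regardless of the order in which files were added.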
Commits — Snapshots with Metadata
A commit is a pointer to a tree (the project's root directory at that point in time) plus metadata:
commit 5a32e7b
tree 8a3b2c1 ← root tree (snapshot of project)
parent e4f5a6b ← previous commit (missing for first commit)
author Jane Doe <jane@example.com> ← who wrote the change
1705334400 +0000 ← when
committer Jane Doe <jane@example.com> ← who created the commit object
1705334400 +0000 ← when
Add user authentication ← commit message
Important details:
- parent — a reference to the previous commit. The first commit has no parent. A merge commit has two (or more) parents.
- author vs. committer — usually the same person. They differ when someone applies another person's patch (e.g., git am, git cherry-pick).
- The commit points to a tree, not to individual files. This is how Git stores snapshots, not deltas.
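The following Python sketch assembles the text of a commit object from the fields described above and hashes it. It is an approximation for illustration: commit_object_id is a made-up helper, it assumes a SHA-1 repository, and it ignores optional headers such as signatures or explicit encodings.

```python
import hashlib


def commit_object_id(tree, parents, author, committer, message):
    """Hash a commit built from its fields.

    author and committer are 'Name <email> <unix-seconds> <tz-offset>' strings;
    parents is a list of zero or more commit IDs.
    """
    lines = [f"tree {tree}"]
    lines += [f"parent {p}" for p in parents]
    lines += [f"author {author}", f"committer {committer}", "", message]
    body = ("\n".join(lines) + "\n").encode()
    header = b"commit " + str(len(body)).encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()


tree = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"  # the empty tree
jane_t0 = "Jane Doe <jane@example.com> 1705334400 +0000"
jane_t1 = "Jane Doe <jane@example.com> 1705334401 +0000"

first = commit_object_id(tree, [], jane_t0, jane_t0, "Add user authentication")
later = commit_object_id(tree, [], jane_t1, jane_t1, "Add user authentication")
print(first != later)  # True: same tree and message, one second apart
```

Since the timestamps and identities are part of the hashed bytes, two commits of the same tree only share an ID if every field matches.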
Tags — Named References to Commits
There are two kinds:
- Lightweight tags — just a named pointer to a commit (not stored as an object)
- Annotated tags — a full object with a tagger name, date, message, and a pointer to a commit
tag v1.0.0
object 5a32e7b ← the commit being tagged
type commit
tagger Jane Doe <jane@example.com>
1705334400 +0000
Release version 1.0.0 ← tag message
Annotated tags are preferred for releases because they carry metadata and can be signed with GPG/SSH.
3. Snapshots, Not Deltas
This is a fundamental design difference between Git and most earlier VCS tools.
Delta-Based Storage (SVN, CVS)
File A: v1 ──Δ1──► v2 ──Δ2──► v3 ──Δ3──► v4
File B: v1 ──Δ1──► v2 ──Δ2──► v3
File C: v1 ──────────────────────────Δ1──► v2
To reconstruct File A at version 4, the system starts with v1 and applies three deltas. This is slow for distant versions and fragile if any delta is corrupted.
Snapshot-Based Storage (Git)
Commit 1: [A₁] [B₁] [C₁]
Commit 2: [A₂] [B₁] [C₁] ← B and C unchanged: pointers to same blobs
Commit 3: [A₂] [B₂] [C₁] ← A unchanged: pointer to same blob as Commit 2
Commit 4: [A₃] [B₂] [C₂]
Every commit records the complete state of the entire project. When a file hasn't changed, Git doesn't store a duplicate; the new tree simply points to the same blob as before. Immutable objects that share structure in this way form what is known as a persistent data structure (see section 8).
Why Snapshots Win
- Checkout is fast — switching to any commit means reading one tree, not replaying hundreds of deltas
- Comparison is fast — comparing two commits means comparing two trees, not computing cumulative deltas
- Corruption is isolated — a corrupted blob affects only that version of that file, not every version after it
- Branching is trivial — a branch is just a pointer to a commit, which already has a complete snapshot
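The "comparison is fast" point can be sketched in a few lines of Python. Here each snapshot is modeled as a flat dict from path to blob ID, using the labels from the diagram above (A2, B2, and so on) in place of real hashes; diff_snapshots is a hypothetical helper, not a Git API.

```python
def diff_snapshots(old, new):
    """Compare two snapshots given as {path: blob_id} dicts."""
    added = sorted(p for p in new if p not in old)
    removed = sorted(p for p in old if p not in new)
    # Equal IDs mean equal content, so no file bytes need to be read.
    modified = sorted(p for p in new if p in old and old[p] != new[p])
    return added, removed, modified


commit3 = {"A": "A2", "B": "B2", "C": "C1"}
commit4 = {"A": "A3", "B": "B2", "C": "C2"}
print(diff_snapshots(commit3, commit4))  # ([], [], ['A', 'C'])
```

Git works on nested tree objects rather than a flat dict, which lets it skip an entire subdirectory whenever both sides have the same tree hash.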
Note on packfiles: For network transfer and disk efficiency, Git does eventually compress objects into packfiles that use delta compression. But this is a storage optimization, not a conceptual model. Git always thinks in snapshots — the delta packing is purely an implementation detail handled by git gc.
4. Exploring the .git Directory
Every Git repository has a .git directory at its root. This is the repository. Everything outside .git is your working copy — a checkout of one particular commit.
ls -la .git/
.git/
├── HEAD ← pointer to the current branch
├── config ← repository-level configuration
├── description ← used by GitWeb (rarely relevant)
├── hooks/ ← client-side hook scripts
│ ├── pre-commit.sample
│ ├── commit-msg.sample
│ └── ...
├── info/
│ └── exclude ← repo-level ignores (like .gitignore but not committed)
├── objects/ ← THE DATABASE — all blobs, trees, commits, tags
│ ├── pack/ ← packed objects (compressed)
│ ├── info/
│ ├── a1/ ← loose objects stored by first 2 chars of hash
│ │ └── b2c3d4e5...
│ └── ...
├── refs/ ← branch and tag pointers
│ ├── heads/ ← local branches (e.g., refs/heads/main)
│ ├── tags/ ← tags (e.g., refs/tags/v1.0.0)
│ └── remotes/ ← remote-tracking branches
│ └── origin/
│ ├── main
│ └── ...
├── index ← the staging area (binary file)
├── logs/ ← reflog entries
└── COMMIT_EDITMSG ← last commit message (for convenience)
The Critical Pieces
objects/ — This is the database. Every blob, tree, commit, and annotated tag is stored here. Objects are initially stored "loose" (one file per object, named by SHA-1, in a subdirectory named by the first two hex characters). Over time, git gc packs them into compressed packfiles in objects/pack/.
refs/ — Branch and tag pointers. Each file contains a single SHA-1 hash. For example:
cat .git/refs/heads/main
# 5a32e7b9c1d4f8a2b6e3c0d9f7a1b4e8c2d5f6a3
That's it. A branch is a 40-character text file.
HEAD — Points to the current branch (or directly to a commit if in "detached HEAD" state):
cat .git/HEAD
# ref: refs/heads/main ← normal: points to a branch
# 5a32e7b9c1d4f8a2b6e3c0d9f7a1b4e8c2d5f6a3 ← detached HEAD: points to a commit
index — The staging area, stored as a binary file. This is the "snapshot-in-progress" that will become the next commit's tree.
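The HEAD and refs/ lookups described above amount to reading one or two small text files. The Python sketch below follows that chain by hand. The function name resolve_head is invented for this example, and it only handles loose ref files; refs that Git has moved into the packed-refs file are reported as None here, so treat it as an illustration rather than a replacement for git rev-parse.

```python
import tempfile
from pathlib import Path


def resolve_head(git_dir):
    """Return (ref_name_or_None, commit_id_or_None) for a .git directory."""
    git_dir = Path(git_dir)
    head = (git_dir / "HEAD").read_text().strip()
    if not head.startswith("ref: "):
        return None, head  # detached HEAD: the file holds a commit ID
    ref = head[len("ref: "):]
    ref_file = git_dir / ref
    if ref_file.exists():
        return ref, ref_file.read_text().strip()
    return ref, None  # branch has no commits yet, or the ref is packed


# Build a throwaway directory that mimics the layout shown above.
with tempfile.TemporaryDirectory() as tmp:
    git_dir = Path(tmp)
    (git_dir / "refs" / "heads").mkdir(parents=True)
    (git_dir / "HEAD").write_text("ref: refs/heads/main\n")
    (git_dir / "refs" / "heads" / "main").write_text(
        "5a32e7b9c1d4f8a2b6e3c0d9f7a1b4e8c2d5f6a3\n"
    )
    print(resolve_head(git_dir))
```

Pointing resolve_head at a real repository's .git directory should give the same commit ID as git rev-parse HEAD, as long as the current branch is stored as a loose ref.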
5. Plumbing Commands
Git has two layers of commands:
- Porcelain — the high-level, user-friendly commands you use daily (git add, git commit, git log, git push)
- Plumbing — the low-level commands that manipulate the database directly
Plumbing commands let you see exactly what Git is doing behind the scenes.
git cat-file — Inspect Any Object
# Show the type of an object
git cat-file -t <hash>
# commit, tree, blob, or tag
# Show the size of an object
git cat-file -s <hash>
# Pretty-print the content of an object
git cat-file -p <hash>
Examples:
# Inspect a commit
$ git cat-file -p a1b2c3d
tree 8a3b2c1e4f5a6b7c8d9e0f1a2b3c4d5e6f7a8b9c
parent e4f5a6b7c8d9e0f1a2b3c4d5e6f7a8b9c0d1e2f3
author Jane Doe <jane@example.com> 1705334400 +0000
committer Jane Doe <jane@example.com> 1705334400 +0000
Add user authentication
# Inspect a tree
$ git cat-file -p 8a3b2c1
100644 blob d1e2f3a4b5c6d7e8f9a0b1c2d3e4f5a6b7c8d9e0 README.md
100644 blob a0b1c2d3e4f5a6b7c8d9e0f1a2b3c4d5e6f7a8b9 app.py
040000 tree c4d5e6f7a8b9c0d1e2f3a4b5c6d7e8f9a0b1c2d3 src
# Inspect a blob
$ git cat-file -p d1e2f3a
# Hello, World!
# This is the README.
# Check the type
$ git cat-file -t a1b2c3d
commit
$ git cat-file -t 8a3b2c1
tree
$ git cat-file -t d1e2f3a
blob
git ls-tree — List a Tree's Contents
git ls-tree <tree-or-commit-hash>
When given a commit hash, Git automatically looks at the commit's root tree:
$ git ls-tree HEAD
100644 blob d1e2f3a4... README.md
100644 blob a0b1c2d3... app.py
040000 tree c4d5e6f7... src
# Recursive listing (show files in subdirectories too)
$ git ls-tree -r HEAD
100644 blob d1e2f3a4... README.md
100644 blob a0b1c2d3... app.py
100644 blob b2c3d4e5... src/main.py
100644 blob e5f6a7b8... src/utils.py
git hash-object — Compute (and Optionally Store) a Hash
# Compute the hash of a file without storing it
echo "Hello, World!" | git hash-object --stdin
# 8ab686eafeb1f44702738c8b0f24f2567c36da6d
# Compute and store it in the database
echo "Hello, World!" | git hash-object --stdin -w
# 8ab686eafeb1f44702738c8b0f24f2567c36da6d
The -w flag writes the blob to .git/objects/. Without it, Git only computes the hash.
git write-tree — Snapshot the Staging Area as a Tree
git write-tree
# Returns the hash of the newly created tree object
This reads the current staging area (index) and creates a tree object from it.
git commit-tree — Create a Commit from a Tree
echo "My commit message" | git commit-tree <tree-hash> -p <parent-hash>
# Returns the hash of the newly created commit object
This creates a commit pointing to the given tree, with the given parent.
git update-ref — Move a Branch Pointer
git update-ref refs/heads/main <commit-hash>
This moves the main branch to point to a different commit. This is what happens behind the scenes when you commit, merge, or reset.
6. How Git Stores Objects on Disk
When Git stores an object, it:
- Prepends a header: "<type> <size>\0" (e.g., "blob 14\0")
- Concatenates the header with the content
- Computes the SHA-1 hash of the combined result
- Compresses the result with zlib
- Stores it at .git/objects/<first-2-chars>/<remaining-38-chars>
For example, a blob with hash 8ab686eafeb1f44702738c8b0f24f2567c36da6d is stored at:
.git/objects/8a/b686eafeb1f44702738c8b0f24f2567c36da6d
This two-level directory structure avoids having too many files in a single directory (which is slow on some filesystems).
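The five steps above translate almost line for line into Python. The sketch below writes and reads loose objects in a scratch directory; the function names are made up for illustration, and it leaves out details a real implementation would need (writing via a temporary file, file permissions, packfiles).

```python
import hashlib
import tempfile
import zlib
from pathlib import Path


def write_loose_object(objects_dir, obj_type, content):
    """Store content as a loose object and return its hex ID."""
    store = f"{obj_type} {len(content)}".encode() + b"\0" + content  # header + content
    object_id = hashlib.sha1(store).hexdigest()
    path = Path(objects_dir) / object_id[:2] / object_id[2:]
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(zlib.compress(store))
    return object_id


def read_loose_object(objects_dir, object_id):
    """Return (type, content), checking the stored bytes against the ID."""
    path = Path(objects_dir) / object_id[:2] / object_id[2:]
    store = zlib.decompress(path.read_bytes())
    if hashlib.sha1(store).hexdigest() != object_id:
        raise ValueError(f"object {object_id} is corrupt")
    header, _, content = store.partition(b"\0")
    obj_type, _size = header.decode().split(" ")
    return obj_type, content


with tempfile.TemporaryDirectory() as objects_dir:
    # Same bytes that `echo "Hello, World!"` produces (note the newline).
    oid = write_loose_object(objects_dir, "blob", b"Hello, World!\n")
    print(oid[:2] + "/" + oid[2:])  # 8a/b686eafeb1f44702738c8b0f24f2567c36da6d
    print(read_loose_object(objects_dir, oid))  # ('blob', b'Hello, World!\n')
```

The printed path matches the example location above, and the check in read_loose_object is the same comparison that makes corruption detectable.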
Packfiles
Over time, loose objects accumulate. When you run git gc (garbage collection), or when Git triggers it automatically, objects are packed into packfiles:
.git/objects/pack/
├── pack-abc123.idx ← index: maps hashes to offsets in the pack
└── pack-abc123.pack ← data: all objects, delta-compressed
Inside a packfile, Git does use delta compression — storing only the differences between similar objects. But this is purely a storage and transfer optimization. Conceptually, Git always treats every object as a complete snapshot.
# Trigger garbage collection manually
git gc
# See pack statistics
git count-objects -v
7. SHA-1 and the Move to SHA-256
Git has used SHA-1 since its creation in 2005. SHA-1 was still widely trusted at the time, but in 2017 researchers at CWI Amsterdam and Google demonstrated a practical collision (two different inputs producing the same hash), known as the "SHAttered" attack.
Is This a Problem for Git?
In practice, not yet. The SHAttered attack required enormous computational resources, and Git has added mitigations to detect the specific attack pattern. However, the writing is on the wall — SHA-1's long-term security is compromised.
The SHA-256 Transition
Git has been adding SHA-256 support since Git 2.29 (2020). You can already create SHA-256 repositories:
git init --object-format=sha256
SHA-256 hashes are 64 characters long instead of 40. The transition is gradual: most repositories, including those hosted on GitHub, still use SHA-1. The Git project plans to support both formats and provide migration tools.
For now, SHA-1 is the default and works fine. Just know that the transition is coming.
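To see the difference in ID length, the Python sketch below hashes the same blob bytes with both functions. It assumes that a SHA-256 repository hashes the same header-plus-content bytes as a SHA-1 repository and only swaps the hash function; object_id is a hypothetical helper name.

```python
import hashlib


def object_id(content: bytes, algorithm: str = "sha1") -> str:
    """Hash a blob's header and content with the chosen algorithm."""
    store = b"blob " + str(len(content)).encode() + b"\0" + content
    return hashlib.new(algorithm, store).hexdigest()


for algorithm in ("sha1", "sha256"):
    oid = object_id(b"Hello, World!\n", algorithm)
    print(algorithm, len(oid), oid)
```

The SHA-1 line should show the 40-character ID used earlier in this module, while the SHA-256 line shows a 64-character ID for the same content. The two IDs are unrelated, which is why a repository uses one object format throughout.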
8. Immutability and Persistent Data Structures
Every object in Git's database is immutable — once written, it is never modified. New commits don't overwrite old ones; they create new objects that point back to the old ones.
When immutable data structures share parts of their structure, they're called persistent data structures. In Git:
Commit 1: Commit 2:
tree ─► README.md (blob A) tree ─► README.md (blob A) ← SAME blob, shared
app.py (blob B) app.py (blob C) ← different blob, new version
src/ (tree X) src/ (tree X) ← SAME tree, shared
Commit 2 only creates new objects for what changed — app.py gets a new blob, and the root tree gets a new tree object (because it has a different entry for app.py). Everything else is shared by reference.
This is why Git is so space-efficient despite storing "complete snapshots." Most of each snapshot is shared with adjacent commits.
Command Reference
| Command | Description |
|---|---|
| git cat-file -t <hash> | Show the type of an object (blob, tree, commit, tag) |
| git cat-file -s <hash> | Show the size of an object in bytes |
| git cat-file -p <hash> | Pretty-print the content of an object |
| git ls-tree <hash> | List the contents of a tree object |
| git ls-tree -r <hash> | Recursively list all files in a tree |
| git hash-object --stdin | Compute the SHA-1 hash of stdin |
| git hash-object --stdin -w | Compute the hash and write blob to database |
| git hash-object <file> | Compute the hash of a file |
| git write-tree | Create a tree object from the current staging area |
| git commit-tree <tree> -p <parent> | Create a commit object manually |
| git update-ref refs/heads/<branch> <hash> | Move a branch pointer to a commit |
| git show <hash> | Display an object (commit diff, tree listing, blob content) |
| git log --oneline | Show commit history with abbreviated hashes |
| git count-objects -v | Show database statistics (loose objects, packfile size) |
| git gc | Run garbage collection (pack loose objects, prune unreachable objects) |
| git fsck | Verify the integrity of the object database |
| git rev-parse HEAD | Show the full hash of HEAD |
Hands-On Lab: Build a Commit from Scratch
This lab uses only plumbing commands to construct a commit. No git add, no git commit — just raw database operations. By the end, you'll understand exactly what happens when you create a commit.
Setup
mkdir ~/git-internals-lab
cd ~/git-internals-lab
git init
Checkpoint:
ls -la .git/
You should see the .git directory with objects/, refs/, HEAD, etc.
find .git/objects -type f
No output — the object database is empty.
Step 1: Create a Blob Manually
Write a file's content directly into the database:
echo "Hello from the Git internals lab!" | git hash-object --stdin -w
You'll get a hash like af5626b4a114abcb82d63db7c8082c3c4756e51b. Save this — it's your blob.
Checkpoint:
find .git/objects -type f
You should see one file, e.g., .git/objects/af/5626b4a114abcb82d63db7c8082c3c4756e51b.
git cat-file -t af5626b
# blob
git cat-file -p af5626b
# Hello from the Git internals lab!
Step 2: Create a Second Blob
echo "print('Hello, Git!')" | git hash-object --stdin -w
Save this hash too (e.g., b3d1f2e...).
Checkpoint:
git cat-file -t <second-hash>
# blob
find .git/objects -type f | wc -l
# 2
Step 3: Build a Tree Manually
A tree maps filenames to blobs. We need to stage our blobs with filenames using git update-index, then write the tree:
# Stage the first blob as "README.md"
git update-index --add --cacheinfo 100644,<first-blob-hash>,README.md
# Stage the second blob as "hello.py"
git update-index --add --cacheinfo 100644,<second-blob-hash>,hello.py
Now create the tree object:
git write-tree
This returns a tree hash (e.g., c3d4e5f...).
Checkpoint:
git cat-file -t <tree-hash>
# tree
git ls-tree <tree-hash>
# 100644 blob af5626b... README.md
# 100644 blob b3d1f2e... hello.py
Step 4: Create a Commit Manually
echo "My first hand-crafted commit" | git commit-tree <tree-hash>
This returns a commit hash (e.g., d4e5f6a...). Since we didn't specify -p, this commit has no parent — it's a root commit.
Checkpoint:
git cat-file -t <commit-hash>
# commit
git cat-file -p <commit-hash>
# tree c3d4e5f...
# author Your Name <you@example.com> 1705334400 +0000
# committer Your Name <you@example.com> 1705334400 +0000
#
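# My first hand-crafted commit
Step 5: Point a Branch to Your Commit
Right now, no branch points to your commit: HEAD names a branch that does not exist yet. The commands below assume that branch is main; if cat .git/HEAD shows refs/heads/master instead, substitute master for main. Let's fix that:
git update-ref refs/heads/main <commit-hash>
Checkpoint: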
cat .git/refs/heads/main
# d4e5f6a... (your commit hash)
git log --oneline
# d4e5f6a My first hand-crafted commit
You've built a fully valid commit using nothing but plumbing commands.
Step 6: Verify It's Real
git status
Git now recognizes your manually created tree and will show your working directory state relative to it. The files don't actually exist in the working directory yet, so Git will show them as "deleted."
Let's check them out:
git checkout -- .
ls
# README.md hello.py
cat README.md
# Hello from the Git internals lab!
cat hello.py
# print('Hello, Git!')
Step 7: Create a Second Commit with a Parent
Make a change:
echo "Updated README content" | git hash-object --stdin -w
# Returns a new blob hash, e.g., f6a7b8c...
Update the index and create a new tree:
git update-index --cacheinfo 100644,<new-blob-hash>,README.md
git write-tree
# Returns a new tree hash, e.g., a8b9c0d...
Create a commit with the first commit as parent:
echo "Update README" | git commit-tree <new-tree-hash> -p <first-commit-hash>
# Returns a new commit hash, e.g., e9f0a1b...
Update the branch:
git update-ref refs/heads/main <new-commit-hash>
Checkpoint:
git log --oneline
# e9f0a1b Update README
# d4e5f6a My first hand-crafted commit
git cat-file -p e9f0a1b
# tree a8b9c0d...
# parent d4e5f6a... ← linked to the first commit!
# ...
You've built a two-commit history from scratch using only plumbing commands.
Step 8: Explore Object Reuse
# hello.py hasn't changed — is the blob shared?
git ls-tree HEAD
git ls-tree HEAD~1
# Compare the blob hashes for hello.py — they should be identical
This demonstrates persistent data structures in action: unchanged files share the same blob across commits.
Challenge
- Create a subdirectory tree: add a src/ directory with a main.py file using plumbing commands. Hint: you'll need to create a tree for src/ first, then include it as an entry in the root tree.
- Create a third commit that includes the src/ directory. Verify the directory structure with git ls-tree -r HEAD.
- Run git fsck to verify the integrity of your hand-built repository.
Cleanup
rm -rf ~/git-internals-lab
Common Pitfalls & Troubleshooting
| Pitfall | Explanation |
|---|---|
| "Git tracks files, not directories" | Git only stores directory structure through tree objects, which require at least one file. Empty directories are invisible to Git. The convention is to place a .gitkeep or .keep file inside empty directories you want to track. |
| Confusing the hash with the content | The hash identifies content; it is not the content itself. Two different hashes always mean different content, and, barring a hash collision, the same hash means the same content. |
| "My commit has a different hash than yours" | Commit hashes cover the author, committer, timestamps, and parent as well as the tree, so two people committing identical file changes will almost always get different commit hashes. Blob and tree hashes, by contrast, depend only on file contents, names, and modes. |
| Thinking git gc deletes your data | Garbage collection packs loose objects and removes unreachable objects (ones no branch, tag, or reflog points to). It doesn't delete reachable data. Unreachable objects are kept for at least 2 weeks by default (configurable via gc.pruneExpire). |
| Editing files in .git/ directly | Don't. Use plumbing commands instead. Manually editing object files will corrupt them (they're zlib-compressed and have headers). Manually editing refs/heads/main is technically safe but git update-ref is safer. |
Pro Tips
- Use git cat-file -p to debug anything. When something seems wrong with a commit, tree, or branch, inspect the raw objects. Understanding the database gives you X-ray vision into every Git problem.
- git fsck is your integrity checker. Run it if you suspect corruption (e.g., after a disk failure or interrupted operation). It walks the entire object graph and reports dangling, missing, or corrupt objects.
- Everything is a hash. Branch names, tag names, HEAD, HEAD~3, main^2 — these are all just ways to resolve to a commit hash. Run git rev-parse <ref> to see the hash that any reference resolves to:
git rev-parse HEAD
git rev-parse main
git rev-parse v1.0.0
- The staging area is a tree draft. The index file (.git/index) is a flattened tree structure — a draft of the tree that will be written by your next commit. git add updates the index. git write-tree converts it to a tree object.
- Blobs are content-addressed, not path-addressed. If you rename a file without changing its content, Git doesn't store a new blob. The new tree just maps a different filename to the same blob hash. This is how Git detects renames — by noticing that a blob disappeared from one path and appeared at another.
- git count-objects -v reveals database health.
git count-objects -v
# count: 15 ← loose objects
# size: 60 ← size in KB
# in-pack: 1234 ← objects in packfiles
# size-pack: 450 ← packfile size in KB
# prune-packable: 0 ← loose objects already in packs (safe to prune)
# garbage: 0 ← corrupt files
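The rename observation in the tips above can be sketched in Python. Snapshots are modeled as flat dicts from path to blob ID, with the abbreviated IDs from earlier examples standing in for real hashes; exact_renames is a hypothetical helper. It only pairs files whose content is byte-for-byte identical, whereas Git can also pair files that were renamed and lightly edited, using a similarity heuristic that is not shown here.

```python
def exact_renames(old, new):
    """Pair paths whose blob moved unchanged between two {path: blob_id} snapshots."""
    removed = {path: blob for path, blob in old.items() if path not in new}
    added = {path: blob for path, blob in new.items() if path not in old}

    # Index the removed paths by blob ID so each added path is one lookup.
    removed_by_blob = {}
    for path, blob in sorted(removed.items()):
        removed_by_blob.setdefault(blob, []).append(path)

    renames = []
    for path, blob in sorted(added.items()):
        candidates = removed_by_blob.get(blob)
        if candidates:
            renames.append((candidates.pop(0), path))
    return renames


before = {"README.md": "a1b2c3d", "app.py": "e4f5a6b"}
after = {"README.md": "a1b2c3d", "main.py": "e4f5a6b"}
print(exact_renames(before, after))  # [('app.py', 'main.py')]
```

No rename is recorded anywhere in the repository; the pairing is worked out after the fact from the two snapshots.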
Quiz / Self-Assessment
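1. What are the four types of objects Git stores in its database?
Answer: Blobs (file contents), trees (directory listings), commits (snapshots with metadata), and annotated tags.
2. How does Git identify objects in its database?
Answer: By a hash computed from the object's content: a 40-character SHA-1 hash by default, or a 64-character SHA-256 hash in repositories created with --object-format=sha256.
3. Does Git store deltas (differences) or snapshots?
Answer: Snapshots. Every commit points to a tree describing the complete project, and unchanged files reuse existing blobs. Packfiles use delta compression, but only as a storage and transfer optimization.
4. What does a tree object contain?
Answer: A list of entries, each mapping a name and file mode to either a blob (file) or another tree (subdirectory).
5. Where does Git physically store a blob with hash af5626b4a114abcb82d63db7c8082c3c4756e51b?
Answer: .git/objects/af/5626b4a114abcb82d63db7c8082c3c4756e51b. The first 2 characters of the hash form the subdirectory name, the remaining 38 characters form the filename. The file is zlib-compressed.
6. What is the difference between a blob and a tree?
Answer: A blob stores the raw contents of a single file, with no filename or permissions. A tree represents a directory: it supplies the names and modes and points to blobs and other trees.
7. What does git cat-file -t <hash> do?
Answer: It shows the type of the object: blob, tree, commit, or tag.
8. Why can Git detect data corruption?
Answer: Every object is stored under the hash of its content, so if the stored data no longer matches its hash, something is wrong. git fsck performs this check across the entire database.
9. What is the purpose of git gc (garbage collection)?
Answer: It packs loose objects into compressed packfiles and removes unreachable objects once they are older than the grace period (2 weeks by default).
10. If you rename a file without changing its content, does Git create a new blob?
Answer: No. The content is unchanged, so the blob hash is the same; the new tree just maps a different filename to the existing blob.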