DevOps — Quick Summary
Quick revision: every topic, key terms, and mnemonics for DevOps.
This is a quick revision doc covering all 43 topics in the DevOps collection. Open the linked notes if you want depth — this is meant to re-cement what we already learned.
Linux Fundamentals
Linux Filesystem and Navigation
What it is. Linux follows the FHS (Filesystem Hierarchy Standard). Every distro lays out files in the same predictable directories — once we learn the layout, any Linux box feels familiar.
Key terms.
- /etc — config files (nginx.conf, ssh, cron live here)
- /var/log — system and app logs
- /home — user home directories
- /usr/bin — most user commands
- /tmp — temp files (cleared on reboot)
- /proc — virtual filesystem with process/system info
Commands.
pwd; ls -la; cd ~; cd -
cat file.log; head -20 file.log; tail -f file.log
grep -i "error" app.log; grep -rn "TODO" src/
find . -name "*.log" -mtime -1
ps aux | grep nginx | wc -l
echo "x" >> file; sort data > sorted; cmd 2> err.log
awk '{print $1, $3}' access.log
Remember. Config in /etc, logs in /var/log, our stuff in /home. Pipes are an assembly line. > overwrites, >> appends.
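The overwrite-versus-append rule is easy to check with a scratch file. A minimal sketch, assuming a POSIX shell with mktemp available; the variable names are made up for this demo.

```shell
# Demonstrate that > truncates the file while >> appends to it.
demo_file=$(mktemp)             # throwaway scratch file

echo "first"  >  "$demo_file"   # > creates or overwrites
echo "second" >> "$demo_file"   # >> adds a line at the end
lines_after_append=$(wc -l < "$demo_file" | tr -d ' ')

echo "third"  >  "$demo_file"   # > again: the earlier two lines are gone
lines_after_overwrite=$(wc -l < "$demo_file" | tr -d ' ')
final_content=$(cat "$demo_file")

rm -f "$demo_file"
echo "append: $lines_after_append lines, overwrite: $lines_after_overwrite line ($final_content)"
```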
File Permissions and Ownership
What it is. Every file has owner/group/others, each with read/write/execute toggles. A 3x3 grid is the entire system.
Key terms.
- r=4, w=2, x=1 — octal values, add per group
- 755 — scripts/dirs (rwxr-xr-x)
- 644 — regular files (rw-r--r--)
- 600 — secrets (only owner)
- SUID (4xxx) — file runs as owner (e.g. passwd)
- SGID (2xxx) — dir’s new files inherit group
- Sticky bit (1xxx) — only file owner can delete (/tmp)
Commands.
chmod 755 deploy.sh; chmod +x file
chmod u=rwx,g=rx,o= dir/
chown manish:developers file.txt
chown -R user:group /var/www
umask 0022 # files=644, dirs=755
Remember. rwx → 421, add them per group. 755/644/600 covers 95% of real-world cases.
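The "add them per group" rule can be written out as a small function. This is a sketch only: symbolic_to_octal is a hypothetical helper name, and it ignores the special bits (SUID, SGID, sticky).

```shell
# Convert a 9-character symbolic mode (e.g. rwxr-xr-x) to octal (e.g. 755).
# Hypothetical helper for illustration; special bits are not handled.
symbolic_to_octal() (
  mode=$1
  [ "${#mode}" -eq 9 ] || { echo "expected 9 characters" >&2; exit 1; }
  result=""
  while [ -n "$mode" ]; do
    triplet=$(printf '%s' "$mode" | cut -c1-3)   # owner, then group, then others
    mode=$(printf '%s' "$mode" | cut -c4-)
    digit=0
    case $triplet in r??) digit=$((digit + 4)) ;; esac   # read    = 4
    case $triplet in ?w?) digit=$((digit + 2)) ;; esac   # write   = 2
    case $triplet in ??x) digit=$((digit + 1)) ;; esac   # execute = 1
    result="$result$digit"
  done
  echo "$result"
)

symbolic_to_octal rwxr-xr-x
```

The call at the end prints 755, matching the scripts/dirs entry in the list above.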
Process Management
What it is. Every running program is a process. We view them, signal them, and let systemd babysit them.
Key terms.
- PID — process ID
- Process — independent program, own memory; Thread — runs inside a process, shared memory
- SIGTERM (15) — graceful “please stop”
- SIGKILL (9) — instant death, can’t be caught
- SIGHUP — reload config
- Zombie — child finished but parent didn’t reap exit status
- systemd — modern Linux service manager
Commands.
ps aux | grep nginx
top / htop
kill 1234 # SIGTERM
kill -9 1234 # SIGKILL (last resort)
nohup ./job.sh > out.log 2>&1 &
systemctl start|stop|restart|reload|status|enable nginx
journalctl -u nginx -f --since "1 hour ago"
Remember. Always try SIGTERM first, SIGKILL only when stuck. systemctl enable survives reboot, start runs now.
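The "SIGTERM first, SIGKILL only when stuck" habit can be sketched as a helper. stop_gracefully is a hypothetical name, and the 10-second default is an assumption to tune per service.

```shell
# Ask a process to stop with SIGTERM, wait up to N seconds, then fall back to SIGKILL.
# stop_gracefully is a hypothetical helper; the default timeout is an assumption.
stop_gracefully() (
  pid=$1
  timeout=${2:-10}
  kill -TERM "$pid" 2>/dev/null || exit 0    # already gone: nothing to do
  while [ "$timeout" -gt 0 ]; do
    kill -0 "$pid" 2>/dev/null || exit 0     # kill -0 sends nothing, it only checks the PID
    sleep 1
    timeout=$((timeout - 1))
  done
  kill -KILL "$pid" 2>/dev/null || true      # last resort, cannot be caught
)
```

Usage would look like stop_gracefully 1234 15. For services managed by systemd, systemctl stop already follows the same pattern.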
Shell Scripting Essentials
What it is. Shell scripts automate sequences of commands. Every DevOps engineer writes them daily.
Key terms.
- Shebang — #!/bin/bash at top of file
- $1 $2 $# — args and arg count
- $? — last exit code (0 = success)
- $(cmd) — command substitution
- set -euo pipefail — safe-script holy trinity (exit on error, unset vars error, pipe failures)
- [[ -f file ]] — modern test syntax
- local — function-scoped variable
Code.
#!/bin/bash
set -euo pipefail
log() { echo "[$(date '+%H:%M')] $1"; }
for svc in nginx postgresql; do
  if systemctl is-active --quiet "$svc"; then
    log "OK $svc"
  else
    log "DOWN $svc"; systemctl restart "$svc"
  fi
done
Remember. No spaces around =. Always quote "$var". set -euo pipefail saves us from silent bugs.
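What pipefail and -u actually change shows up in exit codes. A small sketch, assuming bash is installed; each case runs in its own bash process so the options don't leak into the current shell.

```shell
# Compare exit statuses with and without the safety options.
status_default=0
bash -c 'false | true' || status_default=$?
# stays 0: without pipefail only the last command in the pipe counts

status_pipefail=0
bash -c 'set -o pipefail; false | true' || status_pipefail=$?
# becomes 1: the failing first stage is reported

status_unset=0
bash -c 'set -u; echo "$undefined_variable"' 2>/dev/null || status_unset=$?
# non-zero: reading an unset variable is an error

echo "default=$status_default pipefail=$status_pipefail unset=$status_unset"
```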
Package Management and System Services
What it is. apt/yum install software, systemctl manages services, cron schedules tasks.
Key terms.
- apt — Debian/Ubuntu; yum/dnf — RHEL/CentOS/Fedora
- apt remove — keeps config files; apt purge — wipes config too
- Cron format — min hour day month weekday
- */5 in the minute field — every 5 minutes; 0 2 * * * — daily at 2 AM
Commands.
sudo apt update && sudo apt install nginx
sudo systemctl enable --now nginx
crontab -e # */5 * * * * /opt/cron.sh >> /var/log/cron.log 2>&1
journalctl -u nginx --since "30 min ago" -f
Remember. Always apt update before install. After install: enable then start. Use crontab.guru to write cron expressions.
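To re-cement the five-field cron layout, here is a sketch that labels each field of a crontab line. describe_cron is a hypothetical helper; it only splits the line and does not validate ranges.

```shell
# Split a crontab line into its five schedule fields plus the command.
# describe_cron is a hypothetical helper; it labels fields and validates nothing else.
describe_cron() (
  set -f          # disable globbing so * stays a literal star
  set -- $1       # word-split the line into positional parameters
  [ "$#" -ge 6 ] || { echo "need 5 schedule fields and a command" >&2; exit 1; }
  echo "minute=$1 hour=$2 day-of-month=$3 month=$4 day-of-week=$5"
  shift 5
  echo "command=$*"
)

describe_cron '0 2 * * * /opt/cron.sh'
```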
Networking Essentials
OSI Model and TCP/IP
What it is. Networking organized as layers, each with one job. OSI = 7 (theoretical), TCP/IP = 4 (practical).
Key terms.
- OSI layers — Physical, Data Link, Network, Transport, Session, Presentation, Application
- TCP/IP layers — Network Access, Internet, Transport, Application
- Encapsulation — each layer wraps data with its own header (segment → packet → frame)
- Devices — routers (L3), switches (L2), hubs (L1)
Mnemonic. OSI top-down: “All People Seem To Need Data Processing” (Application, Presentation, Session, Transport, Network, Data Link, Physical).
Remember. Layer 3 = routing/IP issues. Layer 4 = TCP/firewall/ports. Layer 7 = application/proxy errors. Pick the layer to debug.
DNS and Domain Resolution
What it is. DNS is the phone book of the internet — translates google.com to 142.250.80.46.
Key terms.
- A — domain → IPv4
- AAAA — domain → IPv6
- CNAME — alias to another domain (not allowed at the root/apex domain)
- MX — mail server (with priority)
- TXT — arbitrary text (SPF, DKIM, verification)
- NS — authoritative nameservers
- TTL — cache duration in seconds
- Recursive resolver — ISP/Cloudflare/Google DNS that does the lookup work
Resolution flow. Browser cache → OS cache → recursive resolver → root → TLD (.com) → authoritative → answer.
Commands.
dig pman47.cc +short
dig @8.8.8.8 pman47.cc MX
dig pman47.cc +trace
nslookup -type=MX example.com
Remember. Lower TTL before migration. CNAMEs can’t sit at the root domain. /etc/hosts overrides DNS locally.
HTTP, HTTPS, and TLS
What it is. HTTP = how browsers and servers talk. HTTPS = HTTP wrapped in TLS encryption.
Key terms.
- Idempotent — same call N times = same result (GET/PUT/DELETE yes, POST/PATCH no)
- Status families — 1xx info, 2xx success, 3xx redirect, 4xx client, 5xx server
- 401 vs 403 — 401 “who are you”, 403 “I know you, you can’t do this”
- 502 vs 504 — 502 backend unreachable, 504 backend timed out
- TLS handshake — ClientHello + key share → ServerHello + cert → encrypted (TLS 1.3 = 1 RTT)
- Certificate — proves identity, signed by CA
- HTTP/2 — multiplexing over one connection; HTTP/3 — over QUIC/UDP
Status code cheatsheet.
| Code | Meaning |
|---|---|
| 200 | OK |
| 201 | Created |
| 204 | No Content |
| 301 | Moved Permanently |
| 302 | Found (temporary) |
| 304 | Not Modified |
| 400 | Bad Request |
| 401 | Unauthorized (not logged in) |
| 403 | Forbidden (not allowed) |
| 404 | Not Found |
| 429 | Too Many Requests |
| 500 | Internal Server Error |
| 502 | Bad Gateway |
| 503 | Service Unavailable |
| 504 | Gateway Timeout |
Remember. POST and PATCH aren’t guaranteed idempotent; PUT and DELETE change state too, but repeating them gives the same result. Let’s Encrypt + Caddy = free auto HTTPS.
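The status families and the idempotency list translate directly into two lookups. A sketch with hypothetical helper names; HEAD and OPTIONS are included as idempotent per the HTTP spec, although the notes above don't list them.

```shell
# Map an HTTP status code to its family (1xx-5xx).
status_family() {
  case $1 in
    1[0-9][0-9]) echo "informational" ;;
    2[0-9][0-9]) echo "success" ;;
    3[0-9][0-9]) echo "redirect" ;;
    4[0-9][0-9]) echo "client error" ;;
    5[0-9][0-9]) echo "server error" ;;
    *) echo "unknown"; return 1 ;;
  esac
}

# Succeeds for methods that are safe to retry blindly (upper-case names only).
is_idempotent() {
  case $1 in
    GET|HEAD|OPTIONS|PUT|DELETE) return 0 ;;
    *) return 1 ;;
  esac
}

status_family 502
```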
TCP vs UDP
What it is. Two transport protocols. TCP is reliable, UDP is fast.
Key terms.
- TCP — connection-oriented, ordered, guaranteed delivery, retransmits, flow control
- UDP — connectionless, no guarantees, 8-byte header, fire and forget
- Three-way handshake — SYN → SYN-ACK → ACK
- Four-way teardown — FIN → ACK → FIN → ACK
- Window size — flow control buffer (receiver tells sender to slow)
- Well-known ports — 0-1023 (need root); ephemeral 49152-65535
Common ports. 22 SSH, 53 DNS, 80 HTTP, 443 HTTPS, 3306 MySQL, 5432 Postgres, 6379 Redis.
Remember. Web/email/SSH/DB → TCP. Video/gaming/DNS/VoIP → UDP. HTTP/3 runs over QUIC, which layers its own reliability on top of UDP.
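The port ranges can be checked with a tiny classifier. A sketch only: port_class is a hypothetical helper, and the "registered" range (1024-49151) is the IANA name for the middle band, which the notes above don't mention.

```shell
# Classify a TCP/UDP port number by IANA range.
# port_class is a hypothetical helper; "registered" is an addition to the notes above.
port_class() (
  port=$1
  if [ "$port" -lt 0 ] || [ "$port" -gt 65535 ]; then
    echo "invalid"; exit 1
  elif [ "$port" -le 1023 ]; then
    echo "well-known"      # binding usually needs root
  elif [ "$port" -le 49151 ]; then
    echo "registered"
  else
    echo "ephemeral"
  fi
)

port_class 443
```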
Load Balancing
What it is. Distributes traffic across multiple servers for scalability + HA.
Key terms.
- L4 — routes by IP+port, fast, no HTTP awareness
- L7 — routes by URL/headers/cookies, slower, smart
- Round Robin — turn by turn
- Weighted RR — bigger servers get more
- Least Connections — to whoever is least busy
- IP Hash — same client → same server (sticky)
- Health checks — active (LB pings) vs passive (LB watches errors)
- Sticky sessions — same user → same server (avoid; use Redis instead)
Tools. Nginx, HAProxy, Caddy, AWS ALB (L7) / NLB (L4), Traefik.
Remember. Most web traffic uses L7. Avoid sticky sessions — push state to Redis.
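Round robin is just "request number modulo backend count". A minimal sketch; pick_backend and the backend names app1..app3 are made up for illustration, and real load balancers also skip unhealthy backends.

```shell
# Round robin: request i goes to backend number (i mod N).
# pick_backend and the backend names are hypothetical.
pick_backend() (
  request_number=$1
  shift                                # remaining arguments are the backends
  [ "$#" -gt 0 ] || exit 1             # no backends configured
  shift $(( request_number % $# ))     # skip ahead to the chosen backend
  echo "$1"
)

for i in 0 1 2 3 4; do
  pick_backend "$i" app1 app2 app3     # prints app1 app2 app3 app1 app2, one per line
done
```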
Networking Tools and Troubleshooting
What it is. The toolbox for “the site is down” diagnosis.
Key terms.
- curl — Swiss-army HTTP tool (-v verbose, -I headers, -L follow redirects)
- ping — ICMP reachability test
- traceroute — every hop on the path
- ss / netstat — what’s listening on ports
- tcpdump — raw packet capture
- iptables / ufw — firewall
Commands.
curl -v -I https://example.com
ping -c 4 google.com; traceroute -n google.com
ss -tlnp | grep :80
sudo tcpdump -i any -A -tttt port 443
sudo ufw allow 22/tcp; sudo ufw enable
Debugging workflow. ping → dig → ss/curl → service logs (journalctl -u, docker logs) → resources (top, df -h, free -h).
Remember. Always curl -v first. Full disk = silent death (df -h early in any debug).
Docker & Containers
Containers vs Virtual Machines
What it is. VMs run a full OS on a hypervisor. Containers share the host kernel via namespaces + cgroups.
Key terms.
- Hypervisor — slices hardware among VMs (heavy, GBs, minutes to boot)
- Namespaces — give container its own view (process, network, mount, user)
- cgroups — limit CPU, memory, I/O per container
- Container — milliseconds to start, MBs in size, shares host kernel
Remember. VM = whole apartment. Container = room in a co-living space. Modern stacks run containers inside VMs.
Docker Images and Layers
What it is. An image is a stack of read-only layers. Container = image + thin writable layer.
Key terms.
- Layer — each filesystem-changing Dockerfile instruction (RUN, COPY, ADD) adds one layer (cached, sharable)
- Registry — Docker Hub, GHCR, ECR, GCR, ACR
- Tag — mutable label (nginx:alpine)
- Digest — immutable SHA256 hash (nginx@sha256:abc...)
- latest tag — dangerous in production; pin versions or digests
Commands.
docker history nginx:alpine
docker pull ghcr.io/pman47/gyaan:latest
docker tag my-app:latest ghcr.io/me/my-app:v1
docker push ghcr.io/me/my-app:v1
docker images --digests
Remember. Image = class. Container = object. Layers cache by hash; same base layer is stored once on disk.
Dockerfile Best Practices
What it is. Writing efficient, secure, cache-friendly image builds.
Key terms.
- FROM, WORKDIR, COPY, RUN, ENV, EXPOSE, CMD, ENTRYPOINT — core instructions
- CMD — default command, fully overridable
- ENTRYPOINT — fixed verb, args appended
- Multi-stage build — build in one image, copy artifacts to a smaller runtime image
- .dockerignore — exclude junk from COPY .
- Cache invalidation — change one layer = rebuild everything below
Code.
FROM node:20-alpine AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build
FROM nginx:1.27-alpine
COPY --from=builder /app/dist /usr/share/nginx/html
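# No USER node here: that user exists only in the node image, not in nginx.
# For a non-root runtime, use an unprivileged nginx image listening on a port above 1023.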
EXPOSE 80
CMD ["nginx", "-g", "daemon off;"]
Remember. Copy package.json BEFORE source code so deps cache survives code changes. Never bake secrets into images. Run as non-root.
Docker Networking
What it is. How containers talk to each other and the outside.
Key terms.
- bridge — default; custom bridge — adds DNS by name
- host — no isolation, container shares host net
- none — zero networking
- overlay — multi-host (Swarm/orchestration)
- Port mapping — -p HOST:CONTAINER
Commands.
docker network create my-net
docker run -d --name db --network my-net postgres:16
docker run -d --name app --network my-net -p 8080:3000 my-app
# app reaches db via hostname "db:5432"
docker network ls; docker network inspect my-net
Remember. Use custom bridge networks 90% of the time — they give DNS for free. Containers on the same network reach each other by name.
Docker Volumes and Storage
What it is. Containers are ephemeral. Volumes persist data.
Key terms.
- Volumes — Docker-managed, in /var/lib/docker/volumes/, best for prod
- Bind mounts — host path mounted in, best for dev (live-reload code)
- tmpfs — in-memory only, for sensitive temp data
Commands.
docker volume create pg-data
docker run -d -v pg-data:/var/lib/postgresql/data postgres:16
docker run -d -v "$(pwd)":/app node:20 # bind mount (quoted so paths with spaces work)
Remember. Database data → named volumes (always). Source code in dev → bind mounts. tmpfs for secrets you don’t want on disk.
Docker Compose
What it is. Define multi-container apps in one YAML file. One command to start/stop everything.
Key terms.
- services — each container
- depends_on — start order (only waits for container start, not service ready)
- profiles — optionally run debug/dev services
- Service name = DNS hostname on default network
Code.
services:
  api:
    build: .
    ports: ["3000:3000"]
    environment:
      DATABASE_URL: postgresql://app:secret@db:5432/myapp
    depends_on: [db]
    restart: unless-stopped
  db:
    image: postgres:16-alpine
    environment:
      POSTGRES_USER: app
      POSTGRES_PASSWORD: secret
      POSTGRES_DB: myapp   # without this the default DB is named after the user ("app")
    volumes:
      - pg-data:/var/lib/postgresql/data
volumes:
  pg-data:
Commands. docker compose up -d, down, logs -f, exec, ps, restart.
Remember. Service name IS the hostname. depends_on doesn’t wait for the app inside — use healthchecks for that.
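Since depends_on only waits for the container to start, an entrypoint script often retries a readiness check itself. A sketch of that idea: wait_until is a hypothetical helper, and the attempt count and delay are assumptions. Compose's own healthcheck plus depends_on with condition: service_healthy is the native alternative.

```shell
# Retry a command until it succeeds or the attempts run out.
# wait_until is a hypothetical helper: wait_until ATTEMPTS DELAY_SECONDS COMMAND...
wait_until() (
  attempts=$1
  delay=$2
  shift 2
  while [ "$attempts" -gt 0 ]; do
    "$@" && exit 0                      # readiness check passed
    attempts=$((attempts - 1))
    [ "$attempts" -gt 0 ] && sleep "$delay"
  done
  exit 1                                # never became ready
)
```

In the api container this could be called as wait_until 30 2 pg_isready -h db -p 5432 before starting the app, assuming the Postgres client tools are installed in that image.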
Container Debugging and Commands
What it is. When containers crash or misbehave, here’s the toolbox.
Key terms.
- docker ps -a — all containers including stopped
- docker logs -f --tail 50 — recent logs
- docker exec -it <c> sh — shell inside running container
- docker inspect — full JSON state
- Exit code 137 — killed by SIGKILL (OOMKilled, docker kill, or docker stop after its grace period)
- Exit code 139 — segfault
- docker stats — live CPU/memory
Crash workflow. docker ps -a → docker logs → docker inspect --format='{{.State.ExitCode}}' → docker run -it --entrypoint sh image:tag.
Cleanup. docker system prune -a --volumes (nuclear). docker system df (usage).
Remember. 137 usually means OOM. Raise the memory limit with docker run -m (e.g. -m 512m). Always check logs before guessing.
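Exit codes above 128 follow a convention: code minus 128 is the signal that killed the process. A sketch that decodes them; explain_exit_code is a hypothetical helper and relies on the shell's kill -l to name the signal.

```shell
# Decode a container or process exit code.
# Convention: code > 128 means "terminated by signal (code - 128)".
explain_exit_code() (
  code=$1
  if [ "$code" -eq 0 ]; then
    echo "success"
  elif [ "$code" -gt 128 ]; then
    signal=$((code - 128))
    name=$(kill -l "$signal" 2>/dev/null) || name="unknown"
    echo "killed by signal $signal (SIG$name)"
  else
    echo "exited with error $code"
  fi
)

explain_exit_code 137
explain_exit_code 139
```

The two calls report signal 9 (SIGKILL) and signal 11 (SIGSEGV), which is where the 137 and 139 entries above come from.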
Docker cheatsheet
| Command | Purpose |
|---|---|
| docker run -d -p 8080:80 nginx | run detached, port-mapped |
| docker exec -it <c> sh | shell into running container |
| docker logs -f --tail 100 <c> | follow logs |
| docker inspect <c> | full state JSON |
| docker stats | live CPU/mem |
| docker system prune -a | clean unused stuff |
| docker compose up -d --build | start + rebuild |
Kubernetes
Kubernetes Architecture
What it is. Container orchestrator. We declare desired state, K8s makes it happen.
Key terms.
- Control plane — the brain (API Server, etcd, Scheduler, Controller Manager)
- API Server — front door for every kubectl/component
- etcd — distributed KV store, holds entire cluster state
- Scheduler — picks which node runs each Pod
- Controller Manager — control loops fixing drift
- Worker node — runs kubelet, kube-proxy, container runtime
- kubelet — agent that talks to API server, manages Pods on its node
- kube-proxy — sets iptables/IPVS for Service traffic
- CRI — Container Runtime Interface (containerd, CRI-O)
Remember. Control plane = brain, workers = hands. Everything goes through the API server. Components watch and react rather than calling each other directly.
Pods and Workloads
What it is. Pod = smallest deployable unit (1+ containers sharing network/storage). We use higher-level workloads.
Key terms.
- Pod phases — Pending, Running, Succeeded, Failed, Unknown
- Init container — runs before main container starts
- Sidecar — helper container alongside main app (proxy, log shipper)
- Deployment — stateless apps with rolling updates + rollback
- ReplicaSet — Deployment uses this internally to keep N Pods alive
- StatefulSet — stable hostnames, persistent storage, ordered (databases, queues)
- DaemonSet — one Pod per node (log/metric agents)
- Job / CronJob — run-to-completion / scheduled
Code.
apiVersion: apps/v1
kind: Deployment
metadata: { name: my-app }
spec:
  replicas: 3
  selector: { matchLabels: { app: my-app } }
  template:
    metadata: { labels: { app: my-app } }
    spec:
      containers:
        - name: app
          image: my-app:v2
Remember. Almost never create Pods directly. Deployment for stateless, StatefulSet for databases, DaemonSet for per-node, Job for batch.
Services and Networking
What it is. Pods are ephemeral with changing IPs. Services give stable endpoints.
Key terms.
- ClusterIP — internal only (default)
- NodePort — open static port (30000-32767) on every node
- LoadBalancer — provisions cloud LB (one per service = $$$)
- ExternalName — CNAME alias to external DNS
- Selector — labels match Pods to Service
- CoreDNS — <svc>.<ns>.svc.cluster.local
- Ingress — HTTP routing by host/path through one LB
- Ingress Controller — actual proxy (nginx-ingress, Traefik); Ingress = config
Remember. ClusterIP = internal default. Ingress = one cheap entry point routing to many services. LoadBalancer = expensive per-service cloud LB.
ConfigMaps and Secrets
What it is. Separate config from container images.
Key terms.
- ConfigMap — non-sensitive key-value config
- Secret — sensitive data, base64-encoded (NOT encrypted by default!)
- Inject as env vars (need restart) or volume mounts (auto-update)
- immutable: true — locks ConfigMap/Secret after creation
Code.
envFrom:
- configMapRef: { name: app-config }
- secretRef: { name: db-credentials }
Remember. Secrets are encoded, not encrypted. Real security needs encryption at rest in etcd or external (Vault, Sealed Secrets).
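"Encoded, not encrypted" takes one round trip to verify. A sketch assuming the GNU coreutils base64 tool; the password is a made-up example value.

```shell
# base64 is reversible by anyone: no key is involved.
# The secret value is a made-up example.
encoded=$(printf '%s' 's3cr3t-pass' | base64)
decoded=$(printf '%s' "$encoded" | base64 -d)   # GNU flag; some BSD/macOS versions use -D
echo "encoded=$encoded decoded=$decoded"
```

Anyone who can read the Secret object can do the same, which is why encryption at rest or an external store matters.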
Persistent Volumes and Storage
What it is. Storage that survives Pod restarts.
Key terms.
- PV — actual disk (EBS, PD, NFS)
- PVC — request for storage (“I need 10Gi”)
- StorageClass — defines how to dynamically provision
- Access modes — RWO (one node R/W), ROX (many R), RWX (many R/W, needs NFS/EFS)
- Reclaim policy — Delete (auto cleanup) or Retain (keep data)
- volumeClaimTemplates — each StatefulSet replica gets its own PVC
Remember. Block storage (EBS, PD) is usually RWO only. RWX needs NFS-style. StatefulSet + volumeClaimTemplates = standard DB pattern.
Resource Management and Scaling
What it is. Tell K8s how much CPU/memory we need so nothing starves.
Key terms.
- Requests — guaranteed minimum (used for scheduling)
- Limits — hard ceiling
- CPU over limit — throttled (slow but alive)
- Memory over limit — OOMKilled (terminated)
- QoS — Guaranteed (req=lim), Burstable (req<lim), BestEffort (none) — eviction order BestEffort first
- HPA — scales replica count by CPU/memory/custom
- VPA — right-sizes Pod requests
- Cluster Autoscaler — adds/removes nodes
- LimitRange — defaults per container; ResourceQuota — caps per namespace
Code.
resources:
  requests: { cpu: "250m", memory: "128Mi" }
  limits: { cpu: "500m", memory: "256Mi" }
Remember. CPU throttle, memory kill. Always set requests + limits in prod. 1 CPU = 1 core, 1000m = 1 core.
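The millicore arithmetic is worth doing once by hand. A sketch with hypothetical helpers; it handles whole cores and the m suffix only, and the "fits" number is a rough upper bound that ignores memory and system-reserved capacity.

```shell
# Convert a Kubernetes CPU quantity ("250m", "1", "2") to millicores.
# Decimals such as "0.5" are out of scope for this sketch.
cpu_to_millicores() {
  case $1 in
    *m) echo "${1%m}" ;;            # already in millicores: strip the suffix
    *)  echo $(( $1 * 1000 )) ;;    # whole cores: 1 core = 1000m
  esac
}

# How many Pods with a given CPU request fit on a node, by CPU alone.
pods_that_fit() {
  echo $(( $(cpu_to_millicores "$1") / $(cpu_to_millicores "$2") ))
}

pods_that_fit 4 250m
```

With the 250m request from the snippet above, a 4-core node schedules at most 16 such Pods by CPU.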
RBAC and Security
What it is. Lock down who/what can do what.
Key terms.
- Role — namespace-scoped permissions
- RoleBinding — binds Role to subjects
- ClusterRole / ClusterRoleBinding — cluster-wide
- ServiceAccount — identity for Pods (vs Users for humans)
- Pod Security Standards — Privileged / Baseline / Restricted
- SecurityContext — runAsNonRoot, readOnlyRootFilesystem, drop capabilities
- NetworkPolicy — Pod-level firewall (needs CNI like Calico/Cilium)
- Default deny pattern — start with podSelector: {} ingress block, then allow
Remember. Principle of least privilege. Default-deny NetworkPolicy + drop ALL capabilities + non-root + read-only FS = serious hardening. RBAC is purely additive: there are no deny rules, so anything not granted is denied.
kubectl cheatsheet
| Command | Purpose |
|---|---|
| kubectl get pods -n ns | list pods |
| kubectl describe pod <p> | full pod state |
| kubectl logs -f <p> -c <ctnr> | follow logs |
| kubectl exec -it <p> -- sh | shell in pod |
| kubectl apply -f file.yaml | declarative apply |
| kubectl rollout status deploy/x | watch rollout |
| kubectl rollout undo deploy/x | rollback |
| kubectl autoscale deploy x --min=2 --max=10 --cpu-percent=70 | quick HPA |
| kubectl auth can-i list pods --as=... | check RBAC |
CI/CD & GitOps
CI/CD Fundamentals
What it is. Automate build/test/deploy. CI = catches bugs early. CD = ship safely + often.
Key terms.
- CI — every push triggers automated build + test
- Continuous Delivery — always deployable, human clicks deploy
- Continuous Deployment — every passing change goes to prod automatically
- Pipeline — stages: code → build → test → scan → deploy
Remember. Delivery = “we can deploy anytime.” Deployment = “we do deploy every time.” Smaller diffs are easier to debug.
Pipeline Design
What it is. Structuring fast, reliable pipelines.
Key terms.
- Stages — Lint → Build → Unit Test → Integration → Security → Deploy
- Artifacts — outputs passed between stages (no rebuilding)
- Caching — node_modules, Docker layers, .m2
- Parallel jobs — run independent stages concurrently
- Matrix builds — test multiple versions/OSes in parallel
Code.
jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: 20, cache: npm }
      - run: npm ci && npm run lint
  test:
    needs: lint
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci && npm test
  deploy:
    needs: test
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4   # each job starts clean, so deploy.sh must be checked out
      - run: ./deploy.sh
Remember. Fastest checks first. Keep total under 10 min — beyond that nobody waits. Cache aggressively.
Deployment Strategies
What it is. Safe ways to ship code with rollback paths.
Key terms.
- Rolling — replace instances one at a time (K8s default, both versions briefly live)
- Blue-Green — two envs, flip LB. Instant rollback. 2x cost.
- Canary — 5% → 25% → 100% based on metrics. Safest. Needs observability.
- Recreate — kill old, start new. Has downtime. Dev only.
- Feature flags — deploy code disabled, toggle for users without redeploy
Strategy comparison.
| Strategy | Downtime | Rollback | Cost | When |
|---|---|---|---|---|
| Recreate | Yes | Slow | 1x | Dev/staging |
| Rolling | None | Slow | 1x | Default for K8s |
| Blue-Green | None | Instant | 2x | Need fast rollback |
| Canary | None | Fast | ~1.1x | High-traffic, good monitoring |
Remember. Default to rolling. Use canary when stakes are high and metrics are good.
GitOps and ArgoCD
What it is. Git is the source of truth. Cluster pulls desired state from Git.
Key terms.
- Push — pipeline kubectl-applies (needs cluster creds)
- Pull — agent inside cluster watches Git (more secure)
- ArgoCD Application — CRD pointing at Git repo + path → cluster + namespace
- Auto-sync — applies on Git change
- Prune — deletes resources removed from Git
- Self-heal — reverts manual cluster changes
- Drift detection — actual vs Git state
Remember. “If it’s not in Git, it doesn’t exist.” Every change = PR. Audit log = git log.
Artifact Management and Registries
What it is. Store and version build outputs.
Key terms.
- Artifact — built output (image, JAR, npm package, binary, Helm chart)
- Registry — Docker Hub, GHCR, ECR, GCR, GAR, ACR, Harbor
- Tagging — semver (v1.2.3), Git SHA, never latest in prod
- Trivy — open-source vulnerability scanner
- Cosign — image signing (Sigstore)
Code.
docker build -t ghcr.io/me/app:v1.2.3 -t ghcr.io/me/app:$(git rev-parse --short HEAD) .
docker push ghcr.io/me/app:v1.2.3
trivy image --exit-code 1 --severity CRITICAL ghcr.io/me/app:v1.2.3
cosign sign ghcr.io/me/app:v1.2.3
Remember. Tag with both semver AND git SHA. Never latest in prod manifests. Scan in CI; sign before deploy.
Cloud & Infrastructure
Cloud Computing Models
What it is. Spectrum from “we manage everything” (IaaS) to “we manage nothing” (SaaS).
Key terms.
- IaaS — VMs, networking (EC2, Compute Engine, Droplets)
- PaaS — runtime managed (Heroku, App Engine, Railway)
- Serverless — function-as-a-service (Lambda, Cloud Functions, Workers)
- SaaS — finished software (Gmail, Slack, GitHub)
- Shared responsibility — provider secures the cloud, we secure what’s IN the cloud
- Multi-cloud — multiple providers; Hybrid — on-prem + cloud
Remember. Higher in the stack = less control, less ops. Real architectures mix all four.
VPC and Network Architecture
What it is. Our isolated private network in the cloud.
Key terms.
- VPC (AWS) / VPC Network (GCP) / VNet (Azure)
- Public subnet — has Internet Gateway route, public IPs allowed (LB, bastion, NAT)
- Private subnet — no direct internet, outbound via NAT (apps, DBs)
- Internet Gateway — door to public internet
- NAT Gateway — private subnet → outbound only
- Route Table — rules per subnet
- Security Group — instance-level, stateful (response auto-allowed)
- NACL — subnet-level, stateless (must allow both directions)
- VPC Peering — private connection between VPCs
Remember. SG = stateful = apartment door. NACL = stateless = building gate. Public for LB+NAT, private for apps+DB.
IAM and Access Management
What it is. Who can do what on which resources.
Key terms.
- User — human; Group — collection of users
- Role — temporary identity that trusted principals (users, services, apps) can assume
- Policy — JSON with Effect/Action/Resource
- Principal — who’s making the request
- Least privilege — give the minimum needed
- Assume role — apps get temp credentials (Instance Profile / Service Account / Managed Identity)
- MFA — required for humans, especially root
- Audit — CloudTrail (AWS) / Cloud Audit Logs (GCP)
Code.
{ "Effect": "Allow",
"Action": ["s3:GetObject", "s3:ListBucket"],
"Resource": ["arn:aws:s3:::my-bucket", "arn:aws:s3:::my-bucket/*"] }
Remember. An explicit Deny always wins. Never Action: "*" for app roles. Use roles, not access keys, in code. MFA on the root account, then lock it away.
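The evaluation order behind "Deny wins" fits in a few lines. This is a deliberately simplified sketch: evaluate_effects is a hypothetical helper that takes the Effect of every statement matching a request, and real cloud IAM adds more layers (organization policies, permission boundaries, resource policies).

```shell
# Simplified policy decision from the Effects of all matching statements:
# any explicit Deny wins, otherwise any Allow grants, otherwise implicit deny.
evaluate_effects() (
  decision="ImplicitDeny"
  for effect in "$@"; do
    case $effect in
      Deny)  echo "Deny"; exit 0 ;;    # explicit deny short-circuits everything
      Allow) decision="Allow" ;;
    esac
  done
  echo "$decision"
)

evaluate_effects Allow Deny Allow
```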
Cloud Storage and Databases
What it is. Different storage types for different jobs.
Key terms.
- Object (S3, GCS, Blob) — files via HTTP, unlimited, cheapest. For uploads, backups, static assets.
- Block (EBS, PD, Managed Disks) — virtual disk, one VM, fast IOPS. For OS, databases.
- File (EFS, Filestore, Files) — shared NFS, multiple VMs.
- Managed DB — RDS/Cloud SQL (SQL), DynamoDB/Firestore/Cosmos (NoSQL).
- Cache — ElastiCache, Memorystore (Redis/Memcached).
Remember. S3 for files, EBS for disks, RDS for SQL, Redis for hot reads. Wrong choice = expensive and slow.
Serverless and Managed Services
What it is. Functions triggered by events. Pay per invocation. Scale to zero.
Key terms.
- Lambda / Cloud Functions / Workers — code runners
- API Gateway — HTTP front door for Lambda
- SQS — queue (decouple, retry)
- SNS — pub/sub fan-out
- EventBridge — event router
- Cold start — first invocation = 100ms-2s spin up
- Limits — Lambda max 15 min execution
When YES. Sporadic workloads, event processing, cron jobs, variable traffic APIs.
When NO. Latency-critical APIs (cold starts), long jobs (>15 min), steady high throughput (containers cheaper).
Remember. Free when idle, expensive at huge scale. Watch out for cold starts and vendor lock-in.
Infrastructure as Code
IaC Concepts and Benefits
What it is. Infrastructure defined in code, stored in Git, applied by tools.
Key terms.
- Declarative — describe end state (Terraform, CloudFormation)
- Imperative — describe steps (bash, AWS CLI)
- Idempotency — running twice = same result
- Reproducibility — same code → same infra
- Tools: Terraform, Pulumi, CloudFormation, Ansible
Remember. Declarative = ordering food. Imperative = giving cooking instructions. Most modern IaC is declarative.
Terraform Fundamentals
What it is. Declarative IaC tool by HashiCorp. Uses HCL.
Key terms.
- Provider — plugin (aws, google, azurerm, cloudflare)
- Resource — actual thing (aws_s3_bucket)
- Variable — input (with type + default)
- Output — exposed value
- Data source — read existing resource
- Workflow — init → plan → apply → destroy
Code.
provider "aws" { region = "ap-south-1" }
variable "bucket_name" { type = string }
resource "aws_s3_bucket" "assets" {
  bucket = var.bucket_name
  tags   = { ManagedBy = "terraform" }
}
output "bucket_arn" { value = aws_s3_bucket.assets.arn }
Remember. ALWAYS read terraform plan before apply. resource creates, data reads.
Terraform State and Modules
What it is. State file maps config → real resources. Modules = reusable packages.
Key terms.
- State — terraform.tfstate JSON, Terraform’s memory
- Local state — fine for solo, broken for teams
- Remote backend — S3 + DynamoDB (lock) standard
- State locking — prevents concurrent apply
- Drift detection — state vs reality
- Workspaces — separate state per env (dev/stage/prod)
- Module — directory with main.tf/variables.tf/outputs.tf
- Reference — module "x" { source = "./mod" ... } (one argument per line in real HCL)
State commands. terraform state list / show / rm / mv.
Remember. Remote backend with locking = day-1 setup. Modules = stop copy-pasting. state rm forgets but doesn’t delete.
Ansible Basics
What it is. Agentless config management over SSH. Configures servers AFTER they exist.
Key terms.
- Inventory — list of hosts (grouped)
- Playbook — YAML of tasks for hosts
- Module — built-in action (apt, copy, service, template, user, docker_container)
- Handler — task triggered by notify: (e.g., restart nginx)
- Role — packaged tasks/files/templates/handlers/defaults
- Idempotent — state: present, state: started — does nothing if already so
- Galaxy — npm-like for roles
Code.
- name: Setup web
  hosts: webservers
  become: true
  tasks:
    - apt: { name: nginx, state: present, update_cache: true }
    - copy: { src: nginx.conf, dest: /etc/nginx/nginx.conf }
      notify: restart nginx
    - service: { name: nginx, state: started, enabled: true }
  handlers:
    - name: restart nginx
      service: { name: nginx, state: restarted }
Remember. Terraform builds the house, Ansible furnishes it. No agents needed — just SSH.
Observability & Reliability
Monitoring and Alerting
What it is. Watch infra + apps, alert on real problems before users notice.
Key terms.
- Counter — only goes up (total requests)
- Gauge — up and down (current memory)
- Histogram — distribution (p50, p95, p99 latency)
- Summary — quantiles precomputed client-side (can’t be aggregated across instances)
- Prometheus — pull-based TSDB, scrapes /metrics, queries with PromQL
- Grafana — dashboards
- Alertmanager — routes alerts to Slack/PagerDuty/email
- USE method (infra) — Utilization, Saturation, Errors
- RED method (services) — Rate, Errors, Duration
PromQL.
rate(http_requests_total[5m])
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
Alert rules. Alert on symptoms not causes. Every alert actionable. Severities: critical/warning/info. Include runbook links.
Remember. USE for infra, RED for services. Alert fatigue is the real enemy — fewer good alerts > many noisy ones.
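What a p95 means is clearer on raw numbers. A sketch of the nearest-rank method over a handful of samples, using sort and awk; percentile is a hypothetical helper and takes whole-number percentiles only. Prometheus estimates quantiles from histogram buckets instead, so this shows the idea, not how histogram_quantile works internally.

```shell
# Nearest-rank percentile: sort the samples, take the value at rank ceil(p/100 * n).
# percentile P SAMPLE... (P must be a whole number between 1 and 100)
percentile() (
  p=$1
  shift
  printf '%s\n' "$@" | sort -n | awk -v p="$p" '
    { values[NR] = $1 }
    END {
      rank = int((p * NR + 99) / 100)   # integer form of ceil(p * NR / 100)
      if (rank < 1) rank = 1
      print values[rank]
    }'
)

# Ten request latencies in ms with one slow outlier (made-up data).
percentile 50 120 80 95 300 110 90 100 105 85 2000
percentile 95 120 80 95 300 110 90 100 105 85 2000
```

On this sample the median is 100 ms while the p95 is the 2000 ms outlier, which is why tail percentiles are worth alerting on.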
Logging and Log Aggregation
What it is. Centralize all logs, structured as JSON, searchable.
Key terms.
- Structured logging — JSON with consistent fields
- Log levels — DEBUG (dev only), INFO (normal), WARN (handled), ERROR (needs attention), FATAL (crash)
- ELK — Elasticsearch (store), Logstash (collect/parse), Kibana (UI)
- EFK — replaces Logstash with Fluentd (lighter, K8s-native)
- Correlation ID / Trace ID — unique ID per request, threaded through every service
- Log rotation — logrotate, Docker --log-opt max-size
- ILM — Elasticsearch index lifecycle (hot → cold → delete)
Remember. Always log structured JSON. Generate correlation ID at entry, pass it through every call. Hot logs in ES, cold in S3/Glacier.
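A structured log line with a correlation ID can be produced from plain shell. A sketch only: log_json and its field names (ts, level, correlation_id, msg) are assumptions, and only backslashes and double quotes in the message are escaped.

```shell
# Emit one JSON log line: timestamp, level, correlation ID, message.
# log_json LEVEL CORRELATION_ID MESSAGE (hypothetical helper; field names are assumptions)
log_json() (
  level=$1
  correlation_id=$2
  # Escape backslashes first, then double quotes, so the output stays valid JSON.
  message=$(printf '%s' "$3" | sed 's/\\/\\\\/g; s/"/\\"/g')
  timestamp=$(date -u '+%Y-%m-%dT%H:%M:%SZ')
  printf '{"ts":"%s","level":"%s","correlation_id":"%s","msg":"%s"}\n' \
    "$timestamp" "$level" "$correlation_id" "$message"
)

log_json INFO req-42 'payment accepted'
```

Passing the same req-42 value to every downstream call is what lets one search pull up the whole request path.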
Secrets Management and TLS
What it is. Keep passwords/tokens out of code; encrypt traffic.
Key terms.
- Env vars — fine for dev, leaky for prod (visible in /proc, logs)
- Vault — HashiCorp’s secret store, audit + RBAC + dynamic secrets
- Dynamic secrets — Vault creates short-lived DB users on demand
- Cloud secrets — AWS Secrets Manager, GCP Secret Manager, Key Vault
- Sealed Secrets — encrypt secrets in Git, only cluster decrypts
- TLS — encrypts data in transit; cert + private key + CA
- cert-manager — auto-issues + renews Let’s Encrypt certs in K8s
- Caddy — auto-HTTPS web server
- mTLS — both sides present certs (service mesh: Istio, Linkerd)
Remember. Three rules: never in code, always encrypted, rotate regularly. Short-lived > rotation. Caddy/cert-manager remove cert-renewal headaches.
High Availability and Disaster Recovery
What it is. Stay running through failures. Recover from disasters.
Key terms.
- SPOF — single point of failure (find them, eliminate them)
- Active-Active — all nodes serve traffic
- Active-Passive — standby takes over (DB primary-replica)
- RPO — Recovery Point Objective = how much data we can lose
- RTO — Recovery Time Objective = how fast we recover
- Backups — full / incremental / differential
- Liveness probe — is process alive? (restart on fail)
- Readiness probe — can it serve traffic? (remove from LB on fail)
- Chaos engineering — intentionally break things in controlled ways (Chaos Monkey)
RPO/RTO mnemonic. RPO is “how far back” (data loss tolerance). RTO is “how long down” (downtime tolerance). Lower = pricier.
Remember. Untested backup is not a backup. Test restore drills. Health checks + auto-failover = invisible recovery. Multi-region = serious HA.