Cloud networking is the quiet layer behind most “why can’t it connect?” incidents. A service is healthy, logs look fine, yet traffic never arrives. The fix is usually not magic — it’s understanding the path: DNS → routing → firewalls (security groups/NACLs) → the app port. This guide explains the building blocks (VPCs, subnets, NAT, security groups) and gives you a repeatable troubleshooting flow.
Quickstart
Use this when a workload “should” connect but doesn’t. The goal is to eliminate whole classes of issues quickly, before you deep-dive into cloud console pages.
The 5-minute connectivity checklist
- Confirm DNS: does the hostname resolve to the IP you expect?
- Confirm route: is there a route from source subnet to destination (or to NAT/IGW for internet)?
- Confirm firewall: do security groups allow the port (and does the destination allow inbound from the source)?
- Confirm return path: for private traffic, does the destination subnet route back? (Asymmetric routing bites.)
- Confirm the app: is the process listening on the port and bound to the right interface?
Fast symptom-to-layer mapping
| Symptom | Usually means | Check first |
|---|---|---|
| Hostname doesn’t resolve | DNS / name not published | Private DNS settings, hosted zone, resolver rules |
| Timeout (hangs) | Routing or firewall drop | Route tables, NACLs, security groups |
| Immediate “connection refused” | Port reachable; app not listening | Service binding, target port, health checks |
| Works from one subnet, not another | Subnet routes / SG source mismatch | Source CIDR rules, subnet association with route table |
| Outbound works; inbound doesn’t | Public exposure missing or blocked | IGW/route, public IP, inbound SG/NACL |
Every connection is a path with gates. If you list the gates in order, you’ll fix issues faster: name → route → allowlist → listener.
Temporarily setting inbound rules to 0.0.0.0/0 (or opening SSH/RDP) is a common panic move — and a common security incident. Prefer a controlled test source (your VPN/bastion) and narrow, time-boxed rules.
Overview
Cloud networking basics are the foundation for reliable deployments: private application tiers, secure databases, stable outbound access, and predictable traffic flow. The vocabulary can feel intimidating (VPCs, subnets, NAT, security groups), but the core ideas are simple: define address space, split it into zones, control routes, and enforce allowlists.
What this post covers
- How a VPC and subnets map to real traffic paths
- Public vs private subnets, and how NAT fits into outbound access
- Security groups and what “stateful” means in practice
- Route tables, internet gateways, and the most common misconfigurations
- A step-by-step mini design you can reuse for typical web apps
Why it matters
- Most production outages are “small” networking mistakes with big consequences
- Good defaults (private by default, least privilege, clear routes) reduce incident load
- Networking choices impact cost (NAT, cross-zone traffic) and security posture
- Debugging becomes fast when you know which layer can cause which symptom
The goal: a network that’s easy to reason about
A “good” network is not the one with the most features — it’s the one you can explain on a whiteboard in 60 seconds and troubleshoot under pressure. You’ll get there by keeping the design consistent: standard CIDRs, clear public/private split, and security rules aligned with application boundaries.
Core concepts
VPC (or VNet): your private address space
A VPC is a logically isolated network where you choose a CIDR range (like 10.0.0.0/16).
Everything inside gets private IPs from that range. Think of it as “your data center network,” but with programmable routing and firewalls.
CIDR and subnets: how you carve up the space
CIDR notation defines how big an IP block is. A /16 is large; a /24 is much smaller.
You split a VPC into subnets (often per Availability Zone) so you can control routing and isolation boundaries.
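To make prefix sizes concrete: the usable-address count is 2^(32 − prefix) minus the network and broadcast addresses (cloud providers typically reserve a few more per subnet; AWS reserves five in total). A quick shell sketch:

```shell
#!/usr/bin/env bash
# Usable IPv4 addresses per prefix length: 2^(32 - prefix), minus network + broadcast.
# Note: cloud providers reserve extra addresses per subnet (AWS reserves 5 in total).
for prefix in 16 20 24 28; do
  echo "/$prefix -> $(( 2 ** (32 - prefix) - 2 )) usable addresses"
done
```

A /24 per AZ (254 usable addresses) sounds generous until autoscaling groups or per-pod IP allocation start consuming the pool.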
| Concept | What it controls | Common default | Typical pitfall |
|---|---|---|---|
| VPC CIDR | Total IP space | 10.0.0.0/16 | Overlaps with on-prem/VPN ranges |
| Subnet CIDR | Placement + routing domain | /24 per AZ | Too small (IP exhaustion) or inconsistent per env |
| Route table | Where packets can go | Public vs private route tables | Subnet associated with the wrong route table |
| Security group | Instance/ENI allowlist | Least privilege | Opening broad inbound “temporarily” |
Public vs private subnets
A subnet becomes “public” when it has a route to an Internet Gateway and instances can receive public IPs. A “private” subnet has no direct route to the internet. Private subnets are where you put databases, internal services, and most app workloads.
“Public subnet” describes routing. Whether a workload is reachable depends on security groups, load balancers, and public IPs. You can run workloads in a public subnet that are still not reachable if inbound rules block them.
Internet Gateway vs NAT
- Internet Gateway (IGW): enables inbound/outbound internet for public subnets (when routes allow it).
- NAT (gateway/instance): enables outbound-only internet from private subnets. NAT is for updates, package installs, external APIs — not for inbound traffic.
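A hedged sanity check for the outbound path: from a private instance, ask an external echo service which source IP it sees; it should be the NAT's public IP. The sketch below uses checkip.amazonaws.com, but any equivalent endpoint works:

```shell
#!/usr/bin/env bash
# Sketch: confirm outbound internet from a private subnet and show the NAT's public IP.
# If this hangs or fails, check the private route table's 0.0.0.0/0 -> NAT route first.
command -v curl >/dev/null || { echo "curl not available"; exit 0; }
curl -s --max-time 5 https://checkip.amazonaws.com \
  || echo "no outbound path (check private route table -> NAT, and the NAT's own subnet/IGW)"
```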
Security groups vs network ACLs
The two most common “it should work” blockers are security groups and NACLs. The key difference is statefulness: whether return traffic is allowed automatically or must be permitted explicitly.
| Control | Applies to | State | How failures look |
|---|---|---|---|
| Security group | Instances/ENIs (workload-level) | Stateful (return traffic allowed) | Timeouts; only specific ports blocked |
| Network ACL | Subnet boundary (network-level) | Stateless (must allow both directions) | Timeouts; can break “randomly” due to ephemeral ports |
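The ephemeral-ports pitfall is concrete: a stateless NACL must allow return traffic to whatever client-side ports the OS picks for outgoing connections. On Linux you can read that range directly (a sketch; the exact range varies by distro and tuning):

```shell
#!/usr/bin/env bash
# Linux chooses client-side (ephemeral) ports from this kernel range; a stateless
# NACL must allow inbound return traffic across all of it. Assumption: Linux host.
if [ -r /proc/sys/net/ipv4/ip_local_port_range ]; then
  cat /proc/sys/net/ipv4/ip_local_port_range   # e.g. "32768 60999" on many distros
else
  echo "ip_local_port_range not available (not Linux?)"
fi
```

This is why NACL guidance often ends up allowing 1024–65535 for return traffic; if that feels too broad, it is another argument for doing most filtering in stateful security groups.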
A simple mental model: the packet’s journey
When you’re stuck, walk the packet:
- Name: resolve hostname to IP (DNS).
- Source policy: does the source allow egress to that destination/port?
- Route: does the source subnet know where to send it (local, peering, NAT, IGW)?
- Destination policy: does the destination allow ingress from that source/port?
- App: is something listening and healthy behind the destination?
Timeouts usually mean a firewall/routing drop. “Connection refused” usually means the port is reachable but the app is not accepting connections.
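You can feel the difference between the two failure modes locally: connecting to a loopback port with no listener fails instantly (refused), while a firewall or routing drop leaves the client hanging until a timeout. A small sketch, assuming Linux with bash's /dev/tcp support and that port 59999 is unused:

```shell
#!/usr/bin/env bash
# Distinguish "refused" (instant: host reachable, nothing listening) from
# "timeout" (silent drop: firewall/routing). Port 59999 is assumed to be free.
if timeout 2 bash -c 'exec 3<>/dev/tcp/127.0.0.1/59999' 2>/dev/null; then
  echo "connected (unexpected: something is listening on 59999)"
else
  rc=$?
  if [ "$rc" -eq 124 ]; then
    echo "timeout -> suspect a routing/firewall drop"
  else
    echo "refused -> port reachable, app not listening"
  fi
fi
```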
Step-by-step
Let’s build a “classic” layout: public entry + private app tier + private database tier. This pattern translates across providers (AWS VPC / Azure VNet / GCP VPC), even if the names differ.
Step 1: Plan CIDRs and tiers
A sane starter plan
- One VPC per environment: dev, staging, prod (or at least separate prod)
- Non-overlapping CIDRs across environments (helps with VPN/peering later)
- Two or three AZs for availability
- Subnets per AZ: public and private (optionally split private into app/db)
Design gotchas to avoid
- Overlapping CIDRs with on-prem, other VPCs, or partner networks
- Subnets too small (IP exhaustion happens faster than expected)
- Mixing unrelated apps in one security boundary (“everything can talk to everything”)
- Relying on “temporary” open rules that become permanent
Step 2: Route tables, IGW, and NAT (make traffic intentional)
Routing decides where packets can go. The simplest stable setup uses two route tables: a public route table (with a default route to an internet gateway) and a private route table (with a default route to NAT).
The minimum routing rules to remember
- Public subnet: default route (0.0.0.0/0) → IGW
- Private subnet: default route (0.0.0.0/0) → NAT (for outbound), no direct IGW route
- Local VPC CIDR: routed internally automatically (east-west inside the VPC)
Step 3: Security groups (least privilege that still works)
Security groups are where you encode “who can talk to whom.” A practical approach is to model your application layers: load balancer → app → database. Each layer gets its own security group.
Layered rules you can reuse
- LB SG: inbound 80/443 from the internet (or from your CDN), outbound to app SG
- App SG: inbound from LB SG on app port, outbound to DB SG on DB port and to required external services
- DB SG: inbound only from app SG on DB port, no public inbound
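With the AWS CLI, referencing the app tier's security group instead of a CIDR looks like the sketch below. The group ids are placeholders, and --dry-run asks EC2 to validate the call without changing anything:

```shell
#!/usr/bin/env bash
# Sketch: allow Postgres into the DB SG only from members of the app SG.
# sg-0db... and sg-0app... are placeholder ids; requires a configured AWS CLI.
command -v aws >/dev/null || { echo "aws CLI not available"; exit 0; }
aws ec2 authorize-security-group-ingress \
  --group-id sg-0db0000000000000 \
  --protocol tcp --port 5432 \
  --source-group sg-0app000000000000 \
  --dry-run || true   # --dry-run reports DryRunOperation instead of applying
```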
Common rule mistakes
- Opening DB ports to a broad CIDR instead of referencing the app SG
- Forgetting egress restrictions (or forgetting that some platforms default to “allow all egress”)
- Allowing SSH/RDP from the internet rather than from a controlled admin path
- Mixing environments in one SG (“dev can reach prod”)
Step 4: A minimal VPC layout as code (example)
The snippet below shows a compact AWS-style layout: VPC, one public subnet, one private subnet, IGW, NAT, and security groups. Use it as a mental model even if you’re on another cloud — the pieces are the same. (Keep production setups multi-AZ; this is minimal on purpose.)
terraform {
required_providers {
aws = { source = "hashicorp/aws", version = "~> 5.0" }
}
}
provider "aws" {
region = var.region
}
variable "region" {
  type    = string
  default = "eu-central-1"
}
variable "vpc_cidr" {
  type    = string
  default = "10.20.0.0/16"
}
resource "aws_vpc" "main" {
cidr_block = var.vpc_cidr
enable_dns_support = true
enable_dns_hostnames = true
tags = { Name = "unilab-net" }
}
resource "aws_internet_gateway" "igw" {
vpc_id = aws_vpc.main.id
tags = { Name = "unilab-igw" }
}
resource "aws_subnet" "public_a" {
vpc_id = aws_vpc.main.id
cidr_block = "10.20.10.0/24"
availability_zone = "${var.region}a"
map_public_ip_on_launch = true
tags = { Name = "public-a" }
}
resource "aws_subnet" "private_a" {
vpc_id = aws_vpc.main.id
cidr_block = "10.20.20.0/24"
availability_zone = "${var.region}a"
tags = { Name = "private-a" }
}
resource "aws_route_table" "public" {
vpc_id = aws_vpc.main.id
route {
cidr_block = "0.0.0.0/0"
gateway_id = aws_internet_gateway.igw.id
}
tags = { Name = "rt-public" }
}
resource "aws_route_table_association" "public_a" {
subnet_id = aws_subnet.public_a.id
route_table_id = aws_route_table.public.id
}
resource "aws_eip" "nat" {
domain = "vpc"
tags = { Name = "eip-nat" }
}
resource "aws_nat_gateway" "nat" {
allocation_id = aws_eip.nat.id
subnet_id = aws_subnet.public_a.id
tags = { Name = "nat-a" }
depends_on = [aws_internet_gateway.igw]
}
resource "aws_route_table" "private" {
vpc_id = aws_vpc.main.id
route {
cidr_block = "0.0.0.0/0"
nat_gateway_id = aws_nat_gateway.nat.id
}
tags = { Name = "rt-private" }
}
resource "aws_route_table_association" "private_a" {
subnet_id = aws_subnet.private_a.id
route_table_id = aws_route_table.private.id
}
# Security groups (LB -> App -> DB pattern)
resource "aws_security_group" "app" {
name = "sg-app"
vpc_id = aws_vpc.main.id
ingress {
description = "App port from LB"
from_port = 8080
to_port = 8080
protocol = "tcp"
cidr_blocks = ["10.20.10.0/24"] # or reference an LB security group in real setups
}
egress {
description = "Allow outbound (tighten per your needs)"
from_port = 0
to_port = 0
protocol = "-1"
cidr_blocks = ["0.0.0.0/0"]
}
tags = { Name = "sg-app" }
}
resource "aws_security_group" "db" {
name = "sg-db"
vpc_id = aws_vpc.main.id
ingress {
description = "DB from app"
from_port = 5432
to_port = 5432
protocol = "tcp"
security_groups = [aws_security_group.app.id]
}
tags = { Name = "sg-db" }
}
For production, place private subnets in multiple AZs and consider NAT per AZ to avoid cross-AZ dependencies. Keep the design consistent across environments so “it works in staging” means something.
Step 5: Troubleshoot with repeatable probes
When connectivity fails, use probes that tell you which layer is failing: DNS, TCP reachability, TLS, and HTTP. The commands below are safe and fast, and work from most Linux containers/VMs.
#!/usr/bin/env bash
set -euo pipefail
# Usage:
# ./netcheck.sh example.internal 5432
# ./netcheck.sh api.example.com 443
#
# These checks help you identify: DNS issue vs route/firewall drop vs app not listening.
HOST="${1:?host required}"
PORT="${2:-443}"
echo "== DNS resolution =="
( command -v dig >/dev/null && dig +short "$HOST" ) || ( getent hosts "$HOST" || true )
echo
echo "== TCP reachability (timeout means drop; refused means app not listening) =="
if command -v nc >/dev/null; then
nc -vz -w 3 "$HOST" "$PORT" || true
else
# Bash TCP check fallback
timeout 3 bash -c "cat </dev/null >/dev/tcp/$HOST/$PORT" && echo "TCP: OK" || echo "TCP: FAIL"
fi
echo
echo "== TLS/HTTP probe (if applicable) =="
if [[ "$PORT" == "443" || "$PORT" == "8443" ]]; then
curl -sS -m 5 -I "https://$HOST:$PORT" || true
else
curl -sS -m 5 -I "http://$HOST:$PORT" || true
fi
echo
echo "== Trace route hint (may be blocked by firewalls) =="
( command -v traceroute >/dev/null && traceroute -n -w 1 -q 1 "$HOST" | head -n 8 ) || true
echo
echo "Next steps if it fails:"
echo " - DNS fails: check private DNS/hosted zone/resolver rules"
echo " - TCP timeout: check route table + SG/NACL (drops look like timeouts)"
echo " - Refused: app/service not listening or wrong target port"
Step 6: Going beyond basics (when your network grows)
Once you have the basics stable, you’ll encounter “real-world” requirements: private access to managed services, hybrid connectivity, and multi-network topologies. Here are the most common next steps and when to use them:
| Need | Typical solution | What to watch out for |
|---|---|---|
| Private access to cloud APIs (no public internet) | VPC endpoints / PrivateLink / service endpoints | DNS settings and endpoint policies |
| Connect two networks privately | VPC peering / transit gateway / hub-spoke | Overlapping CIDRs, route propagation, governance |
| On-prem connectivity | VPN / Direct Connect / ExpressRoute / Interconnect | Routing (BGP), MTU, firewall ownership |
| Traffic visibility | Flow logs, load balancer logs, packet mirroring | Cost and retention; make logs searchable |
Bonus: intra-cluster networking still matters
Even with perfect VPC settings, platforms like Kubernetes add another layer (pod-to-pod traffic, network policies). The pattern is similar: default deny, then allow the specific flows your app needs. One caution: once a policy restricts Egress, you must also explicitly allow DNS (port 53), or name lookups from the pod will fail.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: api-allow-db
namespace: default
spec:
podSelector:
matchLabels:
app: api
policyTypes:
- Ingress
- Egress
ingress:
- from:
- podSelector:
matchLabels:
app: ingress
ports:
- protocol: TCP
port: 8080
egress:
- to:
- podSelector:
matchLabels:
app: db
ports:
- protocol: TCP
port: 5432
Common mistakes
These are the issues that show up again and again in incident reviews. If you recognize your setup in one of these, you’ve found a great place to fix first.
Putting databases in public subnets
Public routing increases the chance of accidental exposure.
- Fix: keep DBs in private subnets and allow inbound only from app security groups.
- Fix: use a bastion/SSM/VPN for admin access instead of public inbound rules.
Wrong route table association
A subnet is only as private as its route table: associate it with the public route table and it quietly becomes public.
- Fix: audit which subnets are associated with which route tables.
- Fix: name things consistently: rt-public, rt-private, public-a, private-a.
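The audit can start from the CLI; this sketch (placeholder VPC id, assumes a configured AWS CLI) lists each route table together with the subnets associated with it:

```shell
#!/usr/bin/env bash
# Sketch: which subnets are associated with which route tables in one VPC?
# vpc-0123... is a placeholder; requires a configured AWS CLI.
command -v aws >/dev/null || { echo "aws CLI not available"; exit 0; }
aws ec2 describe-route-tables \
  --filters "Name=vpc-id,Values=vpc-0123456789abcdef0" \
  --query 'RouteTables[].{RouteTable:RouteTableId,Subnets:Associations[].SubnetId}' \
  --output table || true
```

Subnets with no explicit association fall back to the VPC's main route table, which is a classic source of accidental public routing.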
Opening wide inbound rules “temporarily”
The temporary rule often becomes the permanent liability.
- Fix: restrict inbound to known admin CIDRs, VPN, or bastion security groups.
- Fix: add time-boxed change controls (tickets, expiry tags, automated cleanup).
Forgetting stateless rules (NACLs) and ephemeral ports
Stateless filtering can break return traffic if you only allow “the server port.”
- Fix: prefer security groups for most filtering; keep NACLs simple if you use them.
- Fix: when you must use NACLs, allow return traffic (ephemeral ports) explicitly.
Treating NAT as a generic “internet switch”
NAT is outbound only; it won’t make private workloads reachable from outside.
- Fix: for inbound access, use a load balancer, public endpoint, or VPN/bastion pattern.
- Fix: for private-only services, use internal load balancers and private DNS.
DNS mismatches (public vs private names)
The service is up, but clients are resolving the wrong address.
- Fix: decide which names are public and which are private; document it.
- Fix: verify resolver behavior from inside the VPC (not from your laptop).
The easiest networks to run are the ones with few exceptions. Standardize subnets, standardize route tables, standardize security groups by tier, and keep naming consistent across environments.
FAQ
What’s the difference between a VPC and a subnet?
A VPC is the full private network and address space (CIDR range). A subnet is a slice of that space, usually tied to an AZ, where you attach routing and apply subnet-level controls. You place workloads into subnets to control exposure and traffic flow.
How do I know if a subnet is “public” or “private”?
A subnet is effectively public if its route table has a default route (0.0.0.0/0) to an Internet Gateway and instances can have public IPs.
A subnet is private if it does not route directly to the IGW. Private subnets often route outbound through NAT instead.
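From the CLI, one way to answer this (subnet id is a placeholder; assumes a configured AWS CLI) is to inspect the default route of the route table associated with the subnet:

```shell
#!/usr/bin/env bash
# Sketch: show the default-route targets for the route table serving a subnet.
# subnet-0123... is a placeholder; requires a configured AWS CLI.
command -v aws >/dev/null || { echo "aws CLI not available"; exit 0; }
aws ec2 describe-route-tables \
  --filters "Name=association.subnet-id,Values=subnet-0123456789abcdef0" \
  --query "RouteTables[].Routes[?DestinationCidrBlock=='0.0.0.0/0'].[GatewayId,NatGatewayId]" \
  --output text || true
# igw-* in the first column => public routing; nat-* in the second => private with NAT.
```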
What does NAT actually do?
NAT enables outbound internet access for private subnets by translating private source IPs to a public IP on the way out. NAT does not enable inbound internet access to private instances. For inbound access, use a load balancer, public service, or a private admin path (VPN/bastion/SSM).
Are security groups stateful, and why do I care?
In many clouds, security groups are stateful: if you allow inbound to a port, return traffic is allowed automatically. This makes them easier to reason about than stateless controls (like many NACLs), which require explicit rules in both directions.
Why do I get timeouts instead of explicit errors?
Firewalls and routing drops usually produce timeouts because packets are silently discarded. That’s why the troubleshooting flow starts with DNS, then probes TCP reachability, then checks route tables and security rules.
What’s a safe default security group strategy?
Use layered security groups aligned to your architecture: an inbound-facing layer (LB), an app layer, and a data layer. Allow only the required ports between layers, and avoid broad CIDR allowlists when you can reference security groups directly.
When should I use VPC endpoints / PrivateLink?
Use them when you want private access to managed services (object storage, secrets, APIs) without routing through the public internet. They reduce exposure and can simplify egress control — but still require careful DNS and policy configuration.
Cheatsheet
Keep this nearby for incident response. It’s designed to help you identify the failing layer fast and apply the right fix without guesswork.
Connectivity debugging order
- DNS: resolve the hostname from inside the VPC/subnet
- Port probe: is TCP reachable? (timeout vs refused)
- Routes: source subnet route table and destination route back
- Security groups: source egress + destination ingress for the port
- NACLs: if used, verify both directions (ephemeral ports)
- App/listener: health checks, binding address, target port
Default architecture checklist
- Private subnets for app and DB tiers
- Public subnets only for ingress components (LB, NAT)
- Two route tables: public (IGW), private (NAT)
- Layered security groups: LB → app → DB
- Private DNS names for internal services
- Flow logs enabled (with reasonable retention)
Quick reference: what each component is for
| Component | Primary purpose | Typical mistake | Fix |
|---|---|---|---|
| VPC/VNet | Private network boundary + CIDR space | Overlapping CIDRs | Plan CIDRs upfront; document allocations |
| Subnet | Placement + routing domain | Wrong route table attached | Audit associations; standardize naming |
| Route table | Where traffic can go | Missing default route (NAT/IGW) | Add correct routes per subnet type |
| Internet Gateway | Public inbound/outbound capability | Accidental exposure via public routing | Keep only ingress tier in public subnets |
| NAT | Private subnet outbound internet | Expecting inbound to work via NAT | Use LB/VPN/bastion for inbound admin |
| Security group | Workload allowlist (stateful) | Wide inbound rules | Least privilege; reference SGs |
When in doubt, sketch the flow: client → LB → app → DB. Then label the gates: route tables and security rules at each hop. That sketch becomes your runbook.
Wrap-up
Cloud networking basics aren’t about memorizing provider-specific names — they’re about understanding the path. Once you internalize DNS → routing → security groups/NACLs → listener, most connectivity issues become quick, mechanical fixes. Keep your network simple, private by default, and consistent across environments, and troubleshooting becomes a checklist instead of a mystery.
Next actions
- Audit your subnets: which are public, which are private, and which route tables they use
- Review security groups: ensure tiered access (LB → app → DB) and remove broad inbound rules
- Enable flow logs (and set retention) so timeouts aren’t blind spots
- Create a short incident runbook using the cheatsheet debugging order
Want to connect networking to the rest of your platform? The related posts cover Terraform structure, cost drivers like NAT/egress, and Kubernetes fundamentals that show up in real environments.