Cloud & DevOps · Networking

Cloud Networking Basics: VPCs, Subnets, NAT, and Security Groups

The building blocks behind every “why can’t it connect?” issue.

Reading time: ~8–12 min
Level: All levels

Cloud networking is the quiet layer behind most “why can’t it connect?” incidents. A service is healthy, logs look fine, yet traffic never arrives. The fix is usually not magic — it’s understanding the path: DNS → routing → firewalls (security groups/NACLs) → the app port. This guide explains the building blocks (VPCs, subnets, NAT, security groups) and gives you a repeatable troubleshooting flow.


Quickstart

Use this when a workload “should” connect but doesn’t. The goal is to eliminate whole classes of issues quickly, before you deep-dive into cloud console pages.

The 5-minute connectivity checklist

  • Confirm DNS: does the hostname resolve to the IP you expect?
  • Confirm route: is there a route from source subnet to destination (or to NAT/IGW for internet)?
  • Confirm firewall: do security groups allow the port (and does the destination allow inbound from the source)?
  • Confirm return path: for private traffic, does the destination subnet route back? (Asymmetric routing bites.)
  • Confirm the app: is the process listening on the port and bound to the right interface?

Fast symptom-to-layer mapping

  • Hostname doesn’t resolve → usually DNS / name not published. Check first: private DNS settings, hosted zone, resolver rules.
  • Timeout (hangs) → usually a routing or firewall drop. Check first: route tables, NACLs, security groups.
  • Immediate “connection refused” → port reachable; app not listening. Check first: service binding, target port, health checks.
  • Works from one subnet, not another → subnet routes / SG source mismatch. Check first: source CIDR rules, subnet association with route table.
  • Outbound works; inbound doesn’t → public exposure missing or blocked. Check first: IGW/route, public IP, inbound SG/NACL.

Think in “paths,” not in services

Every connection is a path with gates. If you list the gates in order, you’ll fix issues faster: name → route → allowlist → listener.

Avoid the “open it to the world” reflex

Temporarily setting inbound rules to 0.0.0.0/0 (or opening SSH/RDP) is a common panic move — and a common security incident. Prefer a controlled test source (your VPN/bastion) and narrow, time-boxed rules.

Overview

Cloud networking basics are the foundation for reliable deployments: private application tiers, secure databases, stable outbound access, and predictable traffic flow. The vocabulary can feel intimidating (VPCs, subnets, NAT, security groups), but the core ideas are simple: define address space, split it into zones, control routes, and enforce allowlists.

What this post covers

  • How a VPC and subnets map to real traffic paths
  • Public vs private subnets, and how NAT fits into outbound access
  • Security groups and what “stateful” means in practice
  • Route tables, internet gateways, and the most common misconfigurations
  • A step-by-step mini design you can reuse for typical web apps

Why it matters

  • Most production outages are “small” networking mistakes with big consequences
  • Good defaults (private by default, least privilege, clear routes) reduce incident load
  • Networking choices impact cost (NAT, cross-zone traffic) and security posture
  • Debugging becomes fast when you know which layer can cause which symptom

The goal: a network that’s easy to reason about

A “good” network is not the one with the most features — it’s the one you can explain on a whiteboard in 60 seconds and troubleshoot under pressure. You’ll get there by keeping the design consistent: standard CIDRs, clear public/private split, and security rules aligned with application boundaries.

Core concepts

VPC (or VNet): your private address space

A VPC is a logically isolated network where you choose a CIDR range (like 10.0.0.0/16). Everything inside gets private IPs from that range. Think of it as “your data center network,” but with programmable routing and firewalls.

CIDR and subnets: how you carve up the space

CIDR notation defines how big an IP block is. A /16 is large; a /24 is much smaller. You split a VPC into subnets (often per Availability Zone) so you can control routing and isolation boundaries.
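A quick way to internalize prefix sizes: a /N block contains 2^(32-N) addresses. Here is a small bash sketch (pure arithmetic, no cloud APIs; note that providers reserve a few addresses per subnet, e.g. AWS reserves 5):

```shell
#!/usr/bin/env bash
# Address count for a CIDR prefix: a /N block holds 2^(32-N) addresses.
# Usable capacity is slightly lower because cloud providers reserve a
# few addresses per subnet (AWS reserves 5).
prefix_size() {
  echo $(( 2 ** (32 - $1) ))
}

echo "/16 -> $(prefix_size 16) addresses"   # 65536
echo "/24 -> $(prefix_size 24) addresses"   # 256
echo "/28 -> $(prefix_size 28) addresses"   # 16
```

A /16 VPC therefore holds 256 /24 subnets, which is why the /16-plus-/24s split is such a common default.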

  • VPC CIDR → controls total IP space. Common default: 10.0.0.0/16. Typical pitfall: overlaps with on-prem/VPN ranges.
  • Subnet CIDR → controls placement + routing domain. Common default: /24 per AZ. Typical pitfall: too small (IP exhaustion) or inconsistent per env.
  • Route table → controls where packets can go. Common default: public vs private route tables. Typical pitfall: subnet associated with the wrong route table.
  • Security group → controls the instance/ENI allowlist. Common default: least privilege. Typical pitfall: opening broad inbound “temporarily”.

Public vs private subnets

A subnet becomes “public” when it has a route to an Internet Gateway and instances can receive public IPs. A “private” subnet has no direct route to the internet. Private subnets are where you put databases, internal services, and most app workloads.

Public subnet ≠ publicly accessible service

“Public subnet” describes routing. Whether a workload is reachable depends on security groups, load balancers, and public IPs. You can run workloads in a public subnet that are still not reachable if inbound rules block them.

Internet Gateway vs NAT

  • Internet Gateway (IGW): enables inbound/outbound internet for public subnets (when routes allow it).
  • NAT (gateway/instance): enables outbound-only internet from private subnets. NAT is for updates, package installs, external APIs — not for inbound traffic.

Security groups vs network ACLs

The two most common “it should work” blockers are security groups and NACLs. The key difference is how they behave.

  • Security group → applies to instances/ENIs (workload-level). Stateful: return traffic is allowed automatically. Failures look like timeouts, usually on specific ports.
  • Network ACL → applies at the subnet boundary (network-level). Stateless: both directions must be allowed explicitly. Failures look like timeouts and can break “randomly” due to ephemeral ports.

A simple mental model: the packet’s journey

When you’re stuck, walk the packet:

  1. Name: resolve hostname to IP (DNS).
  2. Source policy: does the source allow egress to that destination/port?
  3. Route: does the source subnet know where to send it (local, peering, NAT, IGW)?
  4. Destination policy: does the destination allow ingress from that source/port?
  5. App: is something listening and healthy behind the destination?

If it’s a timeout, assume a “drop”

Timeouts usually mean a firewall/routing drop. “Connection refused” usually means the port is reachable but the app is not accepting connections.
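You can feel the difference from any Linux shell with bash’s built-in /dev/tcp. This probe is a sketch: port 1 on localhost is just an example of a port that usually has no listener.

```shell
#!/usr/bin/env bash
# Classify a TCP probe result: open, refused (reachable, no listener),
# or timeout (packets silently dropped by routing/firewall).
probe() {
  local host=$1 port=$2 rc=0
  timeout 3 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null || rc=$?
  case $rc in
    0)   echo "open" ;;
    124) echo "timeout: likely route/firewall drop" ;;
    *)   echo "refused: reachable, but nothing listening" ;;
  esac
}

probe 127.0.0.1 1    # usually "refused" on a default host
```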

Step-by-step

Let’s build a “classic” layout: public entry + private app tier + private database tier. This pattern translates across providers (AWS VPC / Azure VNet / GCP VPC), even if the names differ.

Step 1: Plan CIDRs and tiers

A sane starter plan

  • One VPC per environment: dev, staging, prod (or at least separate prod)
  • Non-overlapping CIDRs across environments (helps with VPN/peering later)
  • Two or three AZs for availability
  • Subnets per AZ: public and private (optionally split private into app/db)

Design gotchas to avoid

  • Overlapping CIDRs with on-prem, other VPCs, or partner networks
  • Subnets too small (IP exhaustion happens faster than expected)
  • Mixing unrelated apps in one security boundary (“everything can talk to everything”)
  • Relying on “temporary” open rules that become permanent
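The starter plan above can be written down as a concrete, non-overlapping allocation. The octet scheme and names below are illustrative, not a standard; adapt them to your environment:

```shell
#!/usr/bin/env bash
# Illustrative CIDR plan: one /16 per environment; public /24s use
# third octets 10-19, private /24s use 20-29, one per AZ.
plan_env() {
  local name=$1 oct=$2
  echo "$name vpc:       10.$oct.0.0/16"
  echo "$name public-a:  10.$oct.10.0/24"
  echo "$name public-b:  10.$oct.11.0/24"
  echo "$name private-a: 10.$oct.20.0/24"
  echo "$name private-b: 10.$oct.21.0/24"
}

plan_env dev 0
plan_env staging 1
plan_env prod 2
```

Writing the plan down once (and checking new ranges against it) is what prevents the overlapping-CIDR problems that surface later with VPNs and peering.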

Step 2: Route tables, IGW, and NAT (make traffic intentional)

Routing decides where packets can go. The simplest stable setup uses two route tables: a public route table (with a default route to an internet gateway) and a private route table (with a default route to NAT).

The minimum routing rules to remember

  • Public subnet: default route (0.0.0.0/0) → IGW
  • Private subnet: default route (0.0.0.0/0) → NAT (for outbound), no direct IGW
  • Local VPC CIDR: routed internally automatically (east-west inside the VPC)

Step 3: Security groups (least privilege that still works)

Security groups are where you encode “who can talk to whom.” A practical approach is to model your application layers: load balancer → app → database. Each layer gets its own security group.

Layered rules you can reuse

  • LB SG: inbound 80/443 from the internet (or from your CDN), outbound to app SG
  • App SG: inbound from LB SG on app port, outbound to DB SG on DB port and to required external services
  • DB SG: inbound only from app SG on DB port, no public inbound

Common rule mistakes

  • Opening DB ports to a broad CIDR instead of referencing the app SG
  • Forgetting egress restrictions (or forgetting that some platforms default to “allow all egress”)
  • Allowing SSH/RDP from the internet rather than from a controlled admin path
  • Mixing environments in one SG (“dev can reach prod”)

Step 4: A minimal VPC layout as code (example)

The snippet below shows a compact AWS-style layout: VPC, one public subnet, one private subnet, IGW, NAT, and security groups. Use it as a mental model even if you’re on another cloud — the pieces are the same. (Keep production setups multi-AZ; this is minimal on purpose.)

terraform {
  required_providers {
    aws = { source = "hashicorp/aws", version = "~> 5.0" }
  }
}

provider "aws" {
  region = var.region
}

variable "region" {
  type    = string
  default = "eu-central-1"
}

variable "vpc_cidr" {
  type    = string
  default = "10.20.0.0/16"
}

resource "aws_vpc" "main" {
  cidr_block           = var.vpc_cidr
  enable_dns_support   = true
  enable_dns_hostnames = true
  tags = { Name = "unilab-net" }
}

resource "aws_internet_gateway" "igw" {
  vpc_id = aws_vpc.main.id
  tags = { Name = "unilab-igw" }
}

resource "aws_subnet" "public_a" {
  vpc_id                  = aws_vpc.main.id
  cidr_block              = "10.20.10.0/24"
  availability_zone       = "${var.region}a"
  map_public_ip_on_launch = true
  tags = { Name = "public-a" }
}

resource "aws_subnet" "private_a" {
  vpc_id            = aws_vpc.main.id
  cidr_block        = "10.20.20.0/24"
  availability_zone = "${var.region}a"
  tags = { Name = "private-a" }
}

resource "aws_route_table" "public" {
  vpc_id = aws_vpc.main.id
  route {
    cidr_block = "0.0.0.0/0"
    gateway_id = aws_internet_gateway.igw.id
  }
  tags = { Name = "rt-public" }
}

resource "aws_route_table_association" "public_a" {
  subnet_id      = aws_subnet.public_a.id
  route_table_id = aws_route_table.public.id
}

resource "aws_eip" "nat" {
  domain = "vpc"
  tags = { Name = "eip-nat" }
}

resource "aws_nat_gateway" "nat" {
  allocation_id = aws_eip.nat.id
  subnet_id     = aws_subnet.public_a.id
  tags = { Name = "nat-a" }
  depends_on = [aws_internet_gateway.igw]
}

resource "aws_route_table" "private" {
  vpc_id = aws_vpc.main.id
  route {
    cidr_block     = "0.0.0.0/0"
    nat_gateway_id = aws_nat_gateway.nat.id
  }
  tags = { Name = "rt-private" }
}

resource "aws_route_table_association" "private_a" {
  subnet_id      = aws_subnet.private_a.id
  route_table_id = aws_route_table.private.id
}

# Security groups (LB -> App -> DB pattern)
resource "aws_security_group" "app" {
  name   = "sg-app"
  vpc_id = aws_vpc.main.id

  ingress {
    description = "App port from LB"
    from_port   = 8080
    to_port     = 8080
    protocol    = "tcp"
    cidr_blocks = ["10.20.10.0/24"] # or reference an LB security group in real setups
  }

  egress {
    description = "Allow outbound (tighten per your needs)"
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }

  tags = { Name = "sg-app" }
}

resource "aws_security_group" "db" {
  name   = "sg-db"
  vpc_id = aws_vpc.main.id

  ingress {
    description     = "DB from app"
    from_port       = 5432
    to_port         = 5432
    protocol        = "tcp"
    security_groups = [aws_security_group.app.id]
  }

  tags = { Name = "sg-db" }
}

Production note: NAT and availability zones

For production, place private subnets in multiple AZs and consider NAT per AZ to avoid cross-AZ dependencies. Keep the design consistent across environments so “it works in staging” means something.

Step 5: Troubleshoot with repeatable probes

When connectivity fails, use probes that tell you which layer is failing: DNS, TCP reachability, TLS, and HTTP. The commands below are safe and fast, and work from most Linux containers/VMs.

#!/usr/bin/env bash
set -euo pipefail

# Usage:
#   ./netcheck.sh example.internal 5432
#   ./netcheck.sh api.example.com 443
#
# These checks help you identify: DNS issue vs route/firewall drop vs app not listening.

HOST="${1:?host required}"
PORT="${2:-443}"

echo "== DNS resolution =="
( command -v dig >/dev/null && dig +short "$HOST" ) || ( getent hosts "$HOST" || true )

echo
echo "== TCP reachability (timeout means drop; refused means app not listening) =="
if command -v nc >/dev/null; then
  nc -vz -w 3 "$HOST" "$PORT" || true
else
  # Bash TCP check fallback
  timeout 3 bash -c "cat </dev/null >/dev/tcp/$HOST/$PORT" && echo "TCP: OK" || echo "TCP: FAIL"
fi

echo
echo "== TLS/HTTP probe (if applicable) =="
if [[ "$PORT" == "443" || "$PORT" == "8443" ]]; then
  curl -sS -m 5 -I "https://$HOST:$PORT" || true
else
  curl -sS -m 5 -I "http://$HOST:$PORT" || true
fi

echo
echo "== Trace route hint (may be blocked by firewalls) =="
( command -v traceroute >/dev/null && traceroute -n -w 1 -q 1 "$HOST" | head -n 8 ) || true

echo
echo "Next steps if it fails:"
echo "  - DNS fails: check private DNS/hosted zone/resolver rules"
echo "  - TCP timeout: check route table + SG/NACL (drops look like timeouts)"
echo "  - Refused: app/service not listening or wrong target port"

Step 6: Going beyond basics (when your network grows)

Once you have the basics stable, you’ll encounter “real-world” requirements: private access to managed services, hybrid connectivity, and multi-network topologies. Here are the most common next steps and when to use them:

  • Private access to cloud APIs (no public internet) → VPC endpoints / PrivateLink / service endpoints. Watch out for: DNS settings and endpoint policies.
  • Connect two networks privately → VPC peering / transit gateway / hub-spoke. Watch out for: overlapping CIDRs, route propagation, governance.
  • On-prem connectivity → VPN / Direct Connect / ExpressRoute / Interconnect. Watch out for: routing (BGP), MTU, firewall ownership.
  • Traffic visibility → flow logs, load balancer logs, packet mirroring. Watch out for: cost and retention; make logs searchable.

Bonus: intra-cluster networking still matters

Even with perfect VPC settings, platforms like Kubernetes add another layer (pod-to-pod traffic, network policies). The pattern is similar: default deny, then allow the specific flows your app needs. One gotcha: once a policy selects a pod for Egress, anything not explicitly listed is blocked, including DNS, so real-world policies usually also allow port 53 to the cluster resolver.

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: api-allow-db
  namespace: default
spec:
  podSelector:
    matchLabels:
      app: api
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: ingress
      ports:
        - protocol: TCP
          port: 8080
  egress:
    - to:
        - podSelector:
            matchLabels:
              app: db
      ports:
        - protocol: TCP
          port: 5432

Common mistakes

These are the issues that show up again and again in incident reviews. If you recognize your setup in one of these, you’ve found a great place to fix first.

Putting databases in public subnets

Public routing increases the chance of accidental exposure.

  • Fix: keep DBs in private subnets and allow inbound only from app security groups.
  • Fix: use a bastion/SSM/VPN for admin access instead of public inbound rules.

Wrong route table association

A subnet can be “private” only until it gets the public route table.

  • Fix: audit which subnets are associated with which route tables.
  • Fix: name things consistently: rt-public, rt-private, public-a, private-a.

Opening wide inbound rules “temporarily”

The temporary rule often becomes the permanent liability.

  • Fix: restrict inbound to known admin CIDRs, VPN, or bastion security groups.
  • Fix: add time-boxed change controls (tickets, expiry tags, automated cleanup).

Forgetting stateless rules (NACLs) and ephemeral ports

Stateless filtering can break return traffic if you only allow “the server port.”

  • Fix: prefer security groups for most filtering; keep NACLs simple if you use them.
  • Fix: when you must use NACLs, allow return traffic (ephemeral ports) explicitly.
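To make “ephemeral ports” concrete: on Linux the local range lives in procfs (commonly 32768–60999 by default). Stateless NACL rules must allow return traffic across this entire range:

```shell
#!/usr/bin/env bash
# Print the local ephemeral (client-side source) port range on Linux.
# Return traffic for outbound connections arrives on these high ports,
# which is what stateless NACL rules most often forget to allow.
ephemeral_range() {
  local low high
  read -r low high < /proc/sys/net/ipv4/ip_local_port_range
  echo "$low $high"
}

echo "ephemeral port range: $(ephemeral_range)"
```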

Treating NAT as a generic “internet switch”

NAT is outbound only; it won’t make private workloads reachable from outside.

  • Fix: for inbound access, use a load balancer, public endpoint, or VPN/bastion pattern.
  • Fix: for private-only services, use internal load balancers and private DNS.

DNS mismatches (public vs private names)

The service is up, but clients are resolving the wrong address.

  • Fix: decide which names are public and which are private; document it.
  • Fix: verify resolver behavior from inside the VPC (not from your laptop).
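A practical check is to resolve the name through the system resolver path (nsswitch), the way most applications do, rather than with dig, which queries a DNS server directly. `localhost` below is a stand-in for your internal hostname:

```shell
#!/usr/bin/env bash
# Resolve a name via the system resolver path (nsswitch: hosts file,
# then DNS); this matches what applications see. Run it from inside
# the VPC: split-horizon DNS can give your laptop a different answer.
resolve() {
  getent hosts "$1" | awk '{print $1}' | sort -u
}

resolve localhost    # stand-in: swap in your internal service name
```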

Make networking changes boring

The easiest networks to run are the ones with few exceptions. Standardize subnets, standardize route tables, standardize security groups by tier, and keep naming consistent across environments.

FAQ

What’s the difference between a VPC and a subnet?

A VPC is the full private network and address space (CIDR range). A subnet is a slice of that space, usually tied to an AZ, where you attach routing and apply subnet-level controls. You place workloads into subnets to control exposure and traffic flow.

How do I know if a subnet is “public” or “private”?

A subnet is effectively public if its route table has a default route (0.0.0.0/0) to an Internet Gateway and instances can have public IPs. A subnet is private if it does not route directly to the IGW. Private subnets often route outbound through NAT instead.

What does NAT actually do?

NAT enables outbound internet access for private subnets by translating private source IPs to a public IP on the way out. NAT does not enable inbound internet access to private instances. For inbound access, use a load balancer, public service, or a private admin path (VPN/bastion/SSM).

Are security groups stateful, and why do I care?

In many clouds, security groups are stateful: if you allow inbound to a port, return traffic is allowed automatically. This makes them easier to reason about than stateless controls (like many NACLs), which require explicit rules in both directions.

Why do I get timeouts instead of explicit errors?

Firewalls and routing drops usually produce timeouts because packets are silently discarded. That’s why the troubleshooting flow starts with DNS, then probes TCP reachability, then checks route tables and security rules.

What’s a safe default security group strategy?

Use layered security groups aligned to your architecture: an inbound-facing layer (LB), an app layer, and a data layer. Allow only the required ports between layers, and avoid broad CIDR allowlists when you can reference security groups directly.

When should I use VPC endpoints / PrivateLink?

Use them when you want private access to managed services (object storage, secrets, APIs) without routing through the public internet. They reduce exposure and can simplify egress control — but still require careful DNS and policy configuration.

Cheatsheet

Keep this nearby for incident response. It’s designed to help you identify the failing layer fast and apply the right fix without guesswork.

Connectivity debugging order

  1. DNS: resolve the hostname from inside the VPC/subnet
  2. Port probe: is TCP reachable? (timeout vs refused)
  3. Routes: source subnet route table and destination route back
  4. Security groups: source egress + destination ingress for the port
  5. NACLs: if used, verify both directions (ephemeral ports)
  6. App/listener: health checks, binding address, target port

Default architecture checklist

  • Private subnets for app and DB tiers
  • Public subnets only for ingress components (LB, NAT)
  • Two route tables: public (IGW), private (NAT)
  • Layered security groups: LB → app → DB
  • Private DNS names for internal services
  • Flow logs enabled (with reasonable retention)

Quick reference: what each component is for

  • VPC/VNet → private network boundary + CIDR space. Typical mistake: overlapping CIDRs. Fix: plan CIDRs upfront; document allocations.
  • Subnet → placement + routing domain. Typical mistake: wrong route table attached. Fix: audit associations; standardize naming.
  • Route table → where traffic can go. Typical mistake: missing default route (NAT/IGW). Fix: add correct routes per subnet type.
  • Internet Gateway → public inbound/outbound capability. Typical mistake: accidental exposure via public routing. Fix: keep only the ingress tier in public subnets.
  • NAT → private subnet outbound internet. Typical mistake: expecting inbound to work via NAT. Fix: use LB/VPN/bastion for inbound admin.
  • Security group → workload allowlist (stateful). Typical mistake: wide inbound rules. Fix: least privilege; reference SGs.

If you can draw it, you can debug it

When in doubt, sketch the flow: client → LB → app → DB. Then label the gates: route tables and security rules at each hop. That sketch becomes your runbook.

Wrap-up

Cloud networking basics aren’t about memorizing provider-specific names — they’re about understanding the path. Once you internalize DNS → routing → security groups/NACLs → listener, most connectivity issues become quick, mechanical fixes. Keep your network simple, private by default, and consistent across environments, and troubleshooting becomes a checklist instead of a mystery.

Next actions

  • Audit your subnets: which are public, which are private, and which route tables they use
  • Review security groups: ensure tiered access (LB → app → DB) and remove broad inbound rules
  • Enable flow logs (and set retention) so timeouts aren’t blind spots
  • Create a short incident runbook using the cheatsheet debugging order

Want to connect networking to the rest of your platform? The related posts cover Terraform structure, cost drivers like NAT/egress, and Kubernetes fundamentals that show up in real environments.

Quiz

Quick self-check:

1) A request to an internal service times out (hangs). What’s the most likely class of problem?
2) What makes a subnet “public” in practice?
3) What is NAT primarily used for in cloud networks?
4) Which statement about security groups is typically true?