Cyber security · Blue Team

SOC Skills for Developers: Detection Thinking in Plain English

Understand how defenders catch attacks—and code to help them.

Reading time: ~8–12 min
Level: All levels

SOC skills for developers start with one habit: detection thinking. It’s the ability to turn “this could happen” into “here’s what we would see, where we would see it, and how we’d respond.” You don’t need to be a full-time analyst to help—if you can reason about systems, logs, and edge cases, you’re already halfway there.


Quickstart

Want the fastest path to being useful to a SOC (or just improving your own app’s security)? Do these in order. Each step is small, but together they turn “security vibes” into concrete, testable detections.

1) Pick one abuse story (a real one)

Don’t start with tools. Start with a scenario: “How would someone misuse this system?” Examples: credential stuffing, token theft, suspicious PowerShell, lateral movement, data exfil via unusual endpoints.

  • Write the attacker goal in one sentence
  • List 2–3 “steps” they’d likely take
  • Decide what you want to catch: early, mid, or late

2) Identify the evidence (telemetry)

Detection is evidence-based. If you can’t name the events, you can’t reliably alert. Choose the smallest set of logs you need to make a decision.

  • Which system emits the signal? (app, OS, identity, network)
  • Which fields matter? (user, IP, host, process, route)
  • What’s the time window? (1 min, 10 min, 24 hours)

3) Write one “boring” baseline

Most false positives come from not knowing what normal looks like. Baselines can be simple: “per user per hour,” “per host per day,” “per route per minute.”

  • Measure typical volume (counts) by user/host/service
  • Record known maintenance windows and batch jobs
  • List 5 normal reasons the event might happen

4) Ship the detection with a triage note

A detection without triage guidance becomes noise. Add just enough context so the on-call person can decide quickly.

  • Severity + rationale (why this is risky)
  • Top 3 questions to ask (what to check next)
  • Escalation path (who owns the affected system)

A developer-friendly win

Add structured security events in your app (JSON logs) for auth, permission changes, and sensitive actions. Most SOC tooling gets dramatically better when the app logs are consistent and rich.
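As a sketch (the event and field names are illustrative, not a standard), a tiny helper like this is enough to keep security events consistent:

```python
import json
from datetime import datetime, timezone

def log_security_event(event: str, **fields) -> str:
    """Emit one structured security event as a JSON line.

    Event and field names here are illustrative; the point is stable
    names and predictable fields, not this exact schema.
    """
    record = {
        "ts": datetime.now(timezone.utc).isoformat().replace("+00:00", "Z"),
        "event": event,   # stable name, e.g. "auth_failed", "role_changed"
        **fields,         # user, ip, session_id, resource, outcome, ...
    }
    line = json.dumps(record, sort_keys=True)
    print(line)
    return line

# Example: an authentication failure with the IDs a SOC needs to correlate
log_security_event("auth_failed", user="alice", ip="203.0.113.7",
                   session_id="s-123", reason="bad_password")
```

The design choice that matters is the stable `event` name: detections can key on it without brittle string matching.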

Overview

A Security Operations Center (SOC) lives in the messy middle between “we have logs” and “we stopped an incident.” The core skill is not memorizing attacks—it’s detection thinking: turning messy telemetry into a small set of reliable signals that trigger the right response.

What you’ll learn

  • What “detections” actually are (and what they’re not)
  • How to design signals that survive real-world noise
  • How to write a rule/query that’s testable and maintainable
  • How developers can improve detection by instrumenting systems

What you’ll be able to do

  • Pick a scenario and map it to concrete evidence
  • Define a baseline + thresholds without guesswork
  • Ship a detection with context and a mini-runbook
  • Reduce false positives with simple tuning patterns

Plain-English definition

A detection is a hypothesis (“this behavior could be malicious”) plus evidence (specific logs/telemetry) plus a decision rule (how much evidence is enough to alert).

If you’ve ever debugged a flaky service, you already have transferable skills: you form hypotheses, collect signals, narrow scope, and decide what action to take. SOC work is similar—just with adversaries and higher consequences.

Core concepts

Let’s build a shared vocabulary. The goal is not jargon—it's clarity. When teams talk past each other (“alert” vs “incident” vs “finding”), detections get noisy and nobody trusts them.

SOC terms, translated for developers

  • Telemetry: the raw events you can observe (logs, traces, network flows, EDR). Dev analogy: metrics + logs + traces in production.
  • Detection: a rule/query/analytic that flags risky behavior. Dev analogy: a test that fails when an invariant is broken.
  • Alert: the notification created when a detection triggers. Dev analogy: an on-call page.
  • Triage: the fast decision to ignore, monitor, investigate, or escalate. Dev analogy: bug triage + incident response.
  • False positive: the alert fired, but nothing bad happened. Dev analogy: a flaky test.
  • False negative: something bad happened, but you didn’t alert. Dev analogy: a missing test (a coverage gap).
  • Runbook: steps to verify and respond. Dev analogy: a playbook, SOP, or “what to do at 3am.”

1) Signal vs noise (and why “more alerts” is worse)

A SOC’s most limited resource is attention. If detections are noisy, analysts develop “alert fatigue” and start ignoring them. Good detection thinking aims for high signal: alerts that are rare, explainable, and actionable.

High-signal alerts usually have

  • A clear behavior (not just a single indicator)
  • Context (who/what/where/when)
  • A bounded time window
  • A next step (“check X, then do Y”)

Noisy alerts usually have

  • Vague patterns (“any admin action”)
  • No baseline (“how often is normal?”)
  • No ownership (“who fixes this?”)
  • Missing entity identifiers (no user/host/service)

2) Behavior > indicators (most of the time)

Indicators of compromise (IoCs) like IPs, hashes, or domains can be useful, but they expire fast. Behavioral detections focus on what happened (e.g., unusual authentication patterns, suspicious process trees), which tends to remain relevant longer.

The “one string match” trap

Detections that trigger on a single keyword or one-off IoC are easy to bypass and often produce false positives. Prefer combinations: behavior + context + threshold.

3) Baselines: “normal” is a feature

Baselines don’t have to be fancy. For many detections, a simple per-entity baseline is enough: “per user,” “per host,” “per API key,” “per service account.” This is how you avoid flagging batch jobs, legitimate scanners, or high-volume users.

A practical baseline recipe

  • Group by an entity (user/host/service)
  • Measure normal volume over a time window
  • Choose a threshold that’s rare in “normal” data
  • Review the first week of alerts and tune with evidence
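The recipe above can be sketched in a few lines. This is a deliberately crude starting point (max observed count plus one, from a window you believe was clean), not a production method:

```python
from collections import Counter

def pick_threshold(baseline_events, entity_key="user"):
    """Pick the smallest per-entity count that no entity reached in a known-clean window.

    baseline_events: parsed events from a period you believe was normal
    (one hour, one day, whatever window the detection uses).
    Max-plus-one is crude but explainable; quantiles can come later.
    """
    counts = Counter(e[entity_key] for e in baseline_events)
    return max(counts.values(), default=0) + 1

# Example: 99 quiet users plus one chatty batch account in the baseline.
# Because the batch job is part of "normal", it does not trip the threshold.
baseline = [{"user": f"u{i}"} for i in range(99)] + [{"user": "batch"}] * 50
threshold = pick_threshold(baseline)  # 51: only counts above observed normal alert
```

Reviewing the first week of alerts then becomes a concrete question: which entities exceeded the threshold, and were they suspicious or just new normal?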

4) Detection quality metrics that matter

You don’t need perfect math to improve detection quality. You need feedback loops: how often you page people, how often it’s real, and how quickly you can close an alert.

  • Precision (actionable rate): of alerts fired, how many mattered? Developers help by adding richer app logs and reducing ambiguous events.
  • MTTA/MTTR: how fast you acknowledge and resolve alerts. Developers help by adding ownership, runbooks, and reliable IDs.
  • Coverage: which attack paths you can’t see. Developers help by instrumenting auth, admin, and data-access actions.
  • Noise budget: how many alerts your team can handle. Developers help by batching similar findings and adding thresholds and allowlists.

The simplest mental model

Think of detections like unit tests for security invariants: if you can’t explain the invariant, you can’t test it. If you can’t reproduce the signal, you can’t trust it.
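The analogy can be made literal. A minimal sketch, with illustrative thresholds:

```python
def detect_failed_burst(fail_count: int, window_minutes: int,
                        min_fails: int = 10, max_window: int = 10) -> bool:
    """Invariant: no identity should rack up min_fails failures within max_window minutes.

    The thresholds are illustrative; tune them against your own baseline.
    """
    return fail_count >= min_fails and window_minutes <= max_window

# Unit tests for the invariant, exactly like tests for app code
assert detect_failed_burst(fail_count=12, window_minutes=5)       # burst: fires
assert not detect_failed_burst(fail_count=12, window_minutes=60)  # slow trickle: quiet
assert not detect_failed_burst(fail_count=3, window_minutes=5)    # normal typos: quiet
```

If you can state the invariant this plainly, you can test it; if you can't, the detection isn't ready.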

Step-by-step

Here’s a practical, repeatable workflow for building detections that don’t crumble in production. The structure is intentionally “developer-shaped”: define a requirement, specify the inputs, implement, test, iterate.

Step 1 — Write the scenario as a timeline

Choose one scenario and write the attacker’s steps. Keep it simple. Your goal is to identify observable events, not write a novel.

Example timeline: suspicious PowerShell execution

  • Attacker gains initial access (phish, exploit, stolen creds)
  • They run PowerShell with encoded commands to avoid easy visibility
  • They download a second-stage payload
  • They persist or move laterally

Step 2 — Decide what evidence you need (and where it lives)

Make a tiny “data contract” for the detection: which events and fields are required. This is where developers shine—because you can improve the system to emit the right signals.

Evidence sources (common)

  • Identity: logins, MFA prompts, token creation
  • OS: process start, service creation, scheduled tasks
  • Network: DNS, proxy, outbound connections
  • App: privileged actions, data exports, role changes

Minimum fields that unlock correlation

  • Timestamp (with timezone)
  • Actor identity (user/service account)
  • Host/workload identity (hostname, pod, instance)
  • Request or process context (command line, route, action)
  • Network context (source IP, user agent, destination)
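As an illustration (the field names are hypothetical, not a schema standard), an event carrying these minimum fields might look like:

```python
# Hypothetical event illustrating the minimum correlation fields above.
event = {
    "ts": "2026-01-09T12:34:56Z",      # timestamp with timezone
    "user": "svc-deploy",              # actor identity
    "host": "web-03",                  # host/workload identity
    "action": "role_changed",          # request or process context
    "command_line": None,              # populate for OS-level events
    "src_ip": "198.51.100.4",          # network context
    "user_agent": "internal-cli/2.1",
}

REQUIRED = {"ts", "user", "host", "action", "src_ip"}

def has_correlation_fields(e: dict) -> bool:
    """True when an event carries enough identifiers to join across log sources."""
    return REQUIRED.issubset(k for k, v in e.items() if v is not None)
```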

Developer superpower: make telemetry boring

Security teams love consistent, structured logs more than clever rules. If your events have stable names and predictable fields, detection logic becomes simpler and more reliable.

Step 3 — Implement the rule (start simple, then add context)

Start with the smallest rule that captures the behavior. Then add guardrails: thresholds, allowlists for known-good automation, and context that makes triage faster.

Code example 1 — A Sigma-style rule (portable idea)

Sigma is a generic detection rule format. Even if you don’t use Sigma directly, the structure is useful: define the log source, match the behavior, add a clear condition, and document it.

title: Suspicious PowerShell EncodedCommand
id: 2c85b6d2-6f6d-4e8b-9d6d-3c1f0a7e5b42
status: experimental
description: Detects PowerShell launched with -EncodedCommand, often used to obfuscate payloads
logsource:
  category: process_creation
  product: windows
detection:
  selection:
    Image|endswith:
      - '\powershell.exe'
      - '\pwsh.exe'
    CommandLine|contains:
      - ' -enc '
      - ' -encodedcommand '
  condition: selection
falsepositives:
  - Legitimate admin scripts that use encoded commands
level: medium
tags:
  - attack.execution
  - attack.t1059.001

Tuning tip: if this fires too often in your environment, add an allowlist for known management tools, or require additional context (network download, unusual parent process, rare host/user combination).

Step 4 — Add a baseline and thresholds (so it survives reality)

Many real threats look like “a lot of small things,” not one dramatic event. This is where thresholds, time windows, and per-entity baselines shine. A classic example is suspicious authentication: many failures then a success.

Code example 2 — Query pattern: many failures then a success

This query shape works across tools: group by user/IP, count failures in a window, and join to a later success. In a Microsoft-style environment this is often written in KQL.

// KQL pattern (table names are illustrative; adapt them to your own schema)
// Goal: detect a burst of failed logins followed by a success (possible credential stuffing or brute force)
let window = 10m;
let minFails = 10;
FailedLogons
| where TimeGenerated > ago(24h)
| summarize FailCount=count(), FirstFail=min(TimeGenerated), LastFail=max(TimeGenerated)
    by UserPrincipalName, IPAddress
| where FailCount >= minFails and (LastFail - FirstFail) <= window
| join kind=inner (
    SuccessfulLogons
    | where TimeGenerated > ago(24h)
    | project UserPrincipalName, IPAddress, SuccessTime=TimeGenerated
) on UserPrincipalName, IPAddress
| where SuccessTime between (LastFail .. LastFail + 15m)
| project SuccessTime, UserPrincipalName, IPAddress, FailCount, FirstFail, LastFail

Tuning tip: make the numbers realistic for your environment. Raise or lower minFails, and consider excluding known VPN egress IPs, corporate proxies, or verified password managers to reduce false positives.

Step 5 — Ship it with a mini-runbook (don’t make the SOC guess)

A “good” detection is not just a match. It’s an alert that helps someone decide quickly. Include the answers to: Is this risky? and What should I do next?

What to include in the alert

  • Who: user/service account + role
  • Where: host/workload + source IP + geo (if available)
  • What: action/command/route + key parameters
  • When: timestamps + time window
  • Why: one sentence risk statement

Mini-runbook: first 3 checks

  • Was the actor expected? (on-call rotation, admin task, deploy)
  • Do we see related events? (token creation, new device, new process)
  • Can we contain safely? (disable account, isolate host, revoke token)
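The payload fields and runbook checks above can be assembled in one place. A sketch with illustrative field names; adapt them to your alerting pipeline:

```python
import json

def build_alert(finding: dict, owner: dict) -> str:
    """Assemble an alert that answers who/where/what/when/why plus next steps.

    Field names are illustrative; the structure is the point.
    """
    return json.dumps({
        "who": {"user": finding["user"], "role": finding.get("role", "unknown")},
        "where": {"host": finding.get("host"), "src_ip": finding.get("ip")},
        "what": finding["action"],
        "when": finding["ts"],
        "why": finding["risk_statement"],
        "severity": finding.get("severity", "medium"),
        "owner": owner,
        "runbook": [
            "Was the actor expected? (on-call rotation, admin task, deploy)",
            "Any related events? (token creation, new device, new process)",
            "Can we contain safely? (disable account, isolate host, revoke token)",
        ],
    })
```

Embedding the runbook in the payload means the on-call person never has to hunt for a wiki page at 3am.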

Step 6 — Test and tune like software

Treat detections like code: version them, test them, and track changes. Your detection should be able to answer: “Why did this alert fire?” and “What changed since last week?”

Code example 3 — Lightweight log enrichment for better triage

Even small enrichment can dramatically improve triage. This example reads JSON log lines, tags suspicious auth bursts, and attaches ownership metadata (team/service) from a local mapping.

import json
from collections import defaultdict
from datetime import datetime, timedelta, timezone

# Example enrichment map (in reality: load from CMDB, service catalog, or config repo)
SERVICE_OWNER = {
    "payments-api": {"team": "Payments", "oncall": "payments-oncall"},
    "admin-portal": {"team": "Platform", "oncall": "platform-oncall"},
}

def parse_ts(ts: str) -> datetime:
    # Expect ISO-8601 like "2026-01-09T12:34:56Z"
    if ts.endswith("Z"):
        ts = ts[:-1] + "+00:00"
    return datetime.fromisoformat(ts).astimezone(timezone.utc)

def main(path: str) -> None:
    # Inputs: JSONL where each line contains:
    # { "ts": "...", "event": "auth_failed|auth_success", "user": "...", "ip": "...", "service": "..." }
    failures = defaultdict(list)

    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            e = json.loads(line)
            ts = parse_ts(e["ts"])
            key = (e.get("user"), e.get("ip"), e.get("service"))

            if e.get("event") == "auth_failed":
                failures[key].append(ts)

            if e.get("event") == "auth_success":
                # Look back 10 minutes for a burst of failures
                window_start = ts - timedelta(minutes=10)
                recent = [t for t in failures.get(key, []) if t >= window_start]
                if len(recent) >= 10:
                    owner = SERVICE_OWNER.get(e.get("service"), {"team": "unknown", "oncall": "unknown"})
                    finding = {
                        "type": "suspicious_auth_burst_then_success",
                        "user": e.get("user"),
                        "ip": e.get("ip"),
                        "service": e.get("service"),
                        "fail_count_10m": len(recent),
                        "success_ts": ts.isoformat().replace("+00:00", "Z"),
                        "owner_team": owner["team"],
                        "owner_oncall": owner["oncall"],
                    }
                    print(json.dumps(finding))

if __name__ == "__main__":
    # Usage: python enrich_findings.py auth_events.jsonl
    import sys
    if len(sys.argv) != 2:
        raise SystemExit("Usage: python enrich_findings.py <path-to-jsonl>")
    main(sys.argv[1])

The point isn’t the script—it’s the pattern: attach ownership and context early so triage is fast and consistent.

What “done” looks like

A detection is production-ready when it has a stable data contract, a baseline/threshold rationale, an owner, and a mini-runbook. Fancy scoring can come later.

Common mistakes

Most “bad detections” aren’t bad because the author is inexperienced. They’re bad because the system and the process don’t support reliable signals. Use these pitfalls as a debugging checklist.

Mistake 1 — Alerting on a single event with no context

“PowerShell ran” or “admin action happened” is rarely enough. You’ll page people for normal work.

  • Fix: add baseline + threshold + time window.
  • Fix: enrich with user role, host criticality, and rare combinations.

Mistake 2 — No clear “what to do next”

If an alert doesn’t reduce uncertainty, it’s just a notification.

  • Fix: add a mini-runbook: 3 checks + escalation owner.
  • Fix: include the key fields in the alert payload.

Mistake 3 — Ignoring “normal” automation

CI jobs, scanners, and maintenance tasks will dominate your alerts if you don’t model them explicitly.

  • Fix: maintain allowlists with owners and expiration dates.
  • Fix: prefer per-entity baselines (service accounts behave differently).
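A minimal sketch of an allowlist with owners and expiration dates (the format is illustrative):

```python
from datetime import date

# Illustrative allowlist format: every entry names an owner and an expiry,
# so temporary exceptions cannot silently become permanent.
ALLOWLIST = [
    {"entity": "ci-runner-01", "owner": "platform", "expires": date(2026, 6, 30)},
    {"entity": "vuln-scanner", "owner": "secops", "expires": date(2026, 3, 31)},
]

def is_allowlisted(entity: str, today: date) -> bool:
    """Suppress alerts only for unexpired entries; expired ones fire again."""
    return any(e["entity"] == entity and e["expires"] >= today
               for e in ALLOWLIST)
```

The expiry check is the important part: an expired entry starts alerting again, which forces someone to re-justify the exception.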

Mistake 4 — Shipping detections without versioning

If you can’t explain what changed, you can’t tune reliably. “It got noisy” becomes a mystery.

  • Fix: version rules/queries like code (PRs, reviews, changelogs).
  • Fix: track outcomes: true positive, benign, needs follow-up.

Mistake 5 — Using “severity” as a vibe

If everything is high severity, nothing is. Severity should reflect impact and confidence.

  • Fix: define severity with a table: impact × confidence.
  • Fix: create a “low-sev, high-volume” lane for trends.
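One way to express the impact × confidence table as code so it can be reviewed and versioned (the mappings are illustrative; agree on your own):

```python
# One possible impact x confidence matrix; the mappings are illustrative.
SEVERITY = {
    ("high", "high"): "critical", ("high", "medium"): "high",     ("high", "low"): "medium",
    ("medium", "high"): "high",   ("medium", "medium"): "medium", ("medium", "low"): "low",
    ("low", "high"): "medium",    ("low", "medium"): "low",       ("low", "low"): "low",
}

def severity(impact: str, confidence: str) -> str:
    """Look up severity from the agreed matrix instead of guessing per alert."""
    return SEVERITY[(impact, confidence)]
```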

Mistake 6 — Logging secrets while “adding visibility”

More logs aren’t always safer. Sensitive fields can create compliance and breach risk.

  • Fix: redact tokens, passwords, and sensitive payloads.
  • Fix: log identifiers and outcomes, not raw secrets.
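A minimal redaction sketch (the patterns are illustrative and intentionally incomplete; real token formats vary):

```python
import re

# Illustrative patterns; extend for your own token and key formats.
SECRET_PATTERNS = [
    re.compile(r"(?i)(password|token|secret|api[_-]?key)\s*[=:]\s*\S+"),
    re.compile(r"\bBearer\s+[A-Za-z0-9._-]+"),
]

def redact(message: str) -> str:
    """Replace secret-looking values before the line reaches any log sink."""
    for pattern in SECRET_PATTERNS:
        message = pattern.sub("[REDACTED]", message)
    return message
```

Run this as close to the log call as possible; redacting downstream means the secret already traveled through your pipeline.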

A harsh truth (that helps)

If your telemetry is inconsistent, detection engineering becomes guesswork. Invest in data quality (schemas, stable event names, required fields) and your detections will improve faster than any “new tool.”

FAQ

What are the most useful SOC skills for developers to learn first?

Start with detection thinking: mapping scenarios to evidence, and shipping detections with context. Practically, that means: knowing your logs, understanding baselines, writing simple queries, and documenting triage steps. You don’t need to memorize every attack technique to be helpful.

Which app events should I log to make detection easier?

Log the security-relevant decisions your app already makes. Focus on: authentication outcomes, MFA events, permission/role changes, sensitive data access, configuration changes, and high-risk admin actions. Use structured JSON with stable event names and include IDs (user, session, request, actor, resource).

  • Good: “user X exported report Y from IP Z”
  • Avoid: raw tokens/passwords, full request bodies, unbounded PII dumps

How do I reduce false positives without missing real attacks?

Don’t “turn off” detections blindly. Reduce noise by adding context and tightening the decision rule: use per-entity baselines, time windows, and simple allowlists for known automation. Then review the first week of alerts and tune with evidence (what was normal vs suspicious).

Do I need machine learning to do good detection?

No. Most high-value detections are still built from rules and baselines because they’re explainable and fast to tune. ML can help in specific cases (anomaly detection, clustering, ranking), but it’s not a shortcut. If the underlying telemetry is messy, ML will learn your mess.

What’s the difference between a detection and an incident?

A detection is a signal (a rule/query/analytic) that suggests risk. An incident is a confirmed security event that needs response coordination. Good detections help you decide quickly whether something is an incident.

How can I test detections safely?

Test in layers: first on historical logs, then in a staging environment, and only then in production with guardrails. Use test accounts, restrict blast radius, and document expected signals. Treat tests like change management: predictable, reversible, and observable.
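Testing on historical logs can be as simple as replaying JSONL through the detection function. A sketch, with a hypothetical detection:

```python
import json

def replay(detection, jsonl_lines):
    """Run a detection function over historical events and collect findings.

    `detection` takes one parsed event and returns a finding dict or None.
    """
    findings = []
    for line in jsonl_lines:
        finding = detection(json.loads(line))
        if finding:
            findings.append(finding)
    return findings

# Example: a trivial hypothetical detection replayed against two historical events
def flag_admin_export(e):
    if e.get("event") == "data_export" and e.get("role") == "admin":
        return {"type": "admin_export", "user": e.get("user")}

history = [
    '{"event": "data_export", "role": "admin", "user": "alice"}',
    '{"event": "login", "user": "bob"}',
]
findings = replay(flag_admin_export, history)  # one finding, for alice
```

Comparing finding counts before and after a rule change is the cheapest regression test a detection can have.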

Cheatsheet

Scan this when you’re writing or reviewing a detection. If you can’t check most boxes, the alert will probably be noisy.

Detection design checklist

  • Scenario described as steps (timeline)
  • Telemetry sources identified (where the evidence lives)
  • Required fields listed (data contract)
  • Time window defined (e.g., 10 minutes)
  • Baseline chosen (per user/host/service)
  • Threshold rationale written (why this number?)
  • Known-good automation considered (allowlists)

Alert payload & triage checklist

  • Who/what/where/when included in the alert
  • Severity = impact × confidence (not a guess)
  • Owner/team included (who can fix/verify)
  • Mini-runbook: first 3 checks
  • Clear escalation path
  • Links to relevant dashboards/log views (if available)

Tuning checklist (first week)

  • Review every alert outcome for 3–7 days
  • Tag outcomes: benign / suspicious / confirmed
  • Identify top false-positive causes
  • Add allowlists with owners + expiration
  • Adjust thresholds based on observed baselines
  • Document what changed and why

Developer logging checklist

  • Use structured JSON logs for security events
  • Stable event names (no ad-hoc strings)
  • Include IDs: user, session, request, resource
  • Capture outcomes (success/failure) with reasons
  • Redact secrets and sensitive payloads
  • Version schemas when fields change

Cheat code for signal

If you must choose one improvement: add context (who/where/ownership) and add a baseline. Those two changes usually cut noise more than any advanced technique.

Wrap-up

Detection thinking is a practical skill: turn scenarios into evidence, turn evidence into rules, and ship rules with context. The biggest unlock for SOC skills for developers is realizing you can improve detection quality by improving the system: better logs, stable schemas, clear ownership, and runbooks that reduce decision time.

Your next 30 minutes

  • Pick one scenario from Quickstart
  • Write the timeline and list required fields
  • Check if your app actually emits those fields
  • Draft a baseline + threshold and note how you’ll tune it
  • Add a mini-runbook so someone else can triage it

If you’re building a product

Treat detection thinking as part of “definition of done” for risky features (auth, admin, data export). Shipping a feature without observability is like shipping code without tests.

Want more? Jump to the Cheatsheet when you’re implementing, and use the Quiz as a quick self-check.

Quiz

Quick self-check.

1) In plain English, what is “detection thinking”?
2) Which change most reliably reduces false positives for many detections?
3) Why are structured, consistent app logs valuable for a SOC?
4) What should a good alert include beyond “it matched a rule”?