I Built an API to Troubleshoot HTTP Status Codes


The Foundation of Everything We Do

APIs are everywhere. Every time you check your phone, stream a video, or order food, dozens of APIs are working behind the scenes. Modern software isn't built in isolation anymore—it's built by connecting services through APIs.

Your mobile app talks to a backend API. That backend talks to a payment API, a shipping API, an inventory API. Those APIs talk to other APIs. It's APIs all the way down.

This isn't just a technical curiosity. This is how the internet works now.

And when one of these APIs returns the wrong status code? When it says "500 Internal Server Error" instead of "409 Conflict"? Someone somewhere wastes hours troubleshooting the wrong thing.

The Disconnect

As a Support Engineering Manager at an API-first SaaS company, I help customers troubleshoot integrations daily. My team handles API authentication failures, webhook delivery issues, rate limiting problems, data validation errors—you name it.

I can read stack traces, trace distributed requests, analyze webhook failures. I tell my team: "Check if it's 4xx or 5xx" and they know what that means.

But here's what I realized: I was pattern-matching more than truly understanding.

"It's a 409." Okay, that's a conflict. Why 409 and not 400?

"It's a 422." Unprocessable entity. What makes something unprocessable versus just bad?

"It's a 500." Internal server error. But what if the customer sent data that crashed our system—is that their fault (4xx) or ours (5xx)?

I could troubleshoot these issues. I could guide my team through triage. But somewhere deep down, I realized I was operating on heuristics and experience rather than true understanding.

Why does a duplicate SKU return 409 and not 400?

What's the actual difference between 500 and 503?

When should validation return 422 versus 400?

I thought I knew. I sort of knew. But I couldn't explain it with confidence.

So I decided to build it myself.

Enter: The Product Catalog API

The project started simple: build a REST API from scratch using FastAPI and Test-Driven Development.

The specs were modest:

  • Create, read, update, delete products
  • Handle common validation scenarios
  • Implement rate limiting
  • Return proper status codes for every situation

But the learning? That's been immense.

Because here's what happens when you write tests first: You have to decide what the right behavior is before you implement it.

python

def test_create_duplicate_sku_returns_409(client):
    # Create a product
    client.post('/products', json={"sku": "WIDGET-001", ...})
    
    # Try to create the same SKU again
    response = client.post('/products', json={"sku": "WIDGET-001", ...})
    
    # What should happen? 400? 409? 422?
    assert response.status_code == ???  # You have to choose

Suddenly, I couldn't hide behind "well, it depends" or "let me check the docs." I had to commit. Is this 409 or not?

Every test forced me to understand not just what status codes mean, but why we have different ones and when to use each.

You can follow along or check out the code here: GitHub - task-api

This post is what I learned along the way.


The Rule I Thought I Understood

Here's the simple rule everyone knows:

4xx = The CLIENT did something wrong
5xx = The SERVER did something wrong

Easy, right?

But when you actually implement this in code, the edge cases appear:

Scenario 1: A customer sends a price of -10.00 (negative). My API crashes because I didn't validate it. Is that 4xx (their bad data) or 5xx (my crash)?

Scenario 2: A customer tries to create a product that already exists. Is that 400 (bad request), 409 (conflict), or 422 (validation error)?

Scenario 3: My database connection pool is exhausted. The customer's request was perfect. Is that 500 (generic error) or 503 (service unavailable)?

Reading the HTTP spec didn't help. The definitions are precise but abstract. I needed to feel the difference by implementing it.

Let me show you what I learned, scenario by scenario.


Learning Through Building: The 4xx Scenarios

Every 4xx code I implemented taught me something new about what "client error" actually means.

404 Not Found: The Straightforward One

The test I wrote:

python

def test_get_nonexistent_product_returns_404(self, client):
    """GET should return 404 when product doesn't exist"""
    response = client.get('/products/DOES-NOT-EXIST')
    
    assert response.status_code == 404
    assert 'not found' in response.json()['detail'].lower()

What I learned:

404 is the easiest to understand: "We understood your request perfectly, but the resource you're asking for doesn't exist."

It's not our fault. It's not their fault. The SKU just isn't in our system.

The implementation:

python

@app.get('/products/{sku}')
def get_product(sku: str):
    if sku not in products_db:
        raise HTTPException(
            status_code=404,
            detail=f"Product with SKU '{sku}' not found"
        )
    return products_db[sku]

For support engineers: When you see 404, ask the customer: "Can you verify that SKU exists in our system?" Nine times out of ten, they have a typo or are looking at the wrong environment.


409 Conflict: The "Aha" Moment

This is where things got interesting.

The scenario: Customer tries to create a product with a SKU that already exists.

The test:

python

def test_create_duplicate_sku_returns_409(self, client):
    # Create first product
    client.post('/products', json={
        "sku": "WIDGET-001",
        "name": "Blue Widget",
        "price": 29.99
    })
    
    # Try to create same SKU again
    response = client.post('/products', json={
        "sku": "WIDGET-001",
        "name": "Red Widget",
        "price": 19.99
    })
    
    assert response.status_code == 409
    assert 'already exists' in response.json()['detail'].lower()

The question I had to answer: Why not 400 (Bad Request)?

Here's what clicked for me:

400 = "Your request is malformed or unclear"

  • Invalid JSON syntax
  • Missing required Content-Type header
  • Garbage data we can't even parse

409 = "Your request is perfectly clear, but it conflicts with the current state"

  • The SKU already exists
  • The format is correct, the data is valid, but it creates a conflict

The implementation:

python

@app.post('/products')
def create_product(product: Product):
    if product.sku in products_db:
        raise HTTPException(
            status_code=409,
            detail=f"Product with SKU '{product.sku}' already exists"
        )
    
    # Create new product...

For support engineers: When you see 409, the request clashes with the current state of the data. In this API, that means the customer is trying to create something that already exists. Guide them to either:

  1. Use a different identifier (SKU, email, username, etc.)
  2. Update the existing resource instead of creating a new one
  3. Check their data source for duplicates

The key insight: Their request was valid but conflicted with existing data. That distinction matters.


422 Unprocessable Entity: The Confusing One

This was the status code I'd been using wrong for years.

The scenario: Customer sends a negative price.

The test:

python

def test_create_product_with_negative_price_returns_422(self, client):
    response = client.post('/products', json={
        "sku": "WIDGET-002",
        "name": "Cheap Widget",
        "price": -10.00,  # Invalid!
        "description": "This shouldn't work"
    })
    
    assert response.status_code == 422

The confusion: Isn't this the same as 400? Both are "bad data," right?

What building it taught me:

| Status Code | Meaning | Example |
| --- | --- | --- |
| 400 Bad Request | Request format is wrong | Invalid JSON, missing Content-Type header |
| 422 Unprocessable Entity | Request format is correct, but values violate business rules | Negative price, invalid email format, string too long |

The distinction:

  • 400 = "We can't even parse your request"
  • 422 = "We parsed it fine, but the values don't make sense"
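
One way to picture how 400, 422, and 409 relate is as an ordered series of checks: parse first, validate second, compare against existing state last. Here's a minimal sketch of that ordering using only the standard library rather than FastAPI. The function name `classify_create_request` and the `existing_skus` set are hypothetical, and returning 201 on success is an assumption (the post doesn't say what a successful create returns).

```python
import json


def classify_create_request(raw_body: str, existing_skus: set) -> int:
    """Return the status code a create-product request would earn."""
    # Step 1: can we parse the body at all? If not, the request is malformed.
    try:
        payload = json.loads(raw_body)
    except json.JSONDecodeError:
        return 400

    # Step 2: it parsed, but do the values make sense?
    if not isinstance(payload, dict):
        return 422
    sku = payload.get("sku")
    price = payload.get("price")
    if not isinstance(sku, str) or not sku:
        return 422
    # bool is a subclass of int in Python, so rule it out explicitly.
    if isinstance(price, bool) or not isinstance(price, (int, float)) or price <= 0:
        return 422

    # Step 3: the request is valid, but does it clash with current state?
    if sku in existing_skus:
        return 409

    return 201  # Assumed success code for a create


print(classify_create_request('{"sku": ', set()))                      # malformed JSON
print(classify_create_request('{"sku": "A", "price": -10}', set()))    # bad value
print(classify_create_request('{"sku": "A", "price": 5}', {"A"}))      # duplicate SKU
```

The order matters: a request can only conflict with existing data once it has been parsed and validated, which is why a duplicate SKU with a valid payload lands on 409 rather than 400 or 422.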

Here's where FastAPI really shined. I could define validation rules declaratively:

python

from pydantic import BaseModel, Field

class Product(BaseModel):
    sku: str
    name: str
    price: float = Field(gt=0, description="Price must be greater than 0")
    description: str

FastAPI automatically returns 422 when validation fails. I didn't have to think about it!

The actual error response (the exact msg and type strings depend on the Pydantic version you have installed):

json

{
  "detail": [
    {
      "loc": ["body", "price"],
      "msg": "ensure this value is greater than 0",
      "type": "value_error.number.not_gt"
    }
  ]
}

For support engineers: When you see 422, look at the detail array. It tells you:

  • Which field failed (loc)
  • Why it failed (msg)
  • What type of validation error (type)

This is gold for troubleshooting. You can tell the customer exactly which field to fix and why.
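
If your team reads a lot of these, a small helper can flatten the detail array into one line per failed field. This is a sketch, assuming the response body has the shape shown above; `summarize_validation_errors` is a hypothetical name, not something FastAPI provides.

```python
def summarize_validation_errors(body: dict) -> list:
    """Turn a 422 response body into one readable line per failed field."""
    detail = body.get("detail")
    # Non-validation errors carry a plain string instead of a list.
    if not isinstance(detail, list):
        return [str(detail)]

    lines = []
    for err in detail:
        # loc is a path such as ["body", "price"]; join it for display.
        location = ".".join(str(part) for part in err.get("loc", []))
        msg = err.get("msg", "unknown error")
        kind = err.get("type", "unknown")
        lines.append(f"{location}: {msg} ({kind})")
    return lines


example = {
    "detail": [
        {
            "loc": ["body", "price"],
            "msg": "ensure this value is greater than 0",
            "type": "value_error.number.not_gt",
        }
    ]
}

for line in summarize_validation_errors(example):
    print(line)
```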

The key insight: 422 means "your data format is correct, but the values violate our business rules." It's specific, actionable, and tells you exactly what's wrong.


429 Too Many Requests: The Rate Limiter

This one forced me to think about system protection versus user fault.

The scenario: Customer is hammering the API too fast.

The test:

python

def test_too_many_requests_returns_429(self, client):
    # Make 5 requests rapidly (our limit)
    for i in range(5):
        response = client.get(f'/products/TEST-{i}')
        assert response.status_code in [200, 404]
    
    # 6th request should be rate limited
    response = client.get('/products/TEST-6')
    assert response.status_code == 429

The question: Is this really a client error (4xx)? Isn't rate limiting about protecting our infrastructure?

What I learned:

Yes, it's still 4xx because the action causing the problem is on the client side. They're sending too many requests. The fix is on their end: slow down.

The implementation was trickier than I expected:

python

from datetime import datetime, timedelta

from fastapi import HTTPException

request_timestamps = []

def rate_limit_check():
    global request_timestamps
    now = datetime.now()
    
    # Remove timestamps older than 1 second
    request_timestamps = [
        ts for ts in request_timestamps 
        if now - ts < timedelta(seconds=1)
    ]
    
    # Check if limit exceeded
    if len(request_timestamps) >= 5:
        raise HTTPException(
            status_code=429,
            detail="Rate limit exceeded: Maximum 5 requests per second"
        )
    
    # Add current request
    request_timestamps.append(now)

This is a sliding window rate limiter. It tracks timestamps and only counts requests in the last second.
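
One thing to notice about the version above: it keeps a single global list, so every caller shares the same budget. Here's a sketch of a per-client variant of the same sliding-window idea, using only the standard library. The class name `SlidingWindowLimiter`, the `client_id` key, and the injectable `clock` are all assumptions for illustration. It is in-memory and not thread-safe, so treat it as a teaching aid rather than production code.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` requests per client in any `window_seconds` span."""

    def __init__(self, limit=5, window_seconds=1.0, clock=time.monotonic):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock  # Injectable so tests can control time
        self._hits = defaultdict(deque)  # client_id -> recent timestamps

    def allow(self, client_id: str) -> bool:
        now = self.clock()
        hits = self._hits[client_id]

        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()

        # Over the limit: the caller should answer with 429.
        if len(hits) >= self.limit:
            return False

        hits.append(now)
        return True


limiter = SlidingWindowLimiter(limit=5, window_seconds=1.0)
results = [limiter.allow("customer-a") for _ in range(6)]
print(results)
```

A monotonic clock is used instead of `datetime.now()` so that system clock adjustments can't reopen or extend the window.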

For support engineers: When you see 429:

  1. The customer is hitting the API too fast
  2. Tell them to add delays between requests (e.g., 200ms)
  3. If they're doing bulk operations, suggest batching
  4. Check if they have retry logic that's creating a loop

The key insight: Rate limiting is about protecting the system, but the error is still client-side because they need to change their behavior.
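
On the customer's side, "slow down" usually means backing off between retries. Here's a sketch of how a client might pick its wait time, assuming the server may send the standard Retry-After header with a 429 (the API in this post doesn't set one). The function name `retry_delay` and its default values are hypothetical.

```python
import random
from typing import Optional


def retry_delay(attempt: int, retry_after: Optional[str] = None,
                base: float = 0.2, cap: float = 30.0,
                jitter=random.random) -> float:
    """Seconds to wait before retry number `attempt` (0-based) after a 429."""
    # If the server gave a Retry-After value in seconds, honor it.
    if retry_after is not None:
        try:
            return float(max(0, int(retry_after)))
        except ValueError:
            pass  # Retry-After can also be an HTTP date; fall back to backoff

    # Exponential backoff, capped, with jitter so clients don't retry in lockstep.
    ceiling = min(cap, base * (2 ** attempt))
    return ceiling * jitter()


for attempt in range(4):
    print(attempt, round(retry_delay(attempt), 3))
```

The cap matters for item 4 in the list above: a retry loop with no upper bound and no jitter is one of the easiest ways to turn a single 429 into a steady stream of them.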


The Big Shift: Understanding 5xx Errors

Here's where my understanding really deepened.

4xx errors are about what the client did wrong. 5xx errors are about what we did wrong (or what went wrong on our side).

But there's nuance here that I missed for years.

500 Internal Server Error: The Nightmare

The scenario: Our database connection fails.

This was harder to test because I needed to simulate a failure. Enter: mocking.

The test:

python

def test_database_connection_failure_returns_500(self, client):
    from unittest.mock import patch
    
    # Mock the database connection to fail
    with patch('app.api.get_database_connection') as mock_db:
        mock_db.side_effect = Exception("Database connection failed")
        
        response = client.get('/products/ANY-SKU')
        
        assert response.status_code == 500
        assert 'internal' in response.json()['detail'].lower()

What I learned:

The customer's request was perfect. The SKU format was correct, the HTTP method was correct, everything they did was right.

But our database failed. That's not their fault. That's a bug or infrastructure issue on our end.

The implementation required wrapping everything in try/except:

python

import logging

logger = logging.getLogger(__name__)

@app.get('/products/{sku}')
def get_product(sku: str):
    try:
        db = get_database_connection()  # Might raise exception!
        
        if sku not in db:
            raise HTTPException(404, "Product not found")
        
        return db[sku]
        
    except HTTPException:
        raise  # Re-raise intentional errors (404, etc.)
        
    except Exception:
        # Log the real error, with its traceback, internally
        logger.exception("Unexpected error while fetching product %s", sku)
        
        # Return generic message to customer
        raise HTTPException(
            status_code=500,
            detail="Internal server error: Unable to process request"
        )

The critical pattern:

  1. Catch HTTPException separately and re-raise it (these are intentional errors like 404)
  2. Catch all other Exceptions (these are bugs)
  3. Log the real error internally (for debugging)
  4. Return a generic message to the customer (don't expose internals)
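
Repeating that try/except in every route gets tedious, so the four steps can be pulled into one place. Here's a framework-free sketch using a decorator. `ApiError` is a hypothetical stand-in for FastAPI's HTTPException, and `guard` and the "BOOM" trigger are invented for the demo. In a real FastAPI app, the `@app.exception_handler` hook is one way to centralize the same logic.

```python
import functools
import logging

logger = logging.getLogger(__name__)


class ApiError(Exception):
    """Hypothetical stand-in for an intentional HTTP error (404, 409, ...)."""

    def __init__(self, status_code: int, detail: str):
        super().__init__(detail)
        self.status_code = status_code
        self.detail = detail


def guard(handler):
    """Wrap a handler so it always returns (status_code, body)."""
    @functools.wraps(handler)
    def wrapper(*args, **kwargs):
        try:
            return 200, handler(*args, **kwargs)
        except ApiError as err:
            # Step 1: intentional errors pass through unchanged.
            return err.status_code, {"detail": err.detail}
        except Exception:
            # Steps 2 and 3: anything else is a bug; log it with its traceback.
            logger.exception("Unhandled error in %s", handler.__name__)
            # Step 4: the customer only sees a generic message.
            return 500, {"detail": "Internal server error: Unable to process request"}
    return wrapper


products_db = {"WIDGET-001": {"name": "Blue Widget"}}


@guard
def get_product(sku):
    if sku == "BOOM":  # Simulated infrastructure failure for the demo
        raise RuntimeError("Database connection failed")
    if sku not in products_db:
        raise ApiError(404, f"Product with SKU '{sku}' not found")
    return products_db[sku]


print(get_product("WIDGET-001"))
print(get_product("NOPE"))
print(get_product("BOOM"))
```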

For support engineers: When you see 500:

  • This is our bug, not theirs
  • Don't troubleshoot the customer's code
  • Escalate to engineering immediately
  • Ask the customer: "When did this start? Can you share the exact request that failed?"
  • Check if it's happening to multiple customers (system-wide issue) or just one (data-specific issue)

The key insight: 500 means something unexpected broke on our end. The customer can't fix this. We need to fix it.


503 Service Unavailable: The Known Issue

This is where I finally understood why we have both 500 and 503.

The scenario: System is in maintenance mode.

The test:

python

def test_service_maintenance_returns_503(self, client):
    from unittest.mock import patch
    
    # Simulate maintenance mode
    with patch('app.api.is_maintenance_mode') as mock:
        mock.return_value = True
        
        response = client.get('/products/ANY-SKU')
        
        assert response.status_code == 503
        assert 'unavailable' in response.json()['detail'].lower()

The distinction I missed:

| Code | Meaning | Context | Customer Action |
| --- | --- | --- | --- |
| 500 | Unexpected error | We didn't know this would happen (bug) | Wait for us to fix |
| 503 | Expected unavailability | We know the service is down (maintenance, overload) | Retry later |

500 = "Something broke unexpectedly - we're investigating"

503 = "System is temporarily down - we know about it - try again later"
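
In code, that distinction comes down to whether the failure is one you anticipated. Here's a sketch of the mapping, where the exception classes `MaintenanceWindow` and `PoolExhausted` and the Retry-After values are assumptions for illustration. Treating an exhausted connection pool as 503 is a judgment call that follows from reading it as overload rather than a bug.

```python
class MaintenanceWindow(Exception):
    """Hypothetical: raised while planned maintenance is in progress."""


class PoolExhausted(Exception):
    """Hypothetical: raised when no database connection is free."""


def response_for_failure(exc: Exception):
    """Return (status_code, headers, body) for a server-side failure."""
    # Known, temporary conditions: say so, and hint when to come back.
    if isinstance(exc, MaintenanceWindow):
        return (
            503,
            {"Retry-After": "600"},  # Assumed maintenance length, in seconds
            {"detail": "Service temporarily unavailable due to maintenance. Please try again later."},
        )
    if isinstance(exc, PoolExhausted):
        return (
            503,
            {"Retry-After": "5"},  # Assumed recovery time, in seconds
            {"detail": "Service temporarily unavailable. Please try again shortly."},
        )

    # Anything we did not anticipate is a bug: generic 500, no retry hint.
    return 500, {}, {"detail": "Internal server error: Unable to process request"}


print(response_for_failure(PoolExhausted())[0])
print(response_for_failure(KeyError("oops"))[0])
```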

The implementation:

python

MAINTENANCE_MODE = False

def is_maintenance_mode():
    return MAINTENANCE_MODE

@app.get('/products/{sku}')
def get_product(sku: str):
    if is_maintenance_mode():
        raise HTTPException(
            status_code=503,
            detail="Service temporarily unavailable due to maintenance. Please try again later."
        )
    
    # Rest of the function...

For support engineers: When you see 503:

  1. Check your status page (should show planned maintenance)
  2. Tell customer: "System is temporarily down for maintenance/upgrade"
  3. Provide an ETA if available
  4. This is not an emergency escalation (unlike 500)
  5. If there's no planned maintenance, escalate - might be overload/outage

The key insight: 503 is transparent about known issues. 500 is for unknown issues (bugs). Both are server-side, but they communicate different things about what we know and when it'll be fixed.


What This Changed for My Team

Before this project, when my team encountered API errors, the conversation went like this:

Support Engineer: "Customer is getting an error."

Me: "What's the status code?"

Support Engineer: "Um... let me ask them."

[20 minutes later]

Support Engineer: "They said it's a 500."

Me: "Okay, that's a server error. Escalate to engineering."

Support Engineer: "But they say it only happens with certain data..."

Me: "Hmm, might be validation then. Can you get the actual error message?"

[Another 20 minutes of back-and-forth]


After this project, I gave my team a simple decision tree:

See a status code
├─ 2xx (200-299)
│  └─ Success! Customer might be misinterpreting the response
│
├─ 4xx (400-499)
│  ├─ 400 → Request format is wrong (malformed JSON, missing headers)
│  ├─ 404 → Resource doesn't exist (verify SKU/ID)
│  ├─ 409 → Conflict (duplicate data, check for existing resource)
│  ├─ 422 → Validation failed (check error details for which field)
│  └─ 429 → Rate limited (slow down requests)
│
└─ 5xx (500-599)
   ├─ 500 → Unexpected bug → ESCALATE IMMEDIATELY
   └─ 503 → Known outage/maintenance → Check status page
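
The same tree is easy to turn into a lookup if you want it inside a support tool or a chat bot. This is a sketch, with the wording taken from the tree. The `triage` function and its fallback messages for codes the tree doesn't list are assumptions.

```python
TRIAGE = {
    400: "Request format is wrong (malformed JSON, missing headers)",
    404: "Resource doesn't exist (verify SKU/ID)",
    409: "Conflict (duplicate data, check for existing resource)",
    422: "Validation failed (check error details for which field)",
    429: "Rate limited (slow down requests)",
    500: "Unexpected bug: escalate immediately",
    503: "Known outage/maintenance: check status page",
}


def triage(status_code: int) -> str:
    """Map a status code to the first action a support engineer should take."""
    if status_code in TRIAGE:
        return TRIAGE[status_code]
    # Fall back to the class of the code when it isn't listed explicitly.
    if 200 <= status_code <= 299:
        return "Success: customer might be misinterpreting the response"
    if 400 <= status_code <= 499:
        return "Client error: the customer needs to change the request"
    if 500 <= status_code <= 599:
        return "Server error: escalate to engineering"
    return "Not covered by this decision tree"


for code in (204, 409, 418, 502):
    print(code, "->", triage(code))
```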

The result:

  • Triage time dropped from 20+ minutes to under 5 minutes
  • Escalations dropped by ~40% (team handles 4xx errors themselves)
  • Customer satisfaction improved (faster, more accurate responses)

But more importantly: My team understands why different codes exist, not just what they mean.

They can explain to customers:

  • "This is a 409 because you're trying to create a duplicate - here's how to fix it"
  • "This is a 422 because your price field is invalid - here's what we expect"
  • "This is a 500 which means it's our bug - I'm escalating this immediately"

That confidence comes from understanding.


The Patterns I Now See Everywhere

After building this API, I started noticing patterns in how other APIs use status codes. And honestly? Most get it wrong.

Common mistakes I see:

Mistake 1: Returning 500 for validation errors

json

POST /products
{"sku": "ABC", "price": -10}

→ 500 Internal Server Error

Why it's wrong: Validation failure is not an unexpected error. It should be 422.

Why it happens: Lazy error handling - catching all exceptions and returning 500.


Mistake 2: Returning 200 for errors

json

POST /products
{"sku": "ABC", "price": -10}

→ 200 OK
{
  "success": false,
  "error": "Invalid price"
}

Why it's wrong: The request failed! It should return a 4xx code, not 200.

Why it happens: Developers encode status in the response body instead of using HTTP status codes properly.

Why it's bad: API clients can't use standard HTTP libraries to detect errors. They have to parse every response body.
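
To make that concrete, here's a sketch of the check that generic HTTP tooling applies, and how a 200-with-error response slips past it. The `failed` helper and the sample responses are hypothetical. Real client libraries apply the same range test when they raise on error statuses.

```python
def failed(status_code: int) -> bool:
    """The usual library-level test: 4xx and 5xx mean the request failed."""
    return 400 <= status_code <= 599


# (status_code, parsed_body) pairs as a client might receive them
responses = [
    (200, {"success": False, "error": "Invalid price"}),  # Error hidden in the body
    (422, {"detail": "Price must be greater than 0"}),     # Error in the status code
]

# Failures the status code check alone would never notice
missed = [
    body for status, body in responses
    if not failed(status) and body.get("success") is False
]

print(len(missed), "failure(s) only detectable by parsing the body")
```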


Mistake 3: Generic error messages

json

→ 400 Bad Request
{
  "error": "Invalid request"
}

Why it's wrong: This tells you nothing! Which field is invalid? Why?

What it should be:

json

→ 422 Unprocessable Entity
{
  "detail": [
    {
      "loc": ["body", "price"],
      "msg": "Price must be greater than 0",
      "type": "value_error.number.not_gt"
    }
  ]
}

This tells you:

  • What failed (price)
  • Where it failed (request body)
  • Why it failed (must be > 0)
  • How to fix it (send a positive number)

What Good Error Design Looks Like

After implementing 14 different test scenarios, here's what I learned makes a good error response:

1. Use the right status code

Don't return 500 for everything. Use the specific code that describes what went wrong.

2. Include the specifics in the message

Bad: "Product not found"

Good: "Product with SKU 'WIDGET-001' not found"

The customer knows exactly what they sent and can verify it.

3. Be consistent in structure

All your errors should follow the same format:

json

{
  "detail": "Human-readable error message"
}

Or for validation errors:

json

{
  "detail": [
    {
      "loc": ["body", "field_name"],
      "msg": "What's wrong",
      "type": "error_type"
    }
  ]
}

Consistency lets customers (and support teams) build tools to parse errors automatically.

4. Don't expose internals in 5xx errors

Bad:

json

{
  "error": "SQLException: Connection refused at line 47 in database.py"
}

Good:

json

{
  "detail": "Internal server error: Unable to process request"
}

Log the real error internally. Give customers a generic message. Security matters.


My Biggest Takeaways

Building this API taught me more than reading documentation ever could. Here's what stuck:

1. The 4xx/5xx split is about fault, not severity

A 422 validation error isn't "less serious" than a 500. It's just a different kind of error—one the customer can fix versus one we need to fix.

For support teams, this distinction is everything. It tells you who needs to take action.

2. Specificity helps everyone

Using 409 instead of generic 400 saves everyone time. The customer knows exactly what conflicted. The support engineer knows it's a duplicate issue. Engineering knows it's not a validation bug.

More specific status codes = faster resolution.

3. Good error messages are half the battle

The status code tells you the category. The message tells you the details.

409 Conflict: "Product with SKU 'WIDGET-001' already exists"

That's everything you need to troubleshoot. No logs needed. No back-and-forth. Just fix the duplicate SKU.

4. FastAPI makes this easier (but you still need to understand it)

FastAPI's automatic validation and Pydantic models handle a lot of the heavy lifting. But you still need to know when to use 409 vs 422, when to catch exceptions and return 500, when to check for conflicts.

The framework can't make those decisions for you.

5. TDD forced me to think through every scenario

Writing tests first meant I couldn't be vague. I had to commit:

  • "This scenario returns 409, not 400"
  • "This scenario returns 500, not 503"
  • "This error message includes the SKU, not just 'not found'"

Each test was a micro-decision that built my understanding.


One More Thing: The 418 Easter Egg

If you've read this far, you deserve to know about the best HTTP status code of all: 418 I'm a teapot.

Yes, it's real. It comes from an RFC rather than the core HTTP spec, although RFC 9110 does keep 418 reserved so that nothing else can claim it. No, you shouldn't use it in production (but you absolutely should know about it).

The story: In 1998, the IETF published RFC 2324 as an April Fools' joke called "Hyper Text Coffee Pot Control Protocol" (HTCPCP). It defined status code 418 for when a teapot is asked to brew coffee.

python

@app.get('/coffee')
def brew_coffee():
    raise HTTPException(
        status_code=418,
        detail="I'm a teapot. I cannot brew coffee."
    )

Try it yourself: Google "418 I'm a teapot" and watch what happens to the search page.

The joke became so beloved that it's actually implemented in many frameworks. FastAPI supports it. Express supports it. Django supports it.

Why it matters: Sometimes APIs can have personality. A 418 response on a /coffee endpoint shows that the developers care about the craft, have a sense of humor, and know their HTTP history.

For support engineers? If you ever encounter 418 in the wild, you'll know: someone's having fun. (And also, you're probably hitting a test endpoint.)


Want to add 418 to your API? Of course you do. Here's the test:

python

def test_teapot_refuses_to_brew_coffee(self, client):
    """Test the most important status code of all"""
    response = client.get('/coffee')
    
    assert response.status_code == 418
    assert 'teapot' in response.json()['detail'].lower()

Now go forth and write better error messages. And maybe add a teapot endpoint to your next API. 🫖