The Gap Between Working Code and Production Code
Working code successfully posts some credits and fetches some participants. Production code runs automatically, unattended, at 2 AM when the office is closed. Network hiccups happen. API deployments happen. You need to handle all failure modes gracefully.
The difference:
- Working code: "If the request fails, the program crashes."
- Production code: "If the request fails, we log it, retry intelligently, alert the ops team if it's truly broken, and the job completes with a summary."
Rate Limits: Understanding 429
The SAP SuccessFactors IM API enforces rate limits. If you make too many requests too fast, you get HTTP 429 Too Many Requests:
```
HTTP/1.1 429 Too Many Requests
Retry-After: 30

{
  "error": "Rate limit exceeded. Max 100 requests per minute."
}
```
Common causes of rate limiting:
- Fetching a new token per request: Token requests count toward your limit. Without caching, 10,000 credits = 10,000 token requests. Instant rate limit.
- Individual POSTs instead of batch: 10,000 individual credit POSTs = 10,000 API requests. Batch into 100-per-request = 100 requests.
- Not following pagination correctly: Making multiple requests for the same data because you didn't follow @odata.nextLink.
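The batching arithmetic above is easy to get right with a small helper. A minimal sketch (the `chunk` helper and the sample payloads are illustrative, not part of the IM API):

```python
def chunk(items, size=100):
    """Split a list into consecutive batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]

# 10,000 credits become 100 batch requests instead of 10,000 individual POSTs
credits = [{"sourceRef": f"ORD-{n}"} for n in range(10_000)]
batches = chunk(credits)
print(len(batches))  # 100
```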
Exponential Backoff: The Retry Pattern
When you get 429 (or 5xx), don't retry immediately. Back off exponentially:
```python
import requests
import time

def post_with_backoff(url, headers, payload, max_retries=3):
    """POST with exponential backoff for 429 and 5xx errors."""
    for attempt in range(max_retries):
        try:
            resp = requests.post(url, json=payload, headers=headers, timeout=10)
            # 4xx errors (except 429) are not retriable
            if 400 <= resp.status_code < 500 and resp.status_code != 429:
                raise Exception(f"Non-retriable 4xx: {resp.status_code}")
            resp.raise_for_status()
            return resp
        except requests.exceptions.HTTPError as e:
            # 429 or 5xx -- retriable
            if e.response.status_code in (429, 500, 502, 503, 504):
                # Exponential backoff: 1s, 2s, 4s, ...
                wait_time = 2 ** attempt
                # Honor the Retry-After header if present
                retry_after = e.response.headers.get("Retry-After")
                if retry_after:
                    wait_time = int(retry_after)
                if attempt < max_retries - 1:
                    print(f"Error {e.response.status_code}, retrying in {wait_time}s...")
                    time.sleep(wait_time)
                else:
                    raise
            else:
                # Other HTTP errors -- don't retry
                raise
```
Token Refresh Mid-Job: Handling 401
Your token expires in ~30 minutes. Long-running jobs might span that boundary. When you get 401 Unauthorized mid-job:
```python
import requests

def api_call_with_token_refresh(token_mgr, method, url, **kwargs):
    """Make an API call, refresh the token on 401, retry once."""
    headers = kwargs.get("headers", {})
    headers["Authorization"] = f"Bearer {token_mgr.get_token()}"
    kwargs["headers"] = headers

    resp = requests.request(method, url, **kwargs)

    # On 401, the token likely expired mid-request
    if resp.status_code == 401:
        print("Token expired, refreshing...")
        # Invalidate the cache and fetch a new token
        token_mgr.expires_at = 0
        new_token = token_mgr.get_token()
        # Retry the request with the new token
        headers["Authorization"] = f"Bearer {new_token}"
        resp = requests.request(method, url, **kwargs)

    resp.raise_for_status()
    return resp
```
Logging: What to Capture
Log every integration job run. Structured logging (JSON format) makes it easy to grep and analyze:
```python
import json
import logging
from datetime import datetime

logger = logging.getLogger(__name__)

def log_integration_summary(job_id, start_time, end_time,
                            request_count, success_count, error_count,
                            failed_refs):
    """Log a structured summary of the job run."""
    duration_sec = (end_time - start_time).total_seconds()
    summary = {
        "job_id": job_id,
        "timestamp": datetime.utcnow().isoformat(),
        "start_time": start_time.isoformat(),
        "end_time": end_time.isoformat(),
        "duration_seconds": duration_sec,
        "request_count": request_count,
        "success_count": success_count,
        "error_count": error_count,
        "success_rate": success_count / request_count if request_count > 0 else 0,
        "failed_refs": failed_refs,  # list of sourceRef values that failed
    }
    # Log as JSON (ops tools can parse this easily)
    logger.info(json.dumps(summary))
```
Idempotency: The sourceRef Contract
Idempotency is non-negotiable for production integrations. Your job crashes. You rerun it. Same sourceRef = already exists in IM = 409 Conflict = safe to skip or handle gracefully.
Implementation:
- Set sourceRef to a stable ID from your source system: CRM Order ID, ERP Posting ID, etc. Not a UUID or timestamp (those change on each run).
- On 409 Conflict: Log it, don't treat it as a failure. The record already exists.
- On job restart: Rerun the entire batch. Records that were posted before the crash return 409 (already exists) and are skipped; records that never made it through are posted normally. The rerun converges to the same final state with no duplicates.
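The rerun behavior above can be sketched without a live API call. A minimal example (status codes are simulated; the function and record names are illustrative) showing how a rerun treats 409 as idempotent success rather than failure:

```python
def handle_post_status(status_code, source_ref, posted, duplicates, failures):
    """Classify one credit POST result for an idempotent rerun.

    409 means the sourceRef already exists in IM: count it as a
    duplicate (safe to skip), not a failure.
    """
    if status_code in (200, 201):
        posted.append(source_ref)
    elif status_code == 409:
        duplicates.append(source_ref)  # already posted on a previous run
    else:
        failures.append(source_ref)

# Simulated rerun after a crash: the first two records were already posted
posted, duplicates, failures = [], [], []
for code, ref in [(409, "ORD-98765"), (409, "ORD-98766"), (201, "ORD-98767")]:
    handle_post_status(code, ref, posted, duplicates, failures)

print(posted)      # only the record the crashed run never reached
print(duplicates)  # safe skips, not errors
```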
Error Response Handling Decision Table
| Status Code | Meaning | Retriable? | Action |
|---|---|---|---|
| 4xx (except 429) | Client error. Request is invalid. | No | Log the error, move on. Retrying won't fix an invalid request. |
| 400 | Bad Request | No | Check the payload. Missing required field? Wrong format? |
| 401 | Unauthorized | Yes | Token expired. Fetch new token, retry once. |
| 403 | Forbidden | No | Insufficient permissions. No retry. |
| 404 | Not Found | No | Resource doesn't exist. Validate IDs. |
| 409 | Conflict | No | Duplicate sourceRef. Already exists. Log and move on (idempotent success). |
| 422 | Unprocessable Entity | No | Semantic error (e.g., participant not eligible for this period). Investigate, don't retry. |
| 429 | Too Many Requests | Yes | Rate limit hit. Exponential backoff (1s, 2s, 4s, 8s). Retry. |
| 5xx | Server Error | Yes | Transient. Exponential backoff. Retry. |
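The decision table translates directly into a dispatch function. A minimal sketch (the function name and action strings are illustrative):

```python
def retry_decision(status_code):
    """Mirror the decision table: (retriable?, action) for a status code."""
    if status_code == 429:
        return True, "rate limited: exponential backoff, honor Retry-After"
    if status_code == 401:
        return True, "token expired: refresh token, retry once"
    if 500 <= status_code < 600:
        return True, "transient server error: exponential backoff"
    if status_code == 409:
        return False, "duplicate sourceRef: log as idempotent success"
    if 400 <= status_code < 500:
        return False, "client error: log and investigate, retrying cannot fix it"
    return False, "success or unexpected status: no retry"

retriable, action = retry_decision(503)
print(retriable, action)
```

Note the ordering: the 429, 401, and 409 checks must come before the generic 4xx/5xx range checks, or the specific cases would be swallowed by the general ones.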
The Complete Nightly Credit Push Pattern
Here's a production-grade integration script that ties all lessons together:
```python
import json
import logging
import os
import time
import uuid
from datetime import datetime

import requests

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger(__name__)


class TokenManager:
    """OAuth token manager with caching (from Lesson 2)."""

    def __init__(self, client_id, client_secret, scope):
        self.client_id = client_id
        self.client_secret = client_secret
        self.scope = scope
        self.oauth_endpoint = "https://api.sap.com/oauth/token"
        self.access_token = None
        self.expires_at = 0

    def _fetch_new_token(self):
        data = {
            "grant_type": "client_credentials",
            "client_id": self.client_id,
            "client_secret": self.client_secret,
            "scope": self.scope,
        }
        resp = requests.post(self.oauth_endpoint, data=data)
        resp.raise_for_status()
        result = resp.json()
        self.access_token = result["access_token"]
        self.expires_at = time.time() + result["expires_in"]

    def get_token(self):
        # Refresh 60 seconds before expiry to avoid mid-request 401s
        if time.time() >= (self.expires_at - 60):
            self._fetch_new_token()
        return self.access_token


def post_credits_with_backoff(token_mgr, credits_batch):
    """POST a credits batch with exponential backoff for 429/5xx."""
    url = "https://api.sap.com/successfactors/im/credits/batch"
    max_retries = 3
    for attempt in range(max_retries):
        try:
            headers = {
                "Authorization": f"Bearer {token_mgr.get_token()}",
                "Content-Type": "application/json",
            }
            payload = {"value": credits_batch}
            resp = requests.post(url, json=payload, headers=headers, timeout=30)

            # Non-retriable 4xx errors
            if 400 <= resp.status_code < 500 and resp.status_code != 429:
                logger.error(f"Non-retriable error {resp.status_code}: {resp.text}")
                raise Exception(f"4xx error: {resp.status_code}")

            resp.raise_for_status()
            return resp.json()
        except requests.exceptions.RequestException as e:
            # e.response is None for connection errors (no HTTP response)
            status = e.response.status_code if e.response is not None else 0
            if attempt < max_retries - 1 and status in (429, 500, 502, 503, 504):
                wait_time = 2 ** attempt
                logger.warning(f"Retriable error, backoff {wait_time}s: {e}")
                time.sleep(wait_time)
            else:
                raise


def nightly_credit_push():
    """Complete nightly credit push from CRM to IM."""
    job_id = str(uuid.uuid4())
    start_time = datetime.utcnow()
    logger.info(json.dumps({"event": "job_started", "job_id": job_id}))

    # Initialize the token manager from environment variables
    token_mgr = TokenManager(
        client_id=os.getenv("ICM_CLIENT_ID"),
        client_secret=os.getenv("ICM_CLIENT_SECRET"),
        scope="im.write",
    )

    # Step 1: Read from CRM (simulated)
    crm_orders = [
        {"order_id": "ORD-98765", "participant_id": "P001234", "amount": 4500.00},
        {"order_id": "ORD-98766", "participant_id": "P001235", "amount": 2250.00},
    ]

    # Step 2: Format as IM credits (with sourceRef for idempotency)
    credits = []
    for order in crm_orders:
        credits.append({
            "participantId": order["participant_id"],
            "transactionDate": datetime.utcnow().strftime("%Y-%m-%d"),
            "amount": order["amount"],
            "currencyCode": "USD",
            "periodId": "Q2-2026",
            "sourceRef": order["order_id"],  # stable ID for idempotency
        })

    # Step 3: Batch POST (100 per request)
    successes = []
    errors = []
    batch_size = 100
    for i in range(0, len(credits), batch_size):
        batch = credits[i:i + batch_size]
        try:
            result = post_credits_with_backoff(token_mgr, batch)
            results = result.get("results", [])
            # Separate successes and errors
            successes.extend(r for r in results if r["status"] == "CREATED")
            errors.extend(r for r in results if r["status"] == "ERROR")
        except Exception as e:
            logger.error(f"Batch {i // batch_size} failed: {e}")

    # Step 4: Log the summary
    end_time = datetime.utcnow()
    duration_sec = (end_time - start_time).total_seconds()
    failed_refs = [err["sourceRef"] for err in errors]
    summary = {
        "event": "job_completed",
        "job_id": job_id,
        "duration_seconds": duration_sec,
        "request_count": len(credits),
        "success_count": len(successes),
        "error_count": len(errors),
        "failed_refs": failed_refs,
    }
    logger.info(json.dumps(summary))

    # Step 5: Alert if errors exceed the threshold
    if len(errors) > 5:
        logger.critical(f"Job {job_id}: {len(errors)} errors, manual review required")


if __name__ == "__main__":
    nightly_credit_push()
```
Integration Go-Live Checklist
Before going live, verify all production patterns are in place:
- Credentials are not hardcoded. Using environment variables. Secrets manager if available.
- Token caching is implemented. TokenManager or equivalent. Fetch new token only when expires_at - 60 passes.
- Batch operations are used. Credits: POST /credits/batch (100 per request). Not individual POSTs.
- Pagination is correct. GET requests follow @odata.nextLink. Never manually build skip tokens. Always use $orderby when paginating.
- Error handling is comprehensive. 4xx (non-retriable), 5xx/429 (exponential backoff), 401 (token refresh).
- Idempotency is verified. sourceRef is set to stable ID from source system. Tested: rerun job, verify no duplicates created.
- Rate limit handling is in place. Exponential backoff (1s, 2s, 4s, 8s). Respect Retry-After header.
- Logging is structured. JSON format. Each job gets a unique job_id. Log: start, end, duration, success count, error count, failed refs.
- Monitoring/alerting is configured. Alert ops team if error_count > threshold. Monitor logs for patterns of failure.
- Rollback plan exists. If something breaks mid-run, can you revert? Can you re-post the same data safely (idempotency)? Is there a manual escape hatch?
Debugging in Production
When things go wrong at 2 AM, you need good logs:
- Job ID: Unique UUID per run. Search logs by job_id to find all related entries.
- sourceRef in error logs: Which specific records failed? Log the sourceRef so you can look them up in your CRM/ERP.
- Status codes: Always log HTTP status code and response body (sans tokens) when requests fail.
- Duration: Is the job taking much longer than usual? A sign of rate limiting or API slowdown.
- Success rate: 99% success is good. 80% success means something is systematically broken.
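With structured JSON logs, the 2 AM triage above is a filter, not a hunt. A minimal sketch of pulling every entry for one run out of a mixed log stream (the function name and sample lines are illustrative):

```python
import json

def entries_for_job(log_lines, job_id):
    """Filter structured JSON log lines down to a single job run."""
    entries = []
    for line in log_lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines (tracebacks, library noise)
        if entry.get("job_id") == job_id:
            entries.append(entry)
    return entries

logs = [
    '{"event": "job_started", "job_id": "abc-123"}',
    'Traceback (most recent call last): ...',
    '{"event": "job_completed", "job_id": "abc-123", "error_count": 2}',
    '{"event": "job_started", "job_id": "def-456"}',
]
for entry in entries_for_job(logs, "abc-123"):
    print(entry["event"])
```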
Summary
You've completed the REST API learning path:
- Lesson 1: What REST is, HTTP methods, resources, endpoints, JSON, status codes.
- Lesson 2: OAuth 2.0, client credentials, token requests, token caching, security.
- Lesson 3: GET requests, OData query parameters, pagination, performance optimization.
- Lesson 4: POST/PUT/PATCH, batch operations, credit transactions, quota updates, pipeline triggers.
- Lesson 5: Rate limits, exponential backoff, logging, idempotency, complete production pattern.
You can now build production-grade integrations between your CRM, HR, payroll, and BI systems and SAP SuccessFactors IM. The nightly credit push pattern is the template for any ICM integration.