Implement retry with exponential backoff for LLM API calls (handling rate limits and timeouts).

LLM API calls fail for many reasons: rate limits (429), provider overload (503), timeouts, and network jitter.

Solid retry logic is a baseline requirement in production.

python

import logging
import random
import time
from typing import Callable, TypeVar

from openai import (
    APIConnectionError,
    APITimeoutError,
    InternalServerError,
    OpenAI,
    RateLimitError,
)

T = TypeVar("T")
log = logging.getLogger(__name__)

def retry_with_backoff(
    fn: Callable[[], T],
    *,
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    jitter: bool = True,
    # Transient failures only. The APIError base class also covers 400/401/403,
    # which should not be retried.
    retryable=(RateLimitError, APITimeoutError, APIConnectionError, InternalServerError),
) -> T:
    """Exponential backoff with jitter; honors the Retry-After header."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except retryable as e:
            if attempt == max_attempts - 1:
                raise  # last attempt, bubble up
            
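            # Respect the Retry-After header (typically sent with a 429).
            # It lives on the HTTP response attached to the exception, if any.
            headers = getattr(getattr(e, "response", None), "headers", {})
            try:
                delay = float(headers.get("retry-after"))
            except (TypeError, ValueError):
                # Header absent or not in seconds form: exponential backoff.
                # Upper bounds: 1s, 2s, 4s, 8s, ... capped at max_delay.
                delay = min(base_delay * 2**attempt, max_delay)
                if jitter:
                    # "Full jitter": spreads retries out to avoid a thundering herd
                    delay = random.uniform(0, delay)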
            
            log.warning(
                f"Attempt {attempt+1}/{max_attempts} failed: {e}. "
                f"Retrying in {delay:.1f}s"
            )
            time.sleep(delay)

# --- USAGE ---
# The SDK retries twice by default; disable that so retries are not stacked.
client = OpenAI(max_retries=0)

def call_llm(prompt: str) -> str:
    return retry_with_backoff(
        lambda: client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
            timeout=30,
        ).choices[0].message.content,
        max_attempts=5,
    )
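
The Retry-After header can carry either a number of seconds or an HTTP date. The helper below is a minimal sketch of handling both forms with the standard library only; `parse_retry_after` and its `now` parameter are hypothetical names introduced for illustration, not part of the OpenAI SDK.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def parse_retry_after(value: Optional[str], now: Optional[datetime] = None) -> Optional[float]:
    """Return the delay in seconds requested by a Retry-After value, or None."""
    if not value:
        return None
    value = value.strip()
    try:
        # Form 1: delay in seconds, e.g. "30"
        return max(0.0, float(value))
    except ValueError:
        pass
    try:
        # Form 2: HTTP date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT"
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None  # unparseable: let the caller fall back to backoff
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    # A date in the past means "retry now", never a negative sleep
    return max(0.0, (when - now).total_seconds())
```

Returning None for a missing or malformed value lets the caller fall through to the exponential schedule instead of failing on a bad header.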

Async version (what production code typically uses):

python

import asyncio
import random

from openai import AsyncOpenAI, APITimeoutError, RateLimitError

async def retry_async(fn, max_attempts=5, base=1.0, cap=60.0):
    """Await fn() with full-jitter backoff; fn must return a fresh awaitable per call."""
    for i in range(max_attempts):
        try:
            return await fn()
        except (RateLimitError, APITimeoutError):
            if i == max_attempts - 1:
                raise  # last attempt, bubble up
            delay = random.uniform(0, min(base * 2**i, cap))
            await asyncio.sleep(delay)
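
A common pitfall with an async helper like this is passing a coroutine object rather than a callable: a coroutine can be awaited only once, so every retry needs a fresh one. The sketch below illustrates this with the standard library only; `TransientError`, `flaky_call`, and this simplified `retry_async` are hypothetical stand-ins for the SDK exceptions and the real API call.

```python
import asyncio
import random


class TransientError(Exception):
    """Hypothetical stand-in for RateLimitError / APITimeoutError."""


async def retry_async(fn, max_attempts=5, base=1.0, cap=60.0):
    # fn is a zero-argument callable that returns a NEW awaitable on each call
    for i in range(max_attempts):
        try:
            return await fn()
        except TransientError:
            if i == max_attempts - 1:
                raise
            await asyncio.sleep(random.uniform(0, min(base * 2**i, cap)))


calls = {"count": 0}


async def flaky_call(prompt: str) -> str:
    """Fails twice, then succeeds, to simulate a transient outage."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise TransientError("simulated 429")
    return f"echo: {prompt}"


async def main() -> str:
    # The lambda builds a fresh coroutine per attempt; base is tiny to keep the demo fast
    return await retry_async(lambda: flaky_call("hi"), base=0.01)


result = asyncio.run(main())
```

With a real client the call site would look the same: wrap the request in a lambda so each attempt issues a new request.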

Production-ready libraries (recommended over hand-rolled code):

tenacity, a powerful decorator-based retry library for Python:

python

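from openai import APITimeoutError, RateLimitError
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential

@retry(
    # Plain exponential wait, no jitter; wait_random_exponential adds jitter.
    wait=wait_exponential(multiplier=1, min=1, max=60),
    stop=stop_after_attempt(5),
    retry=retry_if_exception_type((RateLimitError, APITimeoutError)),
    reraise=True,  # re-raise the original error instead of tenacity.RetryError
)
def call_llm(prompt): ...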

backoff: a Python library with a simple decorator API.
OpenAI SDK built-in: recent SDK versions accept a max_retries parameter (the default is 2), e.g. OpenAI(max_retries=5).

Production best practices:

1. Idempotency key (if the provider supports one, e.g. an idempotency_key option): avoids duplicate billing when a retry repeats a request that actually went through.
2. Respect the Retry-After header: if the provider says 30s, do not retry sooner.
3. Jitter: full jitter (a random delay between 0 and the computed backoff) prevents a thundering herd when many clients retry at once.
4. Different strategies per error:
- 429 rate limit → wait as long as Retry-After says.
- 500/503 server error → exponential backoff.
- 400/401/403 → do NOT retry (the request itself is wrong).
- Timeout → retry, but with a bounded number of attempts.
5. Circuit breaker: if the error rate exceeds a threshold, trip the breaker and either fall back to another provider or reject early. Library: pybreaker.
6. Fallback model: when the primary fails, downgrade to another model (GPT-4o → Claude 3.5 Sonnet → Haiku).
7. Retry budget: cap total retries per user or feature to avoid runaway cost.
8. Log with a trace ID: log every attempt with its request_id so failures can be debugged.
9. Metrics: track retry rate and success-after-retry rate; investigate spikes.
10. Deadline budget: user-facing requests have a total latency ceiling (e.g. 10s). Reduce the remaining retry attempts dynamically as the deadline approaches.
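
Items 2, 3, 4 and 10 can be combined into a single decision function. The following is a minimal sketch under stated assumptions, not a definitive policy: `next_delay`, its parameters, and the choice to give up when the wait would overrun the deadline are all illustrative.

```python
import random
from typing import Optional

NON_RETRYABLE = {400, 401, 403}  # request errors: retrying cannot help


def next_delay(
    status: Optional[int],            # HTTP status, or None for a timeout
    attempt: int,                     # 0-based index of the attempt that just failed
    remaining_s: float,               # seconds left in the caller's deadline budget
    retry_after_s: Optional[float] = None,
    base: float = 1.0,
    cap: float = 60.0,
    rng: Optional[random.Random] = None,
) -> Optional[float]:
    """Return how long to sleep before retrying, or None to stop retrying."""
    if status in NON_RETRYABLE:
        return None
    if status == 429 and retry_after_s is not None:
        delay = retry_after_s         # the provider said how long to wait
    else:
        # 5xx, timeouts, and 429 without a hint: full-jitter exponential backoff
        delay = (rng or random).uniform(0, min(base * 2**attempt, cap))
    # Deadline budget: do not start a wait that would overrun the deadline
    if delay >= remaining_s:
        return None
    return delay
```

A caller would loop while `next_delay(...)` returns a number, sleeping for that long, and surface the last error as soon as it returns None.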

Gateway solutions: LiteLLM and Portkey handle retry, fallback, and circuit breaking transparently, so you don't need to write this code yourself.
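
For setups without a gateway, the circuit-breaker idea from item 5 can be sketched in a few lines. This is an illustrative, single-threaded sketch that trips on consecutive failures rather than on an error rate; `CircuitBreaker`, `CircuitOpenError`, and the thresholds are hypothetical, and a library such as pybreaker is likely the better choice in production.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when the breaker rejects a call without trying it."""


class CircuitBreaker:
    def __init__(self, max_failures: int = 3, reset_after_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.clock = clock            # injectable so the cool-down can be tested
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, fn: Callable[[], T]) -> T:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                raise CircuitOpenError("circuit open; rejecting early")
            # Cool-down elapsed: allow one trial call (half-open state);
            # a single failure now re-opens the circuit
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # trip the breaker
            raise
        self.failures = 0             # any success closes the circuit
        return result
```

Catching `CircuitOpenError` at the call site is a natural place to route the request to a fallback provider or model.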

