Python for AI Engineers — From LangChain to Production

10-Mar-2026 | By Mamina Suman


Python — The Language of AI

Python remains the dominant language for AI engineering in 2026. Its ecosystem has matured: LangChain, LlamaIndex, DSPy, and other frameworks provide high-level abstractions for building AI applications. Moving from prototype to production, however, requires understanding the full stack around the model call: the API layer, task queues, caching, retrieval, monitoring, and guardrails.

LangChain in 2026

LangChain has evolved from a simple chaining library to a comprehensive framework for building AI applications:

  • LangChain Core — Base abstractions for chains, agents, and tools
  • LangGraph — Build stateful, multi-actor AI applications as graphs
  • LangSmith — Observability, testing, and evaluation platform
  • LangServe — Expose chains as REST endpoints on a FastAPI app (check its current maintenance status before adopting it for a new project)

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant."),
    ("user", "{input}")
])

chain = prompt | ChatOpenAI(model="gpt-4o") | StrOutputParser()
result = chain.invoke({"input": "Explain microservices in one paragraph"})

Production Architecture

A production AI application stack in 2026:

  • API Layer — FastAPI with async endpoints for LLM calls
  • Task Queue — Celery or Dramatiq for long-running AI tasks
  • Caching — Redis for prompt/response caching (semantic cache)
  • Vector Store — Qdrant or pgvector for RAG retrieval
  • Monitoring — LangSmith + Prometheus for cost tracking and latency monitoring
  • Guardrails — NeMo Guardrails or Guardrails AI for output validation

Key Production Concerns

When deploying AI to production, focus on: structured output parsing (Pydantic models), retry logic with exponential backoff, token usage tracking and budgets, prompt versioning, A/B testing of prompts, and comprehensive logging of all LLM interactions for debugging and compliance.
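
Two of these concerns, retry logic with exponential backoff and token budgets, are small enough to sketch with the standard library alone. The names below (`TransientLLMError`, `call_with_backoff`, `TokenBudget`) are hypothetical and not part of LangChain or the OpenAI SDK; in a real service the retried exception would be whatever rate-limit or timeout error your client raises, and the budget would probably live in Redis rather than in process memory.

```python
import random
import time


class TransientLLMError(Exception):
    """Hypothetical stand-in for a retryable failure (rate limit, timeout)."""


def call_with_backoff(fn, max_attempts=4, base_delay=0.5, max_delay=8.0, sleep=time.sleep):
    """Call fn(), retrying on TransientLLMError with exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except TransientLLMError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # The delay ceiling doubles each attempt and is capped at max_delay.
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            # Full jitter: sleep a random amount up to the ceiling so that
            # many clients do not retry at the same instant.
            sleep(random.uniform(0, ceiling))


class TokenBudget:
    """Tracks tokens spent per user against a fixed limit (in memory only)."""

    def __init__(self, limit_per_user):
        self.limit_per_user = limit_per_user
        self.spent = {}

    def remaining(self, user_id):
        return self.limit_per_user - self.spent.get(user_id, 0)

    def record(self, user_id, tokens):
        """Record usage, or raise without recording if it would exceed the budget."""
        if tokens > self.remaining(user_id):
            raise RuntimeError(f"token budget exceeded for user {user_id}")
        self.spent[user_id] = self.spent.get(user_id, 0) + tokens
```

Passing `sleep` as a parameter keeps the retry loop testable without real waiting. Checking the budget before the LLM call, using an estimate of the request size, is a reasonable variation if you want to reject requests before paying for them.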

FastAPI Production Example

Here's a FastAPI endpoint that combines error handling, exact-match response caching in Redis, and token-usage logging:


import asyncio
import hashlib
import logging

import redis.asyncio as redis
from fastapi import FastAPI, HTTPException
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from pydantic import BaseModel

app = FastAPI()
# Async Redis client, so cache calls do not block the event loop
redis_client = redis.Redis(host="localhost", port=6379, decode_responses=True)
logger = logging.getLogger(__name__)

CACHE_TTL_SECONDS = 86400  # 24 hours
LLM_TIMEOUT_SECONDS = 30

# Build the LLM chain once at import time instead of on every request
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant."),
    ("user", "{input}")
])
llm = ChatOpenAI(model="gpt-4o", temperature=0)
chain = prompt | llm

class QueryRequest(BaseModel):
    query: str
    user_id: str

class QueryResponse(BaseModel):
    answer: str
    tokens_used: int
    cached: bool

@app.post("/api/query", response_model=QueryResponse)
async def process_query(request: QueryRequest):
    # Generate an exact-match cache key from the query text
    cache_key = "llm:" + hashlib.sha256(request.query.encode()).hexdigest()

    try:
        # Check cache
        cached_result = await redis_client.get(cache_key)
        if cached_result is not None:
            logger.info("Cache hit for user %s", request.user_id)
            return QueryResponse(answer=cached_result, tokens_used=0, cached=True)

        # Invoke with timeout
        result = await asyncio.wait_for(
            chain.ainvoke({"input": request.query}),
            timeout=LLM_TIMEOUT_SECONDS,
        )

        # Cache result (24 hour TTL)
        await redis_client.setex(cache_key, CACHE_TTL_SECONDS, result.content)

        # Log token usage
        tokens = result.response_metadata.get("token_usage", {}).get("total_tokens", 0)
        logger.info("Query processed for user %s, tokens: %s", request.user_id, tokens)

        return QueryResponse(answer=result.content, tokens_used=tokens, cached=False)
    except asyncio.TimeoutError:
        logger.warning("LLM call timed out for user %s", request.user_id)
        raise HTTPException(status_code=504, detail="LLM request timed out")
    except Exception:
        # logger.exception records the full traceback for debugging
        logger.exception("Error processing query")
        raise HTTPException(status_code=500, detail="Internal server error")
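
A cache key built from the query text alone keeps serving old answers after the model or the system prompt changes. One way to avoid that, assuming you assign a version label to each prompt, is to hash the model name and prompt version together with the query. The helper name `make_cache_key`, the `llmcache` namespace, and the version strings below are illustrative assumptions, not part of any library.

```python
import hashlib
import json


def make_cache_key(query, model, prompt_version, namespace="llmcache"):
    """Build a Redis key that changes whenever the query, model, or prompt version changes."""
    payload = json.dumps(
        {
            "model": model,
            "prompt_version": prompt_version,
            # Collapse runs of whitespace so trivially different inputs share a key.
            "query": " ".join(query.split()),
        },
        sort_keys=True,  # stable field order gives a stable hash
    )
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return f"{namespace}:{digest}"


key_v1 = make_cache_key("Explain  microservices", model="gpt-4o", prompt_version="v1")
key_v2 = make_cache_key("Explain microservices", model="gpt-4o", prompt_version="v2")
```

Here `key_v1` and `key_v2` differ only because the prompt version differs, so bumping the version invalidates earlier entries without a manual cache flush. The namespace prefix also makes it possible to find or delete these keys as a group.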

Cost Optimization Strategies

In production, LLM costs can spiral quickly. Here are strategies I've used to reduce costs by 70-80%:

  • Semantic Caching — Cache similar queries using embedding similarity, not just exact matches
  • Prompt Compression — Use LLMLingua or similar tools to compress prompts while preserving meaning
  • Model Routing — Route simple queries to cheaper models (for example GPT-4o mini or Claude Haiku) and complex ones to a larger model such as GPT-4o
  • Streaming Responses — Stream tokens to users for better UX and allow early termination
  • Batch Processing — Use OpenAI batch API for non-real-time workloads (50% cost reduction)
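
Model routing from the list above can start as a plain heuristic before you invest in a learned classifier. The sketch below is one possible shape, not a tuned router: the model names, the keyword list, and the 40-word cutoff are all placeholder assumptions to replace with values measured on your own traffic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    model: str
    reason: str


# Placeholder names: substitute the cheap and strong models you actually use.
CHEAP_MODEL = "cheap-model"
STRONG_MODEL = "strong-model"

# Assumed phrases that suggest multi-step reasoning is needed.
COMPLEX_HINTS = ("step by step", "analyze", "compare", "prove", "refactor")


def route_query(query, max_cheap_words=40):
    """Pick a model for the query using length and keyword heuristics."""
    text = query.lower()
    if len(text.split()) > max_cheap_words:
        return Route(STRONG_MODEL, "long query")
    for hint in COMPLEX_HINTS:
        if hint in text:
            return Route(STRONG_MODEL, f"matched hint: {hint}")
    return Route(CHEAP_MODEL, "short query with no complexity hints")
```

Returning the reason alongside the model makes routing decisions easy to log, which helps when you later compare answer quality and cost per route.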

We implemented semantic caching using pgvector and saw cache hit rates of 40-50% in production, which translated to significant cost savings. The key is setting the similarity threshold correctly: set it too high and you miss cache hits, set it too low and you return irrelevant cached responses.
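
The threshold trade-off is easier to see in a toy version. The sketch below is an in-memory illustration, not the pgvector setup described above: it does a linear scan over stored embeddings, and the two-dimensional vectors in the usage lines are hand-written stand-ins for real embedding model output. The class name `SemanticCache` and the 0.9 default are assumptions for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class SemanticCache:
    """Returns a cached answer when a stored query embedding is similar enough."""

    def __init__(self, threshold=0.9):
        self.threshold = threshold
        self.entries = []  # list of (embedding, answer) pairs

    def put(self, embedding, answer):
        self.entries.append((list(embedding), answer))

    def get(self, embedding):
        """Return the best-matching answer at or above the threshold, else None."""
        best_score, best_answer = 0.0, None
        for cached_embedding, answer in self.entries:
            score = cosine_similarity(embedding, cached_embedding)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self.threshold else None


strict = SemanticCache(threshold=0.9)
strict.put([1.0, 0.0], "cached answer")
near_hit = strict.get([0.99, 0.1])   # nearly the same direction: served from cache
loose_miss = strict.get([0.7, 0.7])  # similarity about 0.71: below 0.9, so a miss
```

Lowering the threshold to 0.7 would turn the second lookup into a hit, which is exactly the failure mode where a loosely related question receives someone else's answer. Logging the similarity score on every hit gives you the data to tune the threshold.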
