AI Infrastructure

Expert Infrastructure Recruitment for Teams Building and Operating AI at Scale

DeepRec.ai supports organisations designing, building, and scaling AI infrastructure that underpins production machine learning and inference platforms in use today. Our AI infrastructure practice focuses on supporting companies hiring specialist engineers across compute, platforms, and systems, where architecture, performance, efficiency, and reliability determine whether AI systems succeed outside the lab.

As AI models move into real-world use, AI infrastructure has become the defining challenge of production AI. Organisations are under increasing pressure to provision, orchestrate, and operate compute and data platforms at scale, meeting strict requirements around latency, throughput, cost, and availability. This has driven unprecedented demand for AI infrastructure capability, and for engineers who can build and operate the systems that inference, training, and experimentation depend on.

DeepRec.ai’s recruitment consultants work closely with teams operating at this level of complexity, giving us a clear view of the skills, experience, and systems required to build production-grade AI. Whether that’s AI platform engineering, GPU and accelerator infrastructure, distributed systems, or inference at scale, we connect organisations with AI engineers who can operate effectively in real-world environments.

Hire AI Infrastructure Talent:

Talk to a Consultant

Find a Job in AI Infrastructure:

Explore Careers

Why Do Leading AI Teams Choose DeepRec.ai for AI Infrastructure Hiring?

DeepRec.ai's specialist Infra consultants are trusted by tech pioneers across the UK, Ireland, Germany, Switzerland, and the United States.

Our consultants work directly with teams building and operating AI platforms and infrastructure in production, giving us first-hand exposure to the architectures, constraints, trade-offs, and operational realities involved.

This includes teams working on distributed training and inference, high-performance computing, GPU and accelerator clusters, and AI platform reliability, where system-level performance and infrastructure design are critical to deploying AI systems at scale.

Dedicated AI Infrastructure Delivery Teams

DeepRec.ai operates through dedicated divisions and delivery teams, each focused on a specific area of deep tech. This structure allows our AI infrastructure practice to work with depth and continuity, rather than spreading expertise across unrelated markets.

We Speak Deep Tech

AI infrastructure is not a generic hiring problem. When you need to hire niche AI talent, you need a specialist who speaks deep tech. We know our serving systems from our pipelines, and we know how to talk about them with top-tier candidates.

Cross-Border Hiring Expertise - SECO & AUG Licensed

As part of Trinnovo Group, DeepRec.ai maintains both SECO and AUG licences, enabling us to provide compliant cross-border recruitment and employment services across Switzerland and Germany. In addition to permanent hiring, we can payroll talent in-house and manage the full administrative and compliance burden on behalf of our clients. This is supported by an internal compliance team, ensuring hiring processes remain robust, transparent, and aligned with local regulatory requirements.

A Deep Tech Community

Much of the most in-demand AI infrastructure talent does not engage with traditional hiring channels. Through sustained involvement in the deep tech ecosystem, including events, collaboration, and research, DeepRec.ai maintains close ties to the AI infrastructure community, enabling trusted access to engineers and technical leaders who are typically difficult to reach through conventional recruitment. Find out more about DeepRec.ai's social hub here: https://www.deeprec.ai/community.

A Perfect Client Net Promoter Score (+100)

DeepRec.ai maintains a client Net Promoter Score of +100 based on client feedback, a reflection of consistent delivery, clear communication, and long-term partnerships built on trust. For our clients, this typically means a recruitment experience that is focused, technically credible, and aligned with the realities of hiring in complex, talent-constrained deep tech markets.

AI Infrastructure Salary Guide

Q1 2026 base salary benchmarks for ML systems, infrastructure, distributed training, model serving, inference, performance, MLOps, and platform engineering roles across major US technology markets, built with fresh insights from DeepRec.ai's recent hiring mandates and candidate database.

Read our salary guide here

AI Inference and Serving Model Efficiency

Alongside our broader AI Infrastructure division, DeepRec.ai has a dedicated team focused purely on AI inference and serving efficiency.

As AI systems move from research environments into production, inference becomes the moment of truth. Latency, throughput, cost per request, hardware utilisation, and system reliability all come under pressure at scale. The engineering challenges shift from experimentation to optimisation, from building models to operating them in live, user-facing environments.

Our inference-focused consultants work with teams building high-performance serving systems, real-time and batch inference pipelines, model optimisation frameworks, and accelerator-aware deployment environments. We support organisations hiring engineers who understand quantisation, model compression, distributed inference, GPU scheduling, and system-level efficiency.

If your priority is deploying models reliably and efficiently in production, explore our AI Inference recruitment expertise to see how we support teams operating at this level.

Learn more

Who We Partner With

We work with organisations building, scaling, and operating AI infrastructure in production, ranging from early-stage teams establishing core platforms to scale-ups expanding distributed systems, and enterprises investing in large-scale AI compute and platform capability.

We also work closely with engineers, researchers, and technical leaders who build and operate AI infrastructure. Many of the people we support are not actively looking for new roles, but are open to conversations about work that is technically meaningful, well-resourced, and aligned with how they want to operate.

Our role is to bring these two sides together thoughtfully, matching organisations with engineers where technical context, expectations, and long-term goals are aligned.

If you're interested in exploring a fulfilling new role in AI infrastructure, learning more about current market trends, or you'd like to hire exceptional talent, our consultants are always available to support you. Please get in touch with us directly, and we'll get back to you as soon as possible:

Contact the team

MEET THE TEAM

Anthony Kelly

Co-Founder & MD EU/UK

Sam Warwick

Senior Consultant - ML Systems + AI Infra

Edward Killin

Principal Recruitment Consultant

Luke Weekes

Senior Consultant

Berlin, Germany

Senior Inference Optimization Engineer

Permanent

€150,000 - €200,000 per annum

About our client:

Our client is a fast-scaling automation platform that operates cloud-native and AI infrastructure at scale. By embedding autonomous decision-making directly into Kubernetes and cloud environments, the platform continuously optimizes performance, reliability, and efficiency in production, replacing tickets, alerts, and manual tuning with continuous automation that adapts infrastructure as conditions change.

The company is trusted by over two thousand organizations, including a number of globally recognized enterprises across technology, automotive, media, and financial services. It operates as a distributed, international team spanning more than thirty countries across Europe, North America, Latin America, and APAC. The business recently reached unicorn status following a strategic investment from a major corporate venture arm, with a valuation now in excess of one billion dollars and strong momentum behind its next phase of growth.

About the role:

Throughput. Latency. KV cache utilization. Move those three numbers in the right direction, and two things happen. Customers get faster, cheaper inference, and our client's margins improve. That is the entire thesis of this role. Every kernel you tune, every quantization scheme you ship, and every scheduler tweak you land shows up directly in a customer's p99 and on the P&L. This is a high-impact seat, and a high-autonomy one. You will be given the room to lead the technical direction of inference optimization rather than execute someone else's roadmap.

The problem is that running LLMs in production is a moving target. The right model and serving configuration for a workload depend on traffic shape, sequence-length distribution, batch dynamics, GPU SKU, memory bandwidth, quantization tolerance, and a dozen other variables that shift week to week. Most teams pick a model once, over-provision GPUs, and absorb the cost. Our client's system makes that decision automatically, continuously matching workloads to the most cost-efficient, best-performing LLM and serving configuration on a customer's infrastructure. The team is building the optimization layer between the model and the hardware, and needs engineers who understand both sides deeply.

Stack:

Python; vLLM; SGLang; TensorRT-LLM; PyTorch; CUDA-adjacent tooling; Kubernetes; gRPC; ClickHouse; PostgreSQL; GCP Pub/Sub; AWS, GCP, and Azure; GitLab CI; ArgoCD; Prometheus; Grafana; Loki; Tempo.

Requirements:

- Five or more years building real ML systems, with a portfolio that shows depth in inference or training infrastructure, not just model training notebooks.
- Strong Python, with experience building production services rather than scripts.
- Hands-on experience with at least one of vLLM, SGLang, or TensorRT-LLM, and a working mental model of why an inference engine performs the way it does on a given GPU.
- Fluency with quantization tradeoffs. You have measured quality regressions, not just compression ratios.
- Comfort with distributed systems, including collective communication, sharding strategies, and the practical failure modes of multi-GPU and multi-node setups.
- A bias toward measurement. You instrument before you optimize, and you can tell the difference between a real win and a benchmark artifact.
- Self-direction. This role comes with a wide mandate, and you should be excited by that rather than unsettled by it.

Responsibilities:

- Push throughput. Continuous batching, speculative decoding, chunked prefill, and kernel-level tuning across vLLM, SGLang, and TensorRT-LLM. Find the ceiling on each GPU SKU, then raise it.
- Cut latency. Attack TTFT and TPOT separately. Profile, identify the actual bottleneck whether compute, memory bandwidth, scheduling, or networking, and fix it rather than the bottleneck you assumed.
- Get more out of the KV cache. Paged attention, prefix caching, eviction policies, cache reuse across requests, and quantized KV. This is where a lot of the unrealized throughput lives, and it is an area you will own.
- Quantize without regressing quality. INT8, INT4, and FP8 across weights, activations, and KV. Empirical work that measures quality on real workloads, not just perplexity benchmarks.
- Shrink cold starts and memory footprint. Faster init, smarter weight loading, and tighter memory accounting, which is the difference between a model that scales and one that does not.
- Scale across nodes. Distributed inference topologies, network-aware placement, and checkpointing strategies that do not bottleneck on storage or interconnect.
- Set the technical direction. Decide what to benchmark, what to adopt, and what to build in-house. Bring the team along with strong writeups and reproducible experiments.

Sam Warwick

Posted 7 days ago

Palo Alto, California, United States

Senior Inference Engineer

Permanent

$200,000 - $300,000 per annum

Senior Inference Engineer

AI Video Generation Company (Stealth) | Palo Alto, CA | Hybrid

About the Role

We are seeking a Senior Inference Engineer to accelerate the performance of our AI-driven video generation products. In this highly technical role, you will operate at the intersection of cutting-edge inference acceleration, GPU parallelism, advanced model deployment, and video generation technologies. Your expertise will drive significant improvements to model speed and efficiency, ensuring our creative AI systems deliver industry-leading user experiences at scale.

You will design and optimize inference pipelines, implement state-of-the-art acceleration techniques, and work closely with researchers and engineers across the team to push the boundaries of what's possible in real-time AI deployment. Your efforts will play a foundational role in powering the next generation of our video and language models.

What You'll Do

- Accelerate Inference: Lead and implement advanced inference acceleration techniques, including attention optimization and quantization for efficient model serving.
- Maximize GPU Parallelism: Engineer and optimize GPU strategies across tensor, sequence, and pipeline parallelism (TP, SP, PP) for maximal efficiency and scalability.
- Programming for Performance: Develop and optimize high-performance computing kernels and distributed workloads using CUDA and NCCL.
- Advance AI Deployment: Collaborate with research and engineering teams to bring state-of-the-art video generation and large language models into production.
- Improve Training Efficiency: Contribute to improvements in model training speed, stability, and resource utilization as part of our deployment lifecycle. (Bonus)
- Technical Excellence: Drive rigorous code reviews, participate in technical discussions, and mentor fellow engineers on best practices in inference and GPU programming.

What We're Looking For

- Experience: 5 years of engineering experience, with a strong track record in inference acceleration and model deployment at scale.
- Inference Mastery: Proven expertise in inference optimization, including quantization, attention acceleration, and deep learning compiler stacks.
- GPU and Parallelism: Deep knowledge of GPU programming (CUDA, NCCL) and experience with SP, TP, PP, and other forms of parallelism for distributed inference.
- AI Domain Knowledge: Familiarity with video generation models and large language models (LLMs).
- Collaboration: Strong cross-discipline communication skills; able to drive shared goals across research and engineering functions.
- Ownership Mindset: Self-driven, solutions-oriented, and capable of managing ambiguity in a fast-paced startup environment.

Nice to Have

- Experience with high-throughput video or real-time streaming model deployment.
- Familiarity with distributed training and optimization toolkits.
- Contributions to open source projects in AI infrastructure or deep learning compilers.
- Startup or rapid prototyping experience.

What We Offer

- Competitive salary commensurate with AI industry benchmarks.
- Equity in a fast-growing company shaping the future of generative AI.
- Comprehensive health benefits, monthly stipends, and company retreats.
- A collaborative, in-office culture focused on building and shipping together.

About the Company

A well-funded, early-stage AI video generation startup headquartered in Palo Alto, CA. The team is building technology to make video creation seamless, intuitive, and universally accessible through the transformative power of AI. Tight-knit and highly energetic, the company values efficiency, intellectual curiosity, and the ambition to make a meaningful impact on the world.

Sam Warwick

Posted 7 days ago

Philadelphia, Pennsylvania, United States

Machine Learning Engineer (Inference Optimization)

Permanent

$250,000 - $450,000 per annum

Machine Learning Engineer – Inference Optimization

Overview

We are looking for a Machine Learning Engineer focused on low-latency inference optimization to help build, tune, and productionize high-performance model serving systems. This role sits at the intersection of machine learning, systems engineering, and GPU performance. You will work on inference workloads where latency, throughput, reliability, and hardware efficiency all matter, and where a deep understanding of modern inference runtimes can meaningfully improve production outcomes.

You will work closely with researchers and engineers to understand model structure, identify inference bottlenecks, and turn research ideas into efficient production systems. The work may involve other types of models, but focuses on transformer-style architectures and structured inference workloads. You will evaluate and tune frameworks and related serving or compilation systems, while also reasoning about GPU execution, memory layout, batching strategies, precision tradeoffs, and end-to-end latency.

What you'll do:

- Design, build, and optimize low-latency inference systems for production machine learning workloads.
- Profile model inference pipelines across model execution, runtime configuration, batching, memory movement, serialization, networking, and I/O.
- Evaluate, integrate, and tune inference runtime systems.
- Improve latency, throughput, and GPU utilization for production inference workloads.
- Build and support benchmarking and profiling tools to compare model variants, hardware targets, runtime configurations, and deployment strategies.
- Debug performance issues involving GPU memory, compute saturation, kernel behavior, CPU/GPU coordination, data movement, and serving-layer overhead.
- Help shape model and system design choices so that research models are efficient to deploy under real latency constraints.
- Where necessary, collaborate with lower-level systems or GPU specialists on custom operators, kernel-level optimization, or hardware-specific performance work.

What we're looking for:

- Experience deploying, optimizing, or operating machine learning inference workloads in production or production-like environments.
- Programming experience in Python, Java, C#, etc., and at least one systems language such as C, C++, Rust, or Go.
- Solid understanding of modern ML frameworks such as PyTorch, including model execution, export, tracing, compilation, and performance profiling.
- Ability to reason about latency, throughput, batching, memory use, GPU utilization, and reliability under real workloads.
- Strong practical judgment around tradeoffs between model quality, latency, throughput, implementation complexity, and maintainability.

Preferred qualifications:

- Experience optimizing inference for latency-sensitive or high-throughput applications.
- Experience with model optimization techniques such as quantization, pruning, distillation, operator fusion, graph lowering, custom operators, or model compilation.
- Exposure to CUDA, Triton language, ROCm, PTX, CuTe, CUTLASS, FlashInfer, or similar low-level GPU programming tools.
- Experience running inference workloads on Kubernetes or GPU clusters, including scheduling, autoscaling, observability, and resource management.
- Background in mathematics, physics, computer science, engineering, statistics, or another technical field.
- Demonstrated ability to improve real-world inference performance beyond a baseline framework implementation.

Sam Warwick

Posted 7 days ago

FIND A JOB