SRE AI HPC Systems Jobs | Juju - Smarter Job Search

Site Reliability Engineer, HPC and LSF

NVIDIA (Santa Clara, CA)

…fueled by great technology and amazing people. Today, we're tapping into the unlimited potential of AI to define the next era of computing. An era in which our GPU ... are responsible for the big picture of how our systems relate to each other, we use a breadth...to both product quality and interesting dynamic day-to-day work. SRE's culture of diversity, intellectual curiosity, problem solving and… more

NVIDIA (01/13/26)
Product Manager, Health Automation and Resilience

NVIDIA (Santa Clara, CA)

…executives. Ways to Stand Out from the Crowd: + Experience with GPU-accelerated compute, HPC systems, or large-scale AI clusters. + Knowledge of Kubernetes ... at scale. If you are motivated by building foundational systems that enable large AI clusters to...enablement, and release readiness. + Track trends in observability, SRE practices, distributed systems, and automated operations… more

NVIDIA (01/10/26)
MTS - Site Reliability Engineer

Microsoft Corporation (Redmond, WA)

…our infrastructure team. In this role, you'll blend software engineering and systems engineering to keep our large-scale distributed AI infrastructure reliable ... + **Reliability & Availability**: Ensure uptime, resiliency, and fault tolerance of AI model training and inference systems. + **Observability**: Design and… more

Microsoft Corporation (12/17/25)
Senior System Software Engineer, Firmware

NVIDIA (Santa Clara, CA)

…improved workflows and develop new, leading differentiated solutions. You will interact with HPC, OS, GPU compute, and systems specialists to architect, develop ... graphics, and revolutionized parallel computing. More recently, GPU deep learning ignited modern AI - the next era of computing. NVIDIA is a "learning machine" that… more

NVIDIA (01/10/26)
Senior Security Engineer, Infrastructure Security…

NVIDIA (TX)

…AppArmor, or SELinux). Ways To Stand Out from the Crowd: + HPC/AI Security: Experience securing high-performance computing environments, RDMA-based networks, or ... NVIDIA DGX Cloud is the AI supercomputing-as-a-service substrate designed to power the next...massive-scale GPU clusters. You will design automated, resilient security systems that help ensure the integrity of our omni-cloud… more

NVIDIA (01/10/26)
Senior Product Manager - Observability…

NVIDIA (Santa Clara, CA)

…into large-scale telemetry systems. + Deep knowledge of AI/ML infrastructure, high-performance computing (HPC), networking, and cloud technologies ... NVIDIA has become the platform upon which every new AI-powered application is built. From healthcare research applications to autonomous vehicles, or… more

NVIDIA (01/06/26)