SRE AI HPC Systems Jobs | Juju - Smarter Job Search

Product Manager, Health Automation and Resilience

NVIDIA (Santa Clara, CA)

…executives. Ways to Stand Out from the Crowd: + Experience with GPU-accelerated compute, HPC systems, or large-scale AI clusters. + Knowledge of Kubernetes ... at scale. If you are motivated by building foundational systems that enable large AI clusters to...enablement, and release readiness. + Track trends in observability, SRE practices, distributed systems, and automated operations…

NVIDIA (01/10/26)
MTS - Site Reliability Engineer

Microsoft Corporation (Redmond, WA)

…our infrastructure team. In this role, you'll blend software engineering and systems engineering to keep our large-scale distributed AI infrastructure reliable ... + **Reliability & Availability**: Ensure uptime, resiliency, and fault tolerance of AI model training and inference systems. + **Observability**: Design and…

Microsoft Corporation (12/17/25)
Senior System Software Engineer, Firmware

NVIDIA (Santa Clara, CA)

…improved workflows and develop new, leading differentiated solutions. You will interact with HPC, OS, GPU compute, and systems specialists to architect, develop ... graphics, and revolutionized parallel computing. More recently, GPU deep learning ignited modern AI - the next era of computing. NVIDIA is a "learning machine" that…

NVIDIA (01/10/26)
Senior Security Engineer, Infrastructure Security…

NVIDIA (TX)

…AppArmor, or SELinux). Ways To Stand Out from the Crowd: + HPC/AI Security: Experience securing high-performance computing environments, RDMA-based networks, or ... NVIDIA DGX Cloud is the AI supercomputing-as-a-service substrate designed to power the next...massive-scale GPU clusters. You will design automated, resilient security systems that help ensure the integrity of our omni-cloud…

NVIDIA (01/10/26)
Senior Product Manager - Observability…

NVIDIA (Santa Clara, CA)

…into large‑scale telemetry systems. + Deep knowledge of AI/ML infrastructure, high‑performance computing (HPC), networking, and cloud technologies ... NVIDIA has become the platform upon which every new AI-powered application is built. From healthcare research applications to autonomous vehicles, or…

NVIDIA (01/06/26)