• Product Manager, Health Automation and Resilience

    NVIDIA (Santa Clara, CA)
    …executives. Ways to Stand Out from the Crowd: + Experience with GPU-accelerated compute, HPC systems, or large-scale AI clusters. + Knowledge of Kubernetes ... at scale. If you are motivated by building foundational systems that enable large AI clusters to...enablement, and release readiness. + Track trends in observability, SRE practices, distributed systems, and automated operations…
    NVIDIA (01/10/26)
  • MTS - Site Reliability Engineer

    Microsoft Corporation (Redmond, WA)
    …our infrastructure team. In this role, you'll blend software engineering and systems engineering to keep our large-scale distributed AI infrastructure reliable ... + **Reliability & Availability**: Ensure uptime, resiliency, and fault tolerance of AI model training and inference systems. + **Observability**: Design and…
    Microsoft Corporation (12/17/25)
  • Senior System Software Engineer, Firmware

    NVIDIA (Santa Clara, CA)
    …improved workflows and develop new, leading differentiated solutions. You will interact with HPC, OS, GPU compute, and systems specialists to architect, develop ... graphics, and revolutionized parallel computing. More recently, GPU deep learning ignited modern AI - the next era of computing. NVIDIA is a "learning machine" that…
    NVIDIA (01/10/26)
  • Senior Security Engineer, Infrastructure Security…

    NVIDIA (TX)
    …AppArmor, or SELinux). Ways To Stand Out from the Crowd: + HPC/AI Security: Experience securing high-performance computing environments, RDMA-based networks, or ... NVIDIA DGX Cloud is the AI supercomputing-as-a-service substrate designed to power the next...massive-scale GPU clusters. You will design automated, resilient security systems that help ensure the integrity of our omni-cloud…
    NVIDIA (01/10/26)
  • Senior Product Manager - Observability…

    NVIDIA (Santa Clara, CA)
    …into large‑scale telemetry systems. + Deep knowledge of AI/ML infrastructure, high‑performance computing (HPC), networking, and cloud technologies ... NVIDIA has become the platform upon which every new AI-powered application is built. From healthcare research applications to autonomous vehicles, or…
    NVIDIA (01/06/26)