  • Site Reliability Engineer, HPC and LSF

    NVIDIA (Santa Clara, CA)
    …fueled by great technology and amazing people. Today, we're tapping into the unlimited potential of AI to define the next era of computing. An era in which our GPU ... are responsible for the big picture of how our systems relate to each other, we use a breadth ... to both product quality and interesting dynamic day-to-day work. SRE's culture of diversity, intellectual curiosity, problem solving and…
    NVIDIA (01/13/26)
  • Product Manager, Health Automation and Resilience

    NVIDIA (Santa Clara, CA)
    …executives. Ways to Stand Out from the Crowd: + Experience with GPU-accelerated compute, HPC systems, or large-scale AI clusters. + Knowledge of Kubernetes ... at scale. If you are motivated by building foundational systems that enable large AI clusters to ... enablement, and release readiness. + Track trends in observability, SRE practices, distributed systems, and automated operations…
    NVIDIA (01/10/26)
  • MTS - Site Reliability Engineer

    Microsoft Corporation (Redmond, WA)
    …our infrastructure team. In this role, you'll blend software engineering and systems engineering to keep our large-scale distributed AI infrastructure reliable ... + **Reliability & Availability**: Ensure uptime, resiliency, and fault tolerance of AI model training and inference systems. + **Observability**: Design and…
    Microsoft Corporation (12/17/25)
  • Senior System Software Engineer, Firmware

    NVIDIA (Santa Clara, CA)
    …improved workflows and develop new, leading differentiated solutions. You will interact with HPC, OS, GPU compute, and systems specialists to architect, develop ... graphics, and revolutionized parallel computing. More recently, GPU deep learning ignited modern AI - the next era of computing. NVIDIA is a "learning machine" that…
    NVIDIA (01/10/26)
  • Senior Security Engineer, Infrastructure Security…

    NVIDIA (TX)
    …AppArmor, or SELinux). Ways To Stand Out from the Crowd: + HPC/AI Security: Experience securing high-performance computing environments, RDMA-based networks, or ... NVIDIA DGX Cloud is the AI supercomputing-as-a-service substrate designed to power the next ... massive-scale GPU clusters. You will design automated, resilient security systems that help ensure the integrity of our omni-cloud…
    NVIDIA (01/10/26)
  • Senior Product Manager - Observability…

    NVIDIA (Santa Clara, CA)
    …into large‑scale telemetry systems. + Deep knowledge of AI/ML infrastructure, high‑performance computing (HPC), networking, and cloud technologies ... NVIDIA has become the platform upon which every new AI-powered application is built. From healthcare research applications to autonomous vehicles, or…
    NVIDIA (01/06/26)