  • Performance Benchmarking

    Oracle (Seattle, WA)
    …Design and code solutions for performance benchmarking. + Troubleshoot performance problems on RDMA clusters and perform cluster performance ... team strives to be the go-to experts on RDMA cluster architecture and its relationship to AI/ML/HPC performance ... with 5+ years of relevant experience + Experience with benchmarking and troubleshooting or optimizing performance of…
    Oracle (11/25/25)
  • Senior HPC Cluster Engineer - EDA

    NVIDIA (Santa Clara, CA)
    …lasting impact on the world. We are seeking a highly skilled and experienced HPC Cluster Engineer to design, deploy, and operate GPU Compute Clusters for EDA and ... high-performance computing workloads used across multiple teams and projects ... experience crafting and operating large scale compute infrastructure, including cluster configuration management tools such as BCM or Ansible.…
    NVIDIA (12/10/25)
  • Senior AI and ML HPC Cluster

    NVIDIA (Santa Clara, CA)
    …breaking GPU compute clusters that run demanding deep learning, high performance computing, and computationally intensive workloads. We seek a technical leader ... compute, networking, and storage design for large scale, high performance workloads, effective resource utilization in a heterogeneous compute environment,…
    NVIDIA (10/19/25)
  • Senior AI-HPC Cluster Engineer

    NVIDIA (Santa Clara, CA)
    …graphics. Design and implement GPU compute clusters for deep learning and high-performance computing. What you'll be doing: + Provide leadership and strategic ... user needs. + Support our researchers to run their workloads including performance analysis and optimizations. + Conduct root cause analysis and suggest corrective…
    NVIDIA (10/30/25)
  • Senior DGX Cloud Performance

    NVIDIA (Santa Clara, CA)
    performance and AI workloads on large scale systems + Experience with performance modeling and benchmarking at scale + Strong background in Computer ... seeking highly skilled Parallel and Distributed Systems engineers to drive the performance analysis, optimization, and modeling to define the architecture and design…
    NVIDIA (11/21/25)
  • Principal, Software Engineer - Cloud…

    Walmart (Sunnyvale, CA)
    …atop for advanced debugging. + Perform deep analysis of OSD, MON, MDS, RGW performance and optimize cluster parameters. + Debug network congestion, packet loss, ... hardware (NVMe SSDs, RDMA NICs, high-density HDDs) and their impact on storage performance. + Evaluate next-gen server SKUs, perform benchmarking, and compare…
    Walmart (11/20/25)
  • Senior DGX Cloud Performance

    NVIDIA (Santa Clara, CA)
    performance and AI workloads on large scale systems + Experience with performance modeling and benchmarking at scale + Strong background in Computer ... seeking highly skilled Parallel and Distributed Systems engineers to drive the performance analysis, optimization, and modeling to define the architecture and design…
    NVIDIA (10/22/25)
  • Senior MLOps Engineer, GenAI Framework

    NVIDIA (Santa Clara, CA)
    …cloud compute technologies, e.g., SLURM, Lustre, k8s + Software and hardware Benchmarking on high-performance computing systems. #LI-Hybrid Your base salary will ... dedicated and motivated senior build and continuous integration (CI/CD) engineer for its GenAI Frameworks (Megatron-LM (https://github.com/NVIDIA/Megatron-LM) and NeMo…
    NVIDIA (10/15/25)
  • Principal / Sr. Principal HPC Network…

    Northrop Grumman (Jessup, MD)
    …making history. We are looking for you to join our team as a High-Performance Computing (HPC) Network Engineer based out of Annapolis Junction ... Responsibilities + Monitor and maintain performance of network within a high-performance compute cluster + Contribute to design of new high-…
    Northrop Grumman (12/05/25)
  • Senior Software Engineer - NIM Factory…

    NVIDIA (Santa Clara, CA)
    …Kubernetes deployment patterns for NIMs, including GPU scheduling, autoscaling, and multi-cluster rollouts. + Optimize container performance: layer layout, ... understanding different inference backends (vLLM, SGLang, TRT-LLM) + Background in benchmarking and optimizing inference container performance and startup…
    NVIDIA (12/11/25)
  • Senior Software Engineer - NIM Factory…

    NVIDIA (Santa Clara, CA)
    …Kubernetes deployment patterns for NIMs, including GPU scheduling, autoscaling, and multi-cluster rollouts. + Optimize container performance: layer layout, ... understanding different inference backends (vLLM, SGLang, TRT-LLM) + Background in benchmarking and optimizing inference container performance and startup…
    NVIDIA (09/19/25)
  • Staff Software Engineer, Level 6

    Snap Inc. (Seattle, WA)
    …coalescing, and slot-aware load balancing. + Implement robust failover, replication, and cluster topology management and optimize CPU performance, memory usage, ... privacy at the forefront. We're looking for a Software Engineer to join Snap Inc on our Core Infrastructure ... layers or custom client lib). + Develop and maintain high-performance caching proxies or client-side libraries for request…
    Snap Inc. (09/12/25)
  • Sr Manager, Cloud Infrastructure Engineer

    Pfizer (South San Francisco, CA)
    …of production computing platforms. + Perform troubleshooting, system analysis, and benchmarking to resolve issues and maintain a high-performance environment. ... a hands-on approach to designing and delivering robust High Performance Computing (HPC) solutions supporting computational workloads across the organization.…
    Pfizer (12/03/25)
  • Member of Technical Staff, AI Networking

    Microsoft Corporation (Mountain View, CA)
    …Broadcom, and in-house silicon/network co-design teams + AI training + inference cluster bring-up, performance benchmarking, and root-cause analysis + ... Staff, AI Networking to design and scale the world's most advanced high-performance networks powering Copilot and next-generation AI systems. Join the team building…
    Microsoft Corporation (12/11/25)