- Oracle (Seattle, WA)
- …Design and code solutions for performance benchmarking. + Troubleshoot performance problems on RDMA clusters and perform cluster performance ... team strives to be the go-to experts on RDMA cluster architecture and its relationship to AI/ML/HPC performance ... with 5+ years of relevant experience + Experience with benchmarking and troubleshooting or optimizing performance of…
- NVIDIA (Santa Clara, CA)
- …lasting impact on the world. We are seeking a highly skilled and experienced HPC Cluster Engineer to design, deploy, and operate GPU Compute Clusters for EDA and ... high-performance computing workloads used across multiple teams and projects. ... experience crafting and operating large scale compute infrastructure, including cluster configuration management tools such as BCM or Ansible.…
- NVIDIA (Santa Clara, CA)
- …breaking GPU compute clusters that run demanding deep learning, high performance computing, and computationally intensive workloads. We seek a technical leader ... compute, networking, and storage design for large scale, high performance workloads, effective resource utilization in a heterogeneous compute environment,…
- NVIDIA (Santa Clara, CA)
- …graphics. Design and implement GPU compute clusters for deep learning and high-performance computing. What you'll be doing: + Provide leadership and strategic ... user needs. + Support our researchers to run their workloads including performance analysis and optimizations. + Conduct root cause analysis and suggest corrective…
- NVIDIA (Santa Clara, CA)
- …performance and AI workloads on large scale systems + Experience with performance modeling and benchmarking at scale + Strong background in Computer ... seeking highly skilled Parallel and Distributed Systems engineers to drive the performance analysis, optimization, and modeling to define the architecture and design…
- Walmart (Sunnyvale, CA)
- …atop for advanced debugging. + Perform deep analysis of OSD, MON, MDS, RGW performance and optimize cluster parameters. + Debug network congestion, packet loss, ... hardware (NVMe SSDs, RDMA NICs, high-density HDDs) and their impact on storage performance. + Evaluate next-gen server SKUs, perform benchmarking, and compare…
- NVIDIA (Santa Clara, CA)
- …cloud compute technologies, e.g., SLURM, Lustre, k8s + Software and hardware benchmarking on high-performance computing systems. #LI-Hybrid Your base salary will ... dedicated and motivated senior build and continuous integration (CI/CD) engineer for its GenAI Frameworks (Megatron-LM (https://github.com/NVIDIA/Megatron-LM) and NeMo…
- Northrop Grumman (Jessup, MD)
- …making history. We are looking for you to join our team as a High-Performance Computing (**HPC**) **Network Engineer** based out of Annapolis Junction ... **Responsibilities** + Monitor and maintain performance of network within a high-performance compute cluster + Contribute to design of new high-…
- NVIDIA (Santa Clara, CA)
- …Kubernetes deployment patterns for NIMs, including GPU scheduling, autoscaling, and multi-cluster rollouts. + Optimize container performance: layer layout, ... understanding different inference backends (vLLM, SGLang, TRT-LLM) + Background in benchmarking and optimizing inference container performance and startup…
- Snap Inc. (Seattle, WA)
- …coalescing, and slot-aware load balancing. + Implement robust failover, replication, and cluster topology management and optimize CPU performance, memory usage, ... privacy at the forefront. We're looking for a Software Engineer to join Snap Inc on our Core Infrastructure ... layers or custom client lib). + Develop and maintain high-performance caching proxies or client side libraries for request…
- Pfizer (South San Francisco, CA)
- …of production computing platforms. + Perform troubleshooting, system analysis, and benchmarking to resolve issues and maintain a high-performance environment. ... a hands-on approach to designing and delivering robust High Performance Computing (HPC) solutions supporting computational workloads across the organization.…
- Microsoft Corporation (Mountain View, CA)
- …Broadcom, and in-house silicon/network co-design teams + AI training + inference cluster bring-up, performance benchmarking, and root-cause analysis + ... Staff, AI Networking to design and scale the world's most advanced high-performance networks powering Copilot and next-generation AI systems. Join the team building…