- Lawrence Berkeley National Laboratory (Berkeley, CA)
- …Lab's (LBNL) Information Technology Division (IT) has an opening for a Senior HPC Cluster Systems Administrator to join their ScienceIT Team! In ... by building, integrating, and maintaining Linux-based resources, high-performance computing cluster systems, and Kubernetes clusters. This role provides…
- NVIDIA Corporation (Santa Clara, CA)
- Senior AI-HPC EDA Cluster ...leadership and strategic mentorship on the management of large-scale HPC systems including the deployment of compute, ... Senior AI-HPC EDA Cluster Engineer locations: US, CA, Santa Clara: US, TX, Austin:...Experience analyzing and tuning performance for a variety of AI/HPC workloads. Excellent problem-solving to analyze complex systems…
- NVIDIA Corporation (Santa Clara, CA)
- …to stand out from the crowd: Experience leading large-scale AI Factory or HPC cluster bring-ups or builds * Hands-on experience with NVIDIA networking products ... Senior Solutions Architect, Cluster Design and...validation and troubleshooting * Proven expertise in designing large-scale distributed systems, AI clusters, or HPC infrastructure * Ability…
- The Voleon Group (Berkeley, CA)
- …multibillion‑dollar asset manager, and we have ambitious goals for the future. As a Senior Cluster Site Reliability Engineer (SRE), you will help scale our ... research compute cluster to meet our growing needs, and you will...in SRE or DevOps roles, preferably working as a senior engineer or tech lead. Knowledge of HPC…
- NVIDIA Corporation (Santa Clara, CA)
- …ETH/IB networking components, storage, etc.) within extensive AI and HPC cluster settings. * Practical knowledge of NVIDIA systems technology such as NCCL, ... Senior Solutions Architect, NVIDIA Cloud Partners page is...with partners and customers. * Experience crafting and deploying large-scale cluster environments. * Practical expertise in data center design, development…
- Ring Inc (San Francisco, CA)
- …networking, observability, security, disaster recovery, and cost management. Familiarity with HPC cluster management software such as Slurm. Familiarity with ... and retrieval workloads. Previous success managing engineering teams delivering production-grade, HPC-scale RAG systems. Deep understanding of infra domains:…
- Ring Inc (Washington, DC)
- …networking, observability, security, disaster recovery, and cost management. Familiarity with HPC cluster management software such as Slurm. Familiarity with ... and retrieval workloads. Previous success managing engineering teams delivering production‑grade, HPC‑scale RAG systems. Deep understanding of infra domains:…
- NVIDIA Corporation (Santa Clara, CA)
- …disability status or any other characteristic protected by law. Similar Jobs (5) Senior Systems Software Engineer, Data Center locations 2 Locations time type ... Senior Software Architect - Data Center Systems...systems, particularly at the SW/HW interface. Understanding of HPC or Deep learning workloads and use of accelerated…
- Zettabyte (Palo Alto, CA)
- …mindset, comfortable with ambiguity and rapid iteration. Bonus qualifications: GPU or HPC cluster management experience. Understanding of ML/AI workload patterns ... world. Why this role exists: We need a Backend Engineer to build the systems that orchestrate GPU clusters for AI workloads. You'll create APIs that handle GPU…
- Slope (Miami, FL)
- …8+ years of progressive, hands‑on experience designing and implementing high-performance storage systems for compute clusters in HPC, AI, or bare‑metal cloud ... lead the architecture, development, and deployment of our next-generation AI/HPC storage platform. The role: As a Storage Engineer,...Lustre, Spectrum Scale, or similar) supporting GPU or AI cluster workloads. Solid foundation in Linux systems…
- Hydra Host, Inc. (Miami, FL)
- …8+ years of progressive, hands-on experience designing and implementing high-performance storage systems for compute clusters in HPC, AI, or bare-metal cloud ... (WekaIO, BeeGFS, Lustre, Spectrum Scale, or similar) supporting GPU or AI cluster workloads. Solid foundation in Linux systems engineering, automation, and…
- Hamilton Barnes Associates Limited (San Francisco, CA)
- …systems. Requirements: 5+ years' experience building large-scale, fault-tolerant distributed systems (ML inference, HPC, or similar). Proficiency in Python, ... multi-cluster environments. Contributions to open-source ML or inference systems projects. Proven track record of cost optimisation in high-performance compute…
- Fluidstack (San Francisco, CA)
- …infrastructure. We treat our customers' outcomes as our own, taking pride in the systems we build and the trust we earn. If you're motivated by purpose, obsessed ... join us in building what's next. About the Role: Senior / Staff SREs at Fluidstack sit at the...networking, platform engineering, and data center operations to build systems that scale with the demands of AI workloads.…
- Promote Project (Santa Clara, CA)
- …are seeking a distributed software engineer to join our team! As a Senior engineer, you'll be instrumental in developing and optimizing AI infrastructure services to ... on: Developing solutions at the intersection of machine learning, distributed systems, and high-performance computing, contributing to the advancement of AI…
- NVIDIA Corporation (Santa Clara, CA)
- …GPU Computing. We are passionate about markets include gaming, automotive, vision, HPC, datacenters and networking in addition to our traditional OEM business. ... integration, strong Linux experience, reliability testing with various telemetries, scale out cluster, test plan development, track record in developing AI tools and…
- Boson AI (Palo Alto, CA)
- …technologies as we continue to scale. Responsibilities: Manage and optimize HPC cluster operations. Deploy and maintain infrastructure‑as‑code solutions. Support ... About The Role: We're looking for a Senior Site Reliability Engineer to help us run...Minimum Qualifications: 5+ years of experience in SRE or HPC operations. Proficiency in Linux systems administration…
- Recruiting From Scratch (San Francisco, CA)
- …to oversubscription, checkpointing, or distributed compute scheduling. Background in HPC, storage systems, virtualization, or cloud infrastructure. Experience ... and candidates. https://www.recruitingfromscratch.com/ Title of Role: Software Engineer (C++ Systems) Location: San Francisco, CA (On-site) Company Stage of Funding:…
- Qualcomm (San Diego, CA)
- …strategy and end‑to‑end operation of our global EDA engineering compute estate, covering HPC grids, intelligent job & license scheduling, utilization analytics, ... to reduce denials, increase throughput and fairness, and simplify cross‑cluster feature management. Telemetry & observability. Build a converged metrics/logs/traces…
- Fluidstack (Seattle, WA)
- …architecture, CUDA toolkit, GPU drivers, monitoring tools (nvidia-smi, DCGM) Experience with HPC cluster management, job schedulers (Slurm, PBS, LSF), and ... customers' outcomes as our own, taking pride in the systems we build and the trust we earn. If...practical experience) 3+ years (System Engineer) or 5+ years (Senior System Engineer) in Linux system administration, datacenter operations,…
- Support Revolution (San Jose, CA)
- …for Data Center, Cloud Computing, Enterprise IT, Hadoop/Big Data, Hyperscale, HPC and IoT/Embedded customers worldwide. We are the #5 fastest growing company ... us. Job Summary: Supermicro is looking for a passionate senior network validation engineer in our San Jose office...work with cutting-edge technology, providing expert guidance on AI cluster networking. This role supports internal lab and rack…