- NVIDIA (Santa Clara, CA)
- …a lasting impact on the world. We are seeking a highly skilled and experienced HPC Cluster Engineer to design, deploy, and operate GPU Compute Clusters for EDA ... doing: + Provide leadership and strategic mentorship on the management of large-scale HPC systems including the deployment of compute, networking, and storage. +… more
- NVIDIA (Santa Clara, CA)
- …doing: + Provide leadership and strategic guidance on the management of large-scale HPC systems including the deployment of compute, networking, and storage. + ... Centos/RHEL and/or Ubuntu Linux distributions + Solid understanding of cluster configuration management tools such as Ansible, Puppet, Salt...IPoIB and RDMA + Understanding of fast, distributed storage systems like Lustre and GPFS for AI/HPC … more
- NVIDIA (Santa Clara, CA)
- …doing: + Provide leadership and strategic mentorship on the management of large-scale HPC systems including the deployment of compute, networking, and storage. + ... ). + Experience analyzing and tuning performance for a variety of AI/HPC workloads. Excellent problem-solving to analyze complex systems, identify bottlenecks,… more
- Massachusetts Institute of Technology (Cambridge, MA)
- Senior HPC Systems Engineer + Job Number: 25342 + Functional Area: Information Technology + Department: MA Green High Performance Computing Ctr + School ... SENIOR HPC SYSTEMS ENGINEER, The...Requirements REQUIRED: Solid track record in HPC systems administration/engineering including hands-on experience with Linux-based cluster… more
- Texas A&M University System (College Station, TX)
- Job Title Senior HPC Engineer Agency Texas A&M...expertise and consultation for the design and deployment of HPC systems. Get in on the ground ... sensitive requiring US Citizenship. Opportunities to Contribute * Manage large-scale HPC cluster operations, including OS upgrades, firmware patching, and… more
- Mount Sinai Health System (New York, NY)
- …research clinical data warehouse team and a research data services team. The **Senior HPC Administrator, High Performance Computational and Data Ecosystem**, is ... a strong focus on customer service for researchers. The HPC Senior Administrator is an expert troubleshooter...and operations (preferably in a Redhat/CentOS Linux administration, Batch HPC cluster environment) + Must be an… more
- NVIDIA (Santa Clara, CA)
- …to stand out from the crowd: + Experience leading large-scale AI Factory or HPC cluster bring-ups or builds + Hands-on experience with NVIDIA networking products ... world's fastest supercomputers. We are seeking a highly motivated Senior Solutions Architect to join the Cluster ...and troubleshooting + Proven expertise in designing large-scale distributed systems, AI clusters, or HPC infrastructure +… more
- NVIDIA (Santa Clara, CA)
- …level knowledge of Linux system administration and management. + Understanding of cluster management systems (Kubernetes, SLURM) + Understanding of performance, ... with RDMA networking. + Advanced hands-on experience and deep understanding of cluster management systems (Kubernetes, SLURM.) Hands-on experience in Machine… more
- NVIDIA (Santa Clara, CA)
- Join the NVIDIA Deep Learning Frameworks Infrastructure team as a Senior Systems Engineer focusing on High-Performance AI & Networking Applications, committed to ... or AI clusters in production environments. + Understanding of fast, distributed storage systems like Lustre and GPFS for AI/HPC workload. + Experience with… more
- Mount Sinai Health System (New York, NY)
- …and implements backup policies. + Assist in the management and maintenance of HPC cluster and data center work, including troubleshooting for resolving system ... Sinai. The Administrator is the principal technology expert for Windows and Linux systems, and help support high-performance computing (HPC) environment in the… more
- Boston College (Chestnut Hill, MA)
- …departments and institutes. This position primarily provides Linux server administration for the HPC cluster. Additionally this role sets up and manages physical ... hardware (particularly GPU), job scheduling, and trends that improve performance of HPC systems. Helps promote technologies for teaching and research, and… more
- Insight Global (Louisville, CO)
- …space economy. In this position, you will support a handful of programs that use HPC's cluster to do their runs. You will troubleshoot schedulers, Monte Carlo ... the commercial space industry is looking for a Sr Systems Integration Engineer to join their growing corporate IT...experience within Linux 5+ years of experience working with HPC (deploying and maintaining on prem HPC … more
- Oracle (Annapolis, MD)
- …Description** The AI2NE Org strives to be global leaders in the RDMA cluster networking domain and enable seamless, accelerated High-Performance Compute (HPC), ... of state-of-the-art RDMA clusters tailored specifically for AI, ML, HPC workloads. We strive to be the go-to experts...We strive to be the go-to experts in RDMA cluster architecture, leveraging our deep understanding of the unique… more
- NVIDIA (Santa Clara, CA)
- …GPUs, ETH/IB networking components, storage, etc.) within extensive AI and HPC cluster settings. + Practical knowledge of NVIDIA systems technology such as ... with partners and customers. + Experience crafting and deploying large-scale cluster environments. + Practical expertise in data center design, development and… more
- SAIC (Princeton, NJ)
- **Description** SAIC is recruiting for a **Senior Linux Systems Administrator** to join our program as a general technologist to support our NOAA GFDL program in ... cluster. + Experience managing software scalable network file systems and centralized storage appliances. + Experience supporting enterprise logon with… more
- NVIDIA (Santa Clara, CA)
- …team architects and develops the end to end software and firmware stack for these systems. We are looking for a Senior Software Architect who has deep expertise ... , particularly at the SW/HW interface. + Understanding of HPC or Deep learning workloads and use of accelerated...out from the crowd: + Knowledge of cloud and cluster level deployment and management systems. +… more
- NVIDIA (CO)
- …server architecture. In-depth understanding of the different deployment models for GPUs (e.g., HPC, AI cluster, single- or multi-GPU servers). + Experience in Data ... in system software security to join the Data Center Systems Software team. In this role, you will focus...role, you will focus on securing NVIDIA's Data Center Systems. NVIDIA is leading the way in groundbreaking developments… more
- NVIDIA (Santa Clara, CA)
- …crucial in driving down cluster downtime towards zero, ensuring that our AI systems remain robust and reliable at all times. What You'll Be Doing: + Develop AI ... We are now looking for a Senior Software Engineer for AI Resiliency. At NVIDIA,...+ Hands-On Coding & Optimization: Contribute to large-scale distributed systems with high-quality, production-level C++ and Python code. Enhance… more
- NVIDIA (Santa Clara, CA)
- …and telemetry frameworks. + Familiarity with GPU computing (CUDA), large-scale AI/HPC workloads, NVLink, Grace, and cluster-level deployment/management. + ... NVIDIA is seeking a Senior Manager to lead our System Software SWAT...on data center platforms such as GB200/GB300 and next‑generation systems. This elite, cross‑functional group is the rapid‑response hub… more
- NVIDIA (Santa Clara, CA)
- NVIDIA is searching for a senior or principal engineer who specializes in building cutting-edge infrastructure for large-scale foundation model training in the ... What you will be doing: + Design and maintain large-scale distributed training systems to support multi-modal foundation models for robotics. + Optimize GPU and … more