- Meta (Menlo Park, CA)
- …fabric and host networking, comms lib and scheduling infrastructure. **Required Skills:** AI/HPC Systems Performance Engineer Responsibilities: 1. ... **Summary:** Meta's AI Training and Inference Infrastructure is growing exponentially...workloads that expect a lossless fabric interconnect. To improve performance of these systems we constantly look…
- Meta (Menlo Park, CA)
- …hardware and software components, co-design 15. Experience in developing or debugging AI/HPC systems, performance optimizations, including familiarity ... or supporting production hardware at scale 9. Experience in deploying and productionizing AI/HPC systems and/or related components at scale 10. Experience in…
- NVIDIA (Santa Clara, CA)
- …designing and operating large-scale storage infrastructure. + Experience analyzing and tuning performance for a variety of AI/HPC workloads. + Experience ... join us today! As a member of the GPU AI/HPC Infrastructure team, you will provide leadership...solutions to enable runs of demanding deep learning, high-performance computing, and computationally intensive workloads. We seek an…
- NVIDIA (Santa Clara, CA)
- …group at NVIDIA has openings for software architects in the field of AI and high-performance networking and system software. We research, develop, and ... be doing + Creating proofs-of-concept to evaluate and motivate extensions in AI Frameworks (PyTorch/NEMO), HPC programming models (MPI, OpenSHMEM, PGAS), new…
- Meta (Menlo Park, CA)
- …Meta and externally. **Required Skills:** Research Scientist, Systems ML and HPC - SW/HW Co-Design Responsibilities: 1. Apply High-Performance Computing ( ... Performance team is dedicated to maximizing training performance of Generative AI and recommendation models...HPC) algorithms and techniques to optimize large-scale AI workloads 2. Analyze, benchmark, and optimize large-scale workloads…
- Amazon (Santa Clara, CA)
- …computing and its potential to overcome some of the biggest challenges in High Performance Computing (HPC)? Do you have a unique combination of deep technical ... C++, Python, CUDA, Bash - Deep GPU knowledge in HPC and/or AI/ML frameworks. Preferred Qualifications -...life sciences or related discipline. - Working knowledge of HPC schedulers and distributed/parallel file systems, underlying…
- NVIDIA (Santa Clara, CA)
- …the world. We are looking for an outstanding engineer for a Senior HPC Systems Engineer role for at-scale AI system performance and datacenter ... develop new, leading differentiated solutions. You will interact with HPC, OS, CPU and GPU compute, and systems...debugging and resolving critical software issues for the best AI workload performance at scale. + Specific…
- NVIDIA (Santa Clara, CA)
- …vision? What you will be doing: + Investigate opportunities to improve communication performance by identifying bottlenecks in today's systems. + Design and ... implement new communication technologies to accelerate AI and HPC workloads. + Explore innovative solutions in HW and SW for our next generation platforms as…
- NVIDIA (Santa Clara, CA)
- …improved workflows and develop new, leading differentiated solutions. You will interact with HPC, OS, GPU compute, and systems specialists to architect, develop ... parallel computing. More recently, GPU deep learning ignited modern AI - the next era of computing. NVIDIA is...looking for an outstanding hands-on architect/engineer for a Senior HPC architect role to support deployment and bringup of…
- NVIDIA (Santa Clara, CA)
- …long term maintenance strategy. What you'll be doing: + Design highly available and scalable systems to meet the demands of our HPC clusters + Evaluate new and ... graphics, and revolutionized parallel computing. More recently, GPU deep learning ignited modern AI and enabled the next era of computing. NVIDIA is a "learning…
- NVIDIA (Santa Clara, CA)
- …infrastructure + Passion for solving complex technical challenges and optimizing system performance + Experience with AI/HPC advanced job schedulers, and ... support operational and reliability aspects of large-scale distributed systems with focus on performance at scale,...storage systems like Lustre and GPFS for AI/HPC workloads + Familiarity with deep learning…
- NVIDIA (Santa Clara, CA)
- …expertise will be crucial in driving down cluster downtime towards zero, ensuring that our AI systems remain robust and reliable at all times. What You'll Be ... to embed AI resilience features into their AI frameworks, ensuring seamless integration and optimal performance...or related fields, with a deep understanding of distributed systems and large-scale AI infrastructure. + At…
- NVIDIA (Santa Clara, CA)
- …experience in performance optimization and benchmarking on large-scale distributed systems + Hands-on experience with NVIDIA GPUs, HPC storage, networking, ... NVIDIA is an industry leader with groundbreaking developments in High-Performance Computing, Artificial Intelligence and Visualization. The GPU, our invention,…
- Meta (Menlo Park, CA)
- …following machine learning/deep learning domains: Distributed ML Training, GPU architecture, ML systems, AI infrastructure, high-performance computing, ... large-scale GPU training and inference fleet through an observable, reliable and high-performance distributed AI/GPU communication stack. Currently, one of the…
- NVIDIA (Santa Clara, CA)
- …GH200 superchip provides performance and productivity required for strong scaling for HPC and generative AI workload. Scale out is inherent to design of this ... the world. Today, we are increasingly known as "the AI computing company." We're looking to grow our company...issue closure. + Identify new technologies, features to improve performance, functionality, uptime of GPU systems to…
- Meta (Menlo Park, CA)
- …in high-performance computation. **Required Skills:** Engineering Manager, PyTorch - AI Acceleration Responsibilities: 1. Grow a team of domain experts within ... **Summary:** AI Acceleration is an org within PyTorch. It's...should have strong technical skills - GPU / ML Systems knowledge is preferred, though not required. We work…
- NVIDIA (Santa Clara, CA)
- …in their fields (industry and academia) to perform in-depth analysis and optimization of complex AI and HPC algorithms to ensure the best possible AI ... Artificial Intelligence Would you enjoy researching parallel algorithms to accelerate AI workloads on advanced computer architectures? Is it rewarding to…
- Amazon (Cupertino, CA)
- …operating AWS cloud offerings that enable high performance and scalability in AI/ML and HPC workloads. AWS Infrastructure Services owns the design, planning, ... Want to do industry leading work delivering continuous price performance improvements in the cloud for AI...the current customer experience as well as developing improved systems for future designs. You will work directly with…
- Cisco (San Jose, CA)
- …team engaged in the design, development and execution of tests to qualify network performance for AI/ML capability. In this role you'll have the opportunity to: * ... to build the next generation infrastructure to meet the needs of AI/ML workloads and continuously increasing internet users and application. We are uniquely…