HPC Reliability Engineer AI Jobs

68 jobs (page 1)

AI and ML HPC Cluster…

NVIDIA (Santa Clara, CA)

…that power some of the world's most advanced computing workloads. NVIDIA is looking for an AI/ML HPC Cluster Engineer to join our MARS team. You will provide ... be doing: Support day-to-day operations of production on-premises and multi-cloud AI/HPC clusters, ensuring system health, user satisfaction, and efficient…

NVIDIA (01/10/26)

Site Reliability Engineer…

NVIDIA (Santa Clara, CA)

…foundational improvements and automation to improve engineer's productivity. As a Site Reliability Engineer, you are responsible for the big picture of how ... fueled by great technology and amazing people. Today, we're tapping into the unlimited potential of AI to define the next era of computing. An era in which our GPU…

NVIDIA (01/13/26)

HPC Sr. Scientific Software Engineer…

Johns Hopkins University (Baltimore, MD)

…and Design: Develop and refine deployment strategies for scientific software on HPC and AI systems. Design computational workflows, selecting optimal ... AI Agents). Performance Optimization: Analyze and optimize the performance of AI models and HPC applications, focusing on GPU-enabled computing.…

Johns Hopkins University (11/21/25)

HPC Scientific Software Engineer…

Johns Hopkins University (Baltimore, MD)

…and Design: Develop and refine deployment strategies for scientific software on HPC and AI systems. Design computational workflows, selecting optimal ... Performance Optimization: Analyze and optimize the performance of AI models and HPC applications, focusing on ... fields, with advanced training in scientific computing. Classified Title: HPC Scientific Software Engineer. Job Posting Title…

Johns Hopkins University (12/04/25)

HPC/AI Platform Engineering

Lilly (Indianapolis, IN)

…Bold - You will bring a high learning agility and Infrastructure availability and reliability Engineer skills to help us enable the Lilly Technology strategy, ... the world. Come help us unlock the power of HPC and AI based POGPU and Accelerated ... Additionally, you would advise with our senior Linux platform engineer directing the global Linux strategy for on-premises private…

Lilly (11/27/25)

Staff Software Engineer, HPC…

Google (Kirkland, WA)

Staff Software Engineer, HPC Solutions. Google, Kirkland, WA, USA. Advanced: Experience owning outcomes and decision making, solving ... future of scientific computing by leading the convergence of AI and HPC. The AI ... Google customers with breakthrough capabilities and insights by delivering AI and Infrastructure at unparalleled scale, efficiency, reliability…

Google (12/20/25)

Senior HPC and Quantum Systems…

NVIDIA (Westford, MA)

…how you can make a lasting impact on the world. We are seeking a Senior HPC & Quantum Systems Engineer to help architect, deploy, and operate a first-of-its-kind ... people. Today, we're tapping into the unlimited potential of AI to define the next era of computing. An ... is not a pure research role nor a traditional HPC admin role; it is a systems engineering position dedicated…

NVIDIA (01/10/26)

Software Engineer, HPC, Platform…

Google (Kirkland, WA)

Software Engineer, HPC, Platform Readiness, Workload Performance. Google, Kirkland, WA, USA. Advanced: Experience owning outcomes and ... on and is growing every day. As a software engineer, you will work on a specific project critical ... Google customers with breakthrough capabilities and insights by delivering AI and Infrastructure at unparalleled scale, efficiency, reliability…

Google (12/18/25)

Staff Engineer, Systems HPC

Micron Technology, Inc. (Richardson, TX)

…intelligence, inspiring the world to learn, communicate and advance faster than ever. As an HPC Staff Engineer at Micron, you will join a diverse team of ... You will play a key part in maintaining the reliability and efficiency of Micron's data environment. Responsibilities: ... from candidates as consideration for their employment with Micron. AI alert: Candidates are encouraged to use…

Micron Technology, Inc. (12/09/25)

Senior GPU and HPC Infrastructure…

NVIDIA (Santa Clara, CA)

NVIDIA is hiring engineers to scale up its AI Infrastructure. We expect you to have a strong programming background, knowledge of datacenter hardware, operations, ... and planning abilities. Experience working with High Performance Computing (HPC), GPUs, and high-performance networking (RDMA, InfiniBand, RoCE) are strongly…

NVIDIA (01/08/26)

Staff Quality and Reliability…

Google (Sunnyvale, CA)

…architecture and its integration within AI/ML-driven systems. As a Quality and Reliability Engineer for Google Cloud, you will lead the development of ... Staff Quality and Reliability Engineer, Google Cloud. Google ... Google customers with breakthrough capabilities and insights by delivering AI and Infrastructure at unparalleled scale, efficiency, reliability…

Google (12/30/25)

Senior Software Engineer - AI…

Bloomberg (New York, NY)

…for overseeing the ongoing monitoring, support, and maintenance of our HPC/AI clusters, ensuring peak performance and reliability. We'll trust you to: ... Senior Software Engineer - AI Hardware. Location: New York ... ongoing monitoring, support, and maintenance of our HPC/AI clusters, ensuring peak performance and reliability…

Bloomberg (12/18/25)

MTS - Site Reliability Engineer

Microsoft Corporation (Redmond, WA)

…so that everyone can realize its benefits. We're looking for an experienced Site Reliability Engineer (SRE) to join our infrastructure team. In this role, ... Overview: As Microsoft continues to push the boundaries of AI, we are on the lookout for passionate individuals to work with us on the most interesting and…

Microsoft Corporation (12/17/25)

Principal Mechanical Reliability…

Dell Technologies (Austin, TX)

Principal Mechanical Reliability Engineer: Mechanical Engineering leads and delivers the development of innovative and compliant mechanical design solutions, ... make a profound social impact as a Principal Mechanical Reliability Engineer on our Mechanical Engineering Team ... be instrumental in delivering advanced liquid cooling solutions for AI, HPC, and enterprise server markets. Your…

Dell Technologies (11/19/25)

Sr Principal Software Engineer, Networking…

Oracle (Cheyenne, WY)

…AI Infrastructure Innovation team is pioneering the creation of next-generation AI/HPC networking for GPU superclusters at massive scale. Our mission is ... system design, and implementation for high-performance RDMA solutions across OCI's AI/HPC platforms, including frontend and backend fabrics. Innovate…

Oracle (12/20/25)

AI/ML Infrastructure Engineer

Oracle (Lincoln, NE)

…solutions across Oracle's enterprise customers. We are seeking a highly skilled AI/ML Infrastructure Engineer to design, build, and support the systems, ... troubleshooting, and best practices. Stay current with emerging trends in AI infrastructure, agent frameworks, HPC systems, and cloud-native technologies;…

Oracle (01/13/26)

Principal Network Engineer - DC…

NVIDIA (Santa Clara, CA)

…a passionate engineer who will solve networking problems for scalable AI clusters. This is a hands-on network engineering position focused on the architecture, ... and deployment of global-scale DCs inter-connects and fabric for HPC, AI, and GPU computing clusters. ... reliability. Partner with system, OS, GPU, and HPC teams to deliver scalable, highly available networks for…

NVIDIA (01/10/26)

Senior Principal Software Development…

Oracle (Springfield, IL)

…Forward Deployed Engineer (FDE) team is hiring a Senior Principal Software Development Engineer - AI Data Platform to help global customers unlock the full ... to streamline the adoption of Oracle AI Data Platform and Gen AI services. Optimize performance, scalability, and reliability of distributed data/AI…

Oracle (01/11/26)

Consulting Member of Technical Staff - AI…

Oracle (Santa Clara, CA)

…and debug software programs for databases, applications, tools, networks etc. As an AI/ML Infrastructure Engineer on the GPU Strategic Customers Engineering team, ... or Scala. Proven experience designing, implementing, and managing infrastructure for AI/ML or HPC workloads. Understanding machine learning frameworks and…

Oracle (12/05/25)

Senior Principal Software Engineer…

Oracle (Austin, TX)

…automation, and diagnostic services. These are essential for running distributed AI/ML/HPC workloads across thousands of GPUs, leveraging technologies like ... looking for a highly skilled and motivated distributed systems engineer who can architect solutions to scale and optimize Monitoring and Repair solutions for AI infrastructure components like GPU control plane and GPU…

Oracle (01/03/26)