• AI / HPC Systems

    Meta (Menlo Park, CA)
    …fabric and host networking, comms lib and scheduling infrastructure. **Required Skills:** AI / HPC Systems Performance Engineer Responsibilities: 1. ... **Summary:** Meta's AI Training and Inference Infrastructure is growing exponentially...a loss-less fabric interconnect with minimal latency. To improve performance of these systems we constantly look…
    Meta (10/27/24)
  • AI / HPC Systems

    Meta (Menlo Park, CA)
    …fabric and host networking, comms lib and scheduling infrastructure. **Required Skills:** AI / HPC Systems Performance Engineer Responsibilities: 1. ... **Summary:** Meta's AI Training and Inference Infrastructure is growing exponentially...workloads that expects a loss-less fabric interconnect. To improve performance of these systems we constantly look…
    Meta (10/31/24)
  • Production Systems Engineer, Sustaining

    Meta (Menlo Park, CA)
    …hardware and software components, co-design 15. Experience in developing or debugging AI / HPC systems, performance optimizations, including familiarity ... or supporting production hardware at scale 9. Experience in deploying and productionizing AI / HPC systems and/or related components at scale 10. Experience in…
    Meta (10/24/24)
  • AI / HPC Network Engineer

    Meta (Menlo Park, CA)
    …of RDMA workloads that expects a loss-less fabric interconnect. To enhance the performance of these systems, we continuously seek opportunities for improvement ... host networking, communication libraries, and scheduling infrastructure. **Required Skills:** AI / HPC Network Engineer Responsibilities: 1. Design, develop, test…
    Meta (12/03/24)
  • Product Manager, AI / HPC

    Meta (Menlo Park, CA)
    Performance Compute organization, you will define and develop the compute, storage, and AI systems that are deployed across our global data center fleet and ... and requirements needed for future hardware platforms **Required Skills:** Product Manager, AI / HPC Responsibilities: 1. Establish a shared vision and strategy…
    Meta (12/05/24)
  • Research Scientist, Systems ML…

    Meta (Menlo Park, CA)
    …Meta and externally. ... Performance team is dedicated to maximizing training performance of Generative AI and recommendation models... **Required Skills:** Research Scientist, Systems ML and HPC - SW/HW Co-Design Responsibilities: 1. Apply High-Performance Computing (HPC) algorithms and techniques to optimize large-scale AI workloads 2. Analyze, benchmark, and optimize large-scale workloads…
    Meta (12/03/24)
  • Research Scientist Intern, Systems ML…

    Meta (Menlo Park, CA)
    …our research, visit https://ai.facebook.com. **Required Skills:** Research Scientist Intern, Systems ML and HPC - SW/HW Co-Design Responsibilities: 1. ... team's mission is to explore, develop and help productize high-performance software and hardware technologies for AI ...infrastructure. Meta is seeking Research Scientist Interns to join our AI & Systems Co-Design Training team to…
    Meta (01/07/25)
  • Sr. Software Development Engineer, HPC /ML…

    Amazon (Cupertino, CA)
    HPC network fabric or machine learning accelerator cluster systems. Also applicable is experience in high-frequency trading networking, high-speed wireless ... team focuses on building networking solutions for Machine Learning (ML) and High-Performance Computing (HPC) workloads on AWS. Working at Annapurna Labs…
    Amazon (12/20/24)
  • Technical Program Manager, AI Network Infra

    Meta (Menlo Park, CA)
    AI product introductions and AI operations initiatives supporting Meta's growing AI / HPC infrastructure for our Family of Apps. They will be responsible ... ground-breaking solutions & technologies. The ideal candidate will have experience in AI / HPC product development and operations, strong understanding of the…
    Meta (11/13/24)
  • Software Engineer, SystemML - AI Networking

    Meta (Menlo Park, CA)
    …following machine learning/deep learning domains: Distributed ML Training, GPU architecture, ML systems, AI infrastructure, high performance computing, ... large-scale GPU training and inference fleet through an observable, reliable and high-performance distributed AI/GPU communication stack. Currently, one of the…
    Meta (12/20/24)
  • Sr. Hardware Dev Engineer (AWS Generative…

    Amazon (Cupertino, CA)
    …operating AWS cloud offerings that enable high performance and scalability in AI/ML and HPC workloads. AWS Infrastructure Services owns the design, planning, ... Want to do industry leading work delivering continuous price performance improvements in the cloud for AI ...the current customer experience as well as developing improved systems for future designs. You will work directly with…
    Amazon (12/24/24)
  • GPU Cluster System Administrator

    Stanford University (Stanford, CA)
    …+ CUDA + System and storage benchmarking + DataDirect Networks (DDN) SFA high-performance storage systems **Working Conditions** This is a hybrid position, in ... projects. This new position will support Stanford's world-class data science and AI-focused research by managing and administering an NVIDIA DGX SuperPod instrument.…
    Stanford University (12/14/24)
  • Software Engineer, Systems ML - PyTorch…

    Meta (Menlo Park, CA)
    …computer science or related field. 8. Research or industry experience in compilers, ML systems, ML accelerators, HPC, GPU performance, and similar. 9. ... Our work is open-source, cutting-edge, and industry-leading. **Required Skills:** Software Engineer, Systems ML - PyTorch Compiler / ML Framework / Performance…
    Meta (11/07/24)
  • Sr. Software Development Engineer, ML…

    Amazon (Cupertino, CA)
    …are used to guarantee top performance of AWS ML and High Performance Computing (HPC) technologies developed by our organization. Bring your exceptional ... knowledge of CI/CD automation, ML and HPC benchmarks and applications to bear on the cutting-edge...Join us as we expand the AWS offerings for AI, including Trainium, Graviton and the Elastic Fabric Adapter…
    Amazon (11/16/24)
  • Hardware Systems Engineer, NPI

    Meta (Menlo Park, CA)
    …years of experience with one subset of the following AI systems: Accelerator (GPU/ASIC), Kernel development, Performance optimization (e.g., NVIDIA, AMD, ... productizing high-performance software and hardware technologies for AI at datacenter scale. Hardware Systems Engineer...Intel, or other misc accelerator), computer architecture, HPC communication libraries (e.g., NCCL, MPI), performance…
    Meta (10/18/24)
  • Senior Software Engineer- GPU

    Microsoft Corporation (Mountain View, CA)
    …the intersection of AI algorithmic innovation, purpose-built AI hardware, systems, and software. We own inference performance of OpenAI and other state ... and tools + Speeding up/reducing complexity of key components/pipelines to improve performance and/or efficiency of our systems + Communicate and collaborate…
    Microsoft Corporation (12/21/24)
  • Senior GenAI Specialist Solutions Architect,…

    Amazon (San Francisco, CA)
    …modernizing customer requirements to the cloud - Practical experience in High Performance Computing (HPC) and/or distributed training, performance profiling ... Description Are you passionate about Generative AI (GenAI)? Do you want to help define...services to power their businesses. We're continuously raising our performance bar as we strive to become Earth's Best…
    Amazon (11/16/24)
  • Sr WW Specialist Solutions Architect, Research…

    Amazon (San Francisco, CA)
    …experience - 5+ years building or optimizing computational applications for large scale HPC systems (e.g., physics based simulations) to take advantage of high ... of Go to Market (GTM) at AWS using generative AI (GenAI)? AWS Sales, Marketing, and Global Services (SMGS)...years building or optimizing computational applications for large scale HPC systems (e.g., physics based simulations) to…
    Amazon (11/20/24)
  • Hardware Engineer (Accelerator Design)

    Meta (Menlo Park, CA)
    …and high-speed interfaces. 8. 5+ years of experience with development and bring-up of high-performance custom systems or complex consumer devices. 9. 6+ years of ... for high-power, high-speed PCBs 20. Delivery of production-level PCBs for HPC, AI, or other demanding workloads **Public Compensation:** $170,000/year…
    Meta (11/26/24)
  • Research Engineer, Pytorch Distributed (PhD)

    Meta (Menlo Park, CA)
    …in Python, C++ or CUDA programming. 10. Research or industry experience in ML systems, ML accelerators, HPC, GPU performance, and similar. 11. Currently ... PyTorch. 14. Expert knowledge in GPU performance and writing high-performance communication libraries and fault-tolerant distributed systems. 15. Proven…
    Meta (10/24/24)