Job Description
We are looking for an experienced L2 Engineer to operate and support high-performance AI infrastructure platforms, including NVIDIA GPU clusters, InfiniBand fabrics, and Kubernetes-based IaaS environments.
This role focuses on deep infrastructure expertise, ensuring performance, scalability, and reliability of the platform layer that powers AI workloads — without being responsible for the workloads themselves.
You will play a key role in bare-metal lifecycle management, advanced InfiniBand troubleshooting, and platform stability, working closely with engineering teams to operate cutting-edge infrastructure at scale.
Key responsibilities:
- Troubleshoot and maintain InfiniBand fabrics, including performance tuning, link-issue diagnosis, and topology validation.
- Act as the escalation point for L1 engineers on complex infrastructure and hardware issues.
- Own and maintain accurate infrastructure modeling, IPAM, and source...