Senior Applied Researcher, Machine Learning
3 months ago
Over the last two decades, GPU acceleration has become the standard for scientific computing, and the fastest supercomputers in the world combine CPUs and GPUs. The emergence of Generative AI has led to the largest GPU datacenters ever built, which are also the most powerful supercomputers to date. These datacenters are sophisticated and thus challenging to operate. Yet, much as the autonomous vehicle industry automates driving, we strive to automate their operation as much as possible. To that end, we are building an outstanding data science team to tackle these challenges.
The emerging field of Artificial Intelligence for IT Operations (AIOps) strives to apply AI to the big data generated by IT operations processes, particularly in cloud infrastructures, to provide practical insights with the primary goal of improving availability. There is a wide variety of problems to address, and multiple use cases where AI capabilities can improve operational efficiency. We categorize the key AIOps tasks as incident detection (e.g., via anomaly detection), failure prediction, root cause analysis, and automated actions. We are also working on optimization problems to increase not only the resiliency but also the performance of the datacenter.
What you'll be doing:
- Explore high-level, undefined ideas and solve real-life problems using structured and unstructured data.
- Craft proofs of concept, rooted in first principles, that apply modern data science techniques to operations use cases.
- Collaborate in a multi-disciplinary environment with domain experts in various fields such as networking, high-performance computing for AI, telemetry, etc.
- Develop a strategic vision for NVIDIA networking together with adjacent architects and research groups.
- Define the data pipelines and ML architecture for SaaS that handles hyperscale data problems.
- Support software developers in migrating prototypes to end-to-end pipelines that are suitable for deployment in production environments.
What we need to see:
- M.Sc. or Ph.D. in Science or Engineering
- 12+ years of relevant experience
- Proven excellence and industry experience in data science or machine learning, with a variety of ML/DL algorithms and their applications
- Consistent record of staying ahead of the technology envelope: understanding pioneering research and exploring new technologies to develop practical applications and generate innovative ideas.
- Great motivation, with strong interpersonal skills and the ability to communicate highly technical concepts to non-technical audiences
- "Can-do" attitude: ability to succeed in ambiguous settings where part of the challenge is to define the problem.
- Strong programming skills in Python (including unit tests, CI/CD, etc.), as well as comfort using Linux and typical development tools (e.g., GitHub, Docker)
- Experience in large scale data systems (on-prem and/or cloud).
- Proficiency in deep learning frameworks.
Ways to stand out from the crowd:
- Past senior technical roles such as principal data scientist, team leader, tech lead, or head of ML in a startup.
- Publications in peer-reviewed journals or conferences.
- Previous real-world experience developing models for anomaly detection, predictive forecasting, and root cause analysis use cases.
- Experience in developing and deploying ML pipelines at large scale (TB+).
- Experience beyond supervised learning: optimization using reinforcement learning and adaptive experimentation.
- Experience with the ML deployment lifecycle, including model monitoring and retraining.
- Experience with networking, cloud, datacenter, and edge computing technologies.
NVIDIA is committed to encouraging a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.