About Us
🌱 At .omics, we’re building foundation models for plant biology — turning genomic and multi‑omics data into the next generation of tools for trait discovery and predictive breeding. Our goal is to develop crops that can better withstand pests, viruses, and climate stress, helping agriculture adapt to the challenges of a changing environment.
We operate a large internal GPU cluster dedicated to training and serving our models, giving the team the compute needed to iterate quickly and push the limits of model scale and complexity.
Position Overview
We’re looking for a founding ML Ops Engineer to make sure our models don’t just get built but also run efficiently at scale. You’ll design and maintain the infrastructure for training, deploying, and monitoring large AI models; optimize workflows across GPUs and cloud environments; and ensure the reproducibility and scalability that let our research translate into real-world applications in trait discovery and breeding.
The ideal candidate will be passionate about creating scalable, efficient, and reliable machine learning pipelines and infrastructure, empowering the research team to develop cutting-edge AI models that drive agricultural innovation.
You will be expected to:
- Build and manage end-to-end machine learning pipelines, from data collection and model training to deployment and monitoring.
- Collaborate with AI researchers and data scientists to streamline experimentation, testing, and validation of AI models.
- Operationalize AI models into production systems with a focus on performance, scalability, and reliability.
Key Responsibilities
- Develop, optimize, and maintain scalable and reproducible ML pipelines for model training and deployment.
- Collaborate with AI researchers to support the development of our in-house models and their deployment on platforms such as Hugging Face and Azure AI Foundry.
- Design, manage, and optimize infrastructure for large‑scale training and inference on our GPU cluster, including job scheduling, resource utilization, and observability.
- Build benchmarking systems to measure model performance and optimize transfer learning and generalization across diverse biological datasets.
- Design and manage infrastructure for handling large‑scale biological datasets, ensuring seamless integration and processing for ML tasks.
- Define infrastructure and data requirements for lab‑generated datasets to ensure they are ML‑ready.
- Ensure optimal use of cloud and/or HPC infrastructure to support both research and production workloads.