About Us
At .omics, we build foundation models for plant biology — turning genomic and multi-omics data into tools for trait discovery and predictive breeding. We run a large internal GPU cluster for training and serving our models.
The Role
We're hiring a founding MLOps Engineer to make sure our models run efficiently at scale. You'll own the infrastructure for training, deploying, and monitoring large AI models: optimizing workflows across GPUs and the cloud, and ensuring reproducibility so research translates into real-world impact.
What You'll Do
- Build and maintain end-to-end ML pipelines (data → training → deployment → monitoring).
- Manage and optimize GPU cluster infrastructure: job scheduling (SLURM), resource utilization, observability.
- Identify and resolve bottlenecks across the training stack — including data loading, I/O, and compute utilization — to ensure infrastructure never limits model development.
- Improve SLURM workflows to maximize cluster efficiency, including running evaluation and benchmarking jobs alongside training runs.
- Implement efficient hyperparameter tuning workflows designed to scale with model size.
- Support model deployment on platforms like Hugging Face and Azure AI Foundry.
- Build benchmarking systems to evaluate model performance across diverse biological datasets.
- Define infrastructure and data requirements so lab-generated datasets are ML-ready.
- Continuously improve CI/CD, reproducibility, monitoring, and deployment practices.
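To give a flavor of the SLURM workflows this role owns, here is a minimal sketch of a multi-GPU training job. Everything in it (partition name, job script, `train.py` and its flags) is a placeholder for illustration, not our actual configuration:

```shell
#!/bin/bash
#SBATCH --job-name=train-foundation-model   # placeholder job name
#SBATCH --partition=gpu                     # assumed GPU partition name
#SBATCH --gres=gpu:4                        # request 4 GPUs on one node
#SBATCH --cpus-per-task=16                  # CPU cores for data loading
#SBATCH --time=48:00:00                     # wall-clock limit
#SBATCH --output=logs/%x-%j.out             # per-job log: job name + job ID

# Launch distributed training across the allocated GPUs.
# train.py and its config path are illustrative only.
srun torchrun --nproc_per_node=4 train.py --config configs/base.yaml
```

In practice, the role also covers packing evaluation and benchmarking jobs around runs like this one so that GPU utilization stays high.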
What We're Looking For
Must have:
- Experience with ML pipelines and infrastructure on GPU-heavy workloads
- Familiarity with MLOps tooling (MLflow, Docker, Kubernetes, Airflow/Kubeflow)
- Strong Python/Bash and software engineering skills
- Comfortable in cloud and/or HPC environments