ML Safety Research Engineer
Apple • San Francisco, CA
January 6, 2026
About the Role
Join Apple Services Engineering to lead the design and continuous development of automated safety benchmarking methodologies for AI/ML models. This role focuses on investigating how media-related agents behave, developing rigorous evaluation frameworks, and establishing scientific standards for assessing risks and safety performance. You'll work on scalable evaluation techniques that ensure Apple's AI tools and models across App Store, Music, Video, and more are safe, reliable, and aligned with human expectations. You'll build capabilities for generating benchmark datasets and evaluation methodologies at scale, collaborating with cross-functional teams including Engineering, Product, and Governance.
Responsibilities
- Design scientifically grounded benchmarking methodologies covering multiple dimensions of responsibility and safety across several media and application marketplace use cases.
- Develop automated evaluation pipelines that collect, judge, and analyze model outputs against safety policies at scale.
- Create and curate datasets, tasks, and feature usage scenarios that represent realistic and adversarial use cases across multiple languages, markets, and domains.
- Define and validate new metrics for complex phenomena such as multi-turn agentic interaction patterns.
- Apply statistical rigor and reproducibility to evaluation methodologies.
- Work closely with engineering and research teams to translate experimental findings into actionable model improvements and safety mitigations.
- Publish internal reports and external papers.
- Monitor evolving industry practices and academic work to ensure benchmarks remain relevant.
Requirements
- Advanced degree (MS or PhD) in Computer Science, Software Engineering, or equivalent research/work experience.
- 1+ years of work experience, either as a postdoc or in industry.
- Strong research background in empirical evaluation, experimental design, or benchmarking.
- Strong proficiency in Python (pandas, NumPy, Jupyter, PyTorch, etc.).
- Deep familiarity with software engineering workflows and developer tools.
- Experience working with or evaluating AI/ML models, preferably LLMs or program synthesis systems.
- Strong analytical and communication skills, including the ability to write clear reports.
- Experience working with large datasets, annotation tools, and model evaluation pipelines.
- Familiarity with evaluations specific to responsible AI and safety, hallucination detection, and/or model alignment concerns.
- Ability to design taxonomies, categorization schemes, and structured labeling frameworks.
- Ability to interpret unstructured data (text, transcripts, user sessions) and derive meaningful insights.
- Education in Data Science, Linguistics, Cognitive Science, HCI, Psychology, Social Science, or a related field.