Course: CS 695 - Advanced Topics in AI Safety
Instructor: [Your Name]
Email: [your.email@university.edu]
Time: Tuesdays & Thursdays, 2:00-3:30 PM
Location: Room 305, Computer Science Building
Office Hours: Wednesdays 3:00-5:00 PM or by appointment
This seminar explores current research in AI safety, including alignment, robustness, interpretability, and governance. Students will present and discuss recent papers from leading conferences and journals. The course aims to provide a comprehensive understanding of the technical and conceptual challenges in ensuring AI systems are safe and beneficial.
Each class session will feature a student-led presentation of one or more research papers, followed by group discussion. Students are expected to complete the assigned readings before each session and participate actively in discussion.
| Date | Topic | Paper(s) | Presenter |
|---|---|---|---|
| Jan 14 | Introduction & Overview | Course overview and paper assignment | Instructor |
| Jan 16 | AI Alignment Fundamentals | Concrete Problems in AI Safety (Amodei et al., 2016) | Instructor |
| Jan 21 | Reward Modeling | Learning to Summarize with Human Feedback (Stiennon et al., 2020) | TBD |
| Jan 23 | Constitutional AI | Constitutional AI: Harmlessness from AI Feedback (Bai et al., 2022) | TBD |
| Jan 28 | Interpretability I | Zoom In: An Introduction to Circuits (Olah et al., 2020) | TBD |
| Jan 30 | Interpretability II | Towards Monosemanticity (Bricken et al., 2023) | TBD |
| Feb 4 | Robustness & Adversarial Examples | Adversarial Examples Are Not Bugs, They Are Features (Ilyas et al., 2019) | TBD |
| Feb 6 | Scalable Oversight | Supervising Strong Learners by Amplifying Weak Experts (Christiano et al., 2018) | TBD |
| TBD | Additional sessions to be scheduled | TBD | TBD |
A comprehensive reading list will be provided in the first class. Students are encouraged to suggest additional papers throughout the semester; suggestions from recent leading conferences and journals are especially welcome.
All work submitted must be original. Proper citation of sources is required for all assignments. Plagiarism will result in course failure.