AI Safety Seminar - Spring 2025

Course: CS 695 - Advanced Topics in AI Safety

Instructor: [Your Name]

Email: [your.email@university.edu]

Time: Tuesdays & Thursdays, 2:00-3:30 PM

Location: Room 305, Computer Science Building

Office Hours: Wednesdays 3:00-5:00 PM or by appointment

Course Description

This seminar explores current research in AI safety, including alignment, robustness, interpretability, and governance. Students will present and discuss recent papers from leading conferences and journals. The course aims to provide a comprehensive understanding of the technical and conceptual challenges in ensuring AI systems are safe and beneficial.

Format

Each class session will feature student-led presentations of research papers followed by group discussion. Students are expected to:

Schedule

Date | Topic | Paper(s) | Presenter
Jan 14 | Introduction & Overview | Course overview and paper assignment | Instructor
Jan 16 | AI Alignment Fundamentals | Concrete Problems in AI Safety (Amodei et al., 2016) | Instructor
Jan 21 | Reward Modeling | Learning to Summarize from Human Feedback (Stiennon et al., 2020) | TBD
Jan 23 | Constitutional AI | Constitutional AI: Harmlessness from AI Feedback (Bai et al., 2022) | TBD
Jan 28 | Interpretability I | Zoom In: An Introduction to Circuits (Olah et al., 2020) | TBD
Jan 30 | Interpretability II | Towards Monosemanticity (Bricken et al., 2023) | TBD
Feb 4 | Robustness & Adversarial Examples | Adversarial Examples Are Not Bugs, They Are Features (Ilyas et al., 2019) | TBD
Feb 6 | Scalable Oversight | Supervising Strong Learners by Amplifying Weak Experts (Christiano et al., 2018) | TBD
... Additional sessions to be scheduled ...

Note: Paper assignments will be finalized during the first class. Students should be prepared to select their presentation dates and papers.

Grading

Resources

Reading List

A comprehensive reading list will be provided in the first class. Students are encouraged to suggest additional papers throughout the semester. Priority will be given to:

Academic Integrity

All work submitted must be original. Proper citation of sources is required for all assignments. Plagiarism will result in course failure.