Course: CS 695 - Advanced Topics in AI Safety
Instructor: [Your Name]
Email: [your.email@university.edu]
Time: Tuesdays & Thursdays, 2:00-3:30 PM
Location: Room 305, Computer Science Building
Office Hours: Wednesdays 3:00-5:00 PM or by appointment
This seminar explores current research in AI safety, including alignment, robustness, interpretability, and governance. Students will present and discuss recent papers from leading conferences and journals. The course aims to provide a comprehensive understanding of the technical and conceptual challenges in ensuring AI systems are safe and beneficial.
Each class session will feature a student-led presentation of one or more research papers, followed by group discussion. Students are expected to complete the assigned readings before each session and participate actively in discussion.
| Date | Topic | Paper(s) | Presenter |
|---|---|---|---|
| Jan 14 | Introduction & Overview | Course overview and paper assignment | Instructor |
| Jan 16 | AI Alignment Fundamentals | Concrete Problems in AI Safety (Amodei et al., 2016) | Instructor |
| Jan 21 | Reward Modeling | Learning to Summarize with Human Feedback (Stiennon et al., 2020) | TBD |
| Jan 23 | Constitutional AI | Constitutional AI: Harmlessness from AI Feedback (Bai et al., 2022) | TBD |
| Jan 28 | Interpretability I | Zoom In: An Introduction to Circuits (Olah et al., 2020) | TBD |
| Jan 30 | Interpretability II | Towards Monosemanticity (Bricken et al., 2023) | TBD |
| Feb 4 | Robustness & Adversarial Examples | Adversarial Examples Are Not Bugs, They Are Features (Ilyas et al., 2019) | TBD |
| Feb 6 | Scalable Oversight | Supervising Strong Learners by Amplifying Weak Experts (Christiano et al., 2018) | TBD |
| TBD | Additional sessions to be scheduled | TBD | TBD |
A comprehensive reading list will be provided in the first class. Students are encouraged to suggest additional papers throughout the semester; suggestions from recent leading conferences and journals are especially welcome.
All work submitted must be original. Proper citation of sources is required for all assignments. Plagiarism will result in course failure.