Introduction to AI Safety
Aryeh L. Englander
AMDS / A4I
Overview
What do we mean by Technical AI Safety?
• Critical systems: systems whose failure may lead to injury or loss of life, damage to the
environment, unauthorized disclosure of information, or serious financial losses
• Safety-critical systems: systems whose failure may result in injury, loss of life, or
serious environmental damage
• Technical AI safety: designing safety-critical AI systems (and more broadly, critical AI
systems) in ways that guard against accident risks – i.e., harms arising from AI systems
behaving in unintended ways
Sources:
- Ian Sommerville, supplement to Software Engineering (10th edition)
- Remco Zwetsloot and Allan Dafoe, “Thinking About Risks From AI:
Accidents, Misuse and Structure”
Other related concerns
• Security against exploits by adversaries
- Often considered part of AI Safety
• Misuse from people using AI in unethical or
malicious ways
- Ex: deepfakes, terrorism, suppression of dissent
• Machine ethics
- Designing AI systems to make ethical decisions
- Debate over lethal autonomous weapons
• Structural risks from AI shaping the
environment in subtle ways
- Ex: job loss, increased risks of arms races
• Governance, strategy, and policy
- Should government regulate AI?
- Who should be held accountable?
- How do we coordinate with other governments and
stakeholders to prevent risks?
• AI forecasting and risk analysis
- When are these concerns likely to materialize?
- How concerned should we be?
Adversarial examples: fooling AI into thinking a stop sign is a 45 mph sign (image source)
Potential terrorist use of lethal fully autonomous drones (image source)
Jobs at risk of automation by AI (image source, based on a report from the OECD)
AI Safety research communities
• Two related research communities: AI Safety, Assured Autonomy
• AI Safety
- Focus on long-term risks from roughly human-level AI or beyond
- Also focused on near-term concerns that may scale up / provide insight into long-term issues
- Relatively new field – past 10 years or so
- Becoming progressively more mainstream
◦ Many leading AI researchers have expressed strong support for the research
◦ AI Safety research groups set up at several major universities and AI companies
• Assured Autonomy
- Older, established community with broader focus on assuring autonomous systems in general
- Recently started looking at challenges posed by machine learning
- Current and near-term focus
• In the past year both communities have finally started trying to collaborate and work out a
shared research landscape and vision
• APL’s focus: near- and mid-term concerns, but it would be nice if our research also scales up
to longer-term concerns
AI Safety: Lots of ways to frame conceptually
• Many different ways to divide up the problem space, and many different research
agendas from different organizations
• It can get pretty complicated
AI Safety Landscape overview from the Future of Life Institute (FLI)
Connections between different research agendas
(Source: Everitt et al, AGI Safety Literature Review)
AI Safety: DeepMind’s conceptual framework
Source: DeepMind Safety Research Blog
Assured Autonomy: AAIP conceptual framework
Source: Ashmore et al., Assuring the Machine Learning Lifecycle
AAIP = Assuring Autonomy International Programme (University of York)
Combined framework
• This is the proposed framework for
combining AI Safety and Assured Autonomy
research communities
• Also tries to address relevant topics from the
AI Ethics, Security and Privacy communities
• Until now these communities haven’t been
talking to each other as much as they
should
• Still in development; AAAI 2020 has a full-
day workshop on this
• Personal opinion: I like that it’s general, but I
think it’s a bit too general – best used only
for very abstract overviews of the field
= focus of AI Safety / DeepMind framework
= focus of Assured Autonomy / AAIP framework
My personal preference
• Problems that scale up to long term: DeepMind framework
• Near-term machine learning: AAIP framework
• Everything else: Combined framework
AI safety concerns and APL’s mission areas
• All of APL’s mission areas involve safety- or mission-critical systems
• The military is concerned with assurance rather than safety (obviously, military systems
are unsafe for the enemy), but the two concepts are very similar and involve similar
problems and solutions
• The government is very aware of these problems, and this is part of why the military has
been reluctant to adopt AI technologies
- Recent report from the Defense Innovation Board: primary document, supporting document
- Congressional Report on AI and National Security
- DARPA: Assured Autonomy program, Explainable AI program
• If we want to get the military to adopt the AI technologies we develop here, those
technologies will need to be assured and secure
Technical AI Safety
Specification problems
• These problems arise when there is a gap (often very subtle
and unnoticed) between what we really want and what the
system is actually optimizing for
• Powerful optimizers can find surprising and sometimes
undesirable solutions for objectives that are even subtly
mis-specified
• Often extremely difficult or impossible to fully specify
everything we really want
• Some examples:
- Specification gaming
- Avoiding side effects
- Unintended emergent behaviors
- Bugs and errors
Specification: Specification Gaming
• Agent exploits a flaw in the specification
• Powerful optimizers can find extremely
novel and potentially harmful solutions
• Example: evolved radio
• Example: Coast Runners
• There are many other similar examples (a minimal illustrative sketch follows below)
The evolvable motherboard that led to the evolved radio
A reinforcement learning agent discovers an unintended strategy
for achieving a higher score
(Source: OpenAI, Faulty Reward Functions in the Wild)
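To make specification gaming concrete, here is a minimal, purely illustrative Python sketch (the policy names, rewards, and horizon are invented for this example, not taken from the evolved radio or Coast Runners cases): an optimizer given the proxy objective "maximize points" prefers endlessly looping on a respawning bonus over finishing the course, even though finishing is what the designer intended.

```python
# Hypothetical sketch of specification gaming: the designer intends "finish the
# course", but the reward that was actually specified is "points collected".

def episode_return(policy: str, horizon: int = 100) -> float:
    """Total reward over `horizon` steps for two simple policies."""
    if policy == "finish_course":        # what the designer wanted
        return 10.0                      # one-time reward for crossing the finish line
    elif policy == "loop_on_bonus":      # what the specified reward actually favors
        return 1.0 * horizon             # respawning bonus collected every step
    raise ValueError(policy)

best = max(["finish_course", "loop_on_bonus"], key=episode_return)
print(best)  # -> 'loop_on_bonus': the optimizer "games" the mis-specified reward
```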
Specification: Specification Gaming (cont.)
• Can be a problem for classifiers as well:
The loss function (“reward”) might not
really be what we care about, and we
may not discover the discrepancy until
later
• Example: Bias
- We care about the difference between humans and animals more than between breeds of dogs, but the loss function optimizes for all classes equally
- We only discovered this problem after it
caused major issues
• Example: Adversarial examples
- Deep Learning (DL) systems discovered weird
correlations that humans never thought to look
for, so predictions don’t match what we really
care about
- We only discovered this problem well after the
systems were in use
Google images misidentified black people as gorillas
(source)
Blank labels can make DL systems misidentify stop signs as
Speed Limit 45 MPH signs
(source)
Specification: Avoiding side effects
• What we really want: achieve goals
subject to common sense constraints
• But current systems do not have anything
like human common sense
• In any case, the system would not by default constrain itself unless specifically programmed to do so
• Problem likely to get much more difficult
going forward:
- Increasingly complex, hard-to-predict
environments
- Increasing number of possible side effects
- Increasingly difficult to think of all those side
effects in advance
Two side effect scenarios
(source: DeepMind Safety Research blog)
Specification: Avoiding side effects (cont.)
• Standard TEV&V approach: brainstorm
with experts "what could possibly go
wrong?"
• In complex environments it might not be
possible to think about all the things that
could go wrong beforehand (unknown
unknowns) until it's too late
• Is there a general method we can use to
guard against even unknown unknowns?
• Ideas in this category (a minimal sketch of the first idea follows below)
- Penalize changing the environment (example)
- Agent learns constraints by observing humans
(example)
Get from point A to point B – but don’t knock over the vase!
Can we think of all possible side effects like this in advance?
(image source)
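As a rough illustration of the first idea above, here is a minimal sketch of shaping the reward with a penalty for changing the environment relative to a baseline state; the state representation, penalty weight, and reward values are assumptions for this example.

```python
# Penalize changes to the environment relative to a baseline ("inaction") state,
# so that "break the vase" becomes costly even though the task reward never
# mentions vases. All names and numbers here are illustrative.

def impact_penalty(state: dict, baseline: dict) -> int:
    """Count environment features that differ from the baseline state."""
    return sum(1 for k in baseline if state.get(k) != baseline[k])

def shaped_reward(task_reward: float, state: dict, baseline: dict, beta: float = 5.0) -> float:
    """Task reward minus a penalty proportional to how much the agent changed the world."""
    return task_reward - beta * impact_penalty(state, baseline)

baseline = {"vase": "intact", "robot_at_B": False}
# Reaching B by knocking over the vase vs. going around it:
print(shaped_reward(10.0, {"vase": "broken", "robot_at_B": True}, baseline))  # 10 - 2*5 = 0.0
print(shaped_reward(10.0, {"vase": "intact", "robot_at_B": True}, baseline))  # 10 - 1*5 = 5.0
```

Note that a naive penalty like this also punishes changes the task requires (moving to B at all), which is one reason the cited research explores relative reachability and related impact measures.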
Specification: Other problems
OpenAI’s hide and seek AI agents demonstrated
surprising emergent behaviors (source)
(image source)
• Emergent behaviors
- E.g., multi-agent systems, human-AI teams
- Makes behavior much more difficult to predict and verify, which makes many of the above problems worse
• Bugs and errors
- Can be even harder to find and correct logic
errors in complex ML systems (especially Deep
Learning) than in regular software systems
- (See later on TEV&V)
Robustness problems
• How to ensure that the system continues to operate within
safe limits upon perturbation
• Some examples:
- Distributional shift / generalization
- Safe exploration
- Security
Robustness: Distributional shift / generalization
• How do we get a system trained on one distribution to perform well and safely if it
encounters a different distribution after deployment?
• Especially, how do we get the system to proceed more carefully when it encounters
safety-critical situations that it did not encounter during training?
• Generalization is a well-known problem in ML, but more work needs to be done
• Some approaches (a simple confidence-threshold sketch follows below):
- Cautious generalization
- “Knows what it knows”
- Expanding on anomaly detection techniques
(image source)
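As a rough illustration of the anomaly-detection / "knows what it knows" direction, here is a minimal sketch of confidence-based abstention; the `model` interface (scikit-learn-style `predict_proba`) and the threshold value are assumptions for illustration.

```python
# Flag inputs where the classifier's predictive confidence is low and hand them
# off to a fallback (e.g., a human or a conservative default) instead of acting.
import numpy as np

def predict_or_abstain(model, x: np.ndarray, threshold: float = 0.9):
    """Return (label, confidence), or ('ABSTAIN', confidence) for low-confidence inputs."""
    probs = model.predict_proba(x[None, :])[0]   # scikit-learn-style interface (assumed)
    conf = float(np.max(probs))
    if conf < threshold:
        return "ABSTAIN", conf                   # treat as possibly out-of-distribution
    return int(np.argmax(probs)), conf
```

Softmax confidence on its own is known to be a weak out-of-distribution signal, which is why dedicated detectors and calibration methods remain an active research area.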
Robustness: Safe exploration
• If an RL agent uses online learning or needs to train in a real-world environment, then the
exploration itself needs to be safe
• Example: A self-driving car can't learn by experimenting with swerving onto sidewalks
• Restricting learning to a controlled, safe environment might not provide sufficient training for some applications (a minimal sketch of constrained exploration follows below)
How do we tell a cleaning robot not to experiment with sticking wet
brooms into sockets during training?
(image source)
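One simple way to frame safe exploration is to restrict exploration to actions that pass an explicit safety check. The sketch below is illustrative only; the action names and the hand-written `is_safe` predicate are assumptions.

```python
# Epsilon-greedy exploration restricted to a safe action set: random exploration
# during training never tries "swerve onto the sidewalk".
import random

ACTIONS = ["stay_in_lane", "slow_down", "change_lane", "swerve_onto_sidewalk"]

def is_safe(action: str, state: dict) -> bool:
    """Conservative, human-specified constraint applied during training."""
    return action != "swerve_onto_sidewalk"

def explore(state: dict, epsilon: float, greedy_action: str) -> str:
    """Pick an exploratory or greedy action, but only from the safe set."""
    safe_actions = [a for a in ACTIONS if is_safe(a, state)]
    if random.random() < epsilon:
        return random.choice(safe_actions)
    return greedy_action if is_safe(greedy_action, state) else random.choice(safe_actions)
```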
Robustness: Security
• (Security is sometimes considered part of safety / assurance, and sometimes separate)
• ML systems pose unique security challenges
• Data poisoning: Adversaries can corrupt the training data, leading to undesirable results
• Adversarial examples: Adversaries can craft small input perturbations that fool ML systems (see the sketch below)
• Privacy and classified information: By probing ML systems, adversaries may be able
to uncover private or classified information that was used during training
What if an adversary fools an AI into
thinking a school bus is a tank?
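For concreteness, here is a minimal PyTorch-based sketch of the fast gradient sign method (FGSM), a standard way adversarial examples are generated; `model`, `x`, and `y` are assumed to be a trained classifier and a correctly labeled input batch.

```python
# FGSM: nudge each pixel slightly in the direction that increases the
# classifier's loss, producing an input that looks unchanged to a human but is
# misclassified by the model.
import torch
import torch.nn.functional as F

def fgsm_example(model, x, y, epsilon: float = 0.03):
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()                                   # gradient of the loss w.r.t. the input
    perturbed = x_adv + epsilon * x_adv.grad.sign()   # small, bounded perturbation
    return perturbed.clamp(0.0, 1.0).detach()         # keep pixel values in a valid range
```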
Monitoring and Control
• (DeepMind calls this Assurance, but that’s confusing since we’ve also been discussing Assured Autonomy)
• Interpretability: Many ML systems (esp. DL) are mostly black boxes
• Scalable oversight: It can be very difficult to provide oversight of increasingly autonomous and complex agents
• Human override: We need to be able to shut down the system if needed (a minimal override-wrapper sketch follows below)
- Building in mechanisms to do this is often difficult
- If the operator is part of the environment that the system learns about, the AI could conceivably learn policies that try to avoid the human shutting it down
◦ “You can't get the cup of coffee if you're dead"
◦ Example: robot blocks camera to avoid being shut off
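As a minimal illustration of a human-override mechanism (not a solution to the deeper interruptibility problem described above), here is a sketch of a policy wrapper that defers to an operator-controlled stop flag; the policy interface and safe default action are assumptions.

```python
# Wrap a learned policy so that an operator-controlled stop flag always takes
# precedence over the policy's chosen action.
class OverridablePolicy:
    def __init__(self, policy, safe_action="no_op"):
        self.policy = policy            # underlying learned policy: state -> action
        self.safe_action = safe_action  # conservative fallback action
        self.stopped = False            # set by the human operator

    def stop(self):
        self.stopped = True

    def act(self, state):
        if self.stopped:
            return self.safe_action     # human override takes precedence
        return self.policy(state)
```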
Scaling up testing, evaluation, verification, and validation
• The extremely complex, mostly black-box models learned by powerful Deep Learning systems make it difficult or impossible to scale up existing TEV&V techniques
• Hard to do enough testing or evaluation when the space of possible unusual inputs or situations is huge (see the metamorphic-testing sketch below)
• Most existing TEV&V techniques need to specify exactly what the boundaries are that we
care about, which can be difficult or intractable
• Often can only be verified in relatively simple constrained environments – doesn’t scale
up well to more complex environments
• Especially difficult to use standard TEV&V techniques for systems that continue to learn
after deployment (online learning)
• Also difficult to use TEV&V for multi-agent or human-machine teaming environments due
to possible emergent behaviors
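One partial workaround is metamorphic testing: instead of specifying exact expected outputs, test that predictions stay stable under transformations that should not change the true label. The sketch below is illustrative; the `model.predict` interface, the brightness transformation, and the tolerance are assumptions.

```python
# Metamorphic test: a classifier's predictions should not change under a small
# brightness shift that leaves the true labels unchanged.
import numpy as np

def brightness_invariance_violations(model, images: np.ndarray, delta: float = 0.05) -> int:
    """Count inputs whose predicted label changes after a small brightness shift."""
    base_preds = model.predict(images)
    shifted = np.clip(images + delta, 0.0, 1.0)
    shifted_preds = model.predict(shifted)
    return int(np.sum(base_preds != shifted_preds))

# A test might assert that violations stay below some agreed-upon rate:
# assert brightness_invariance_violations(model, test_images) / len(test_images) < 0.01
```

Metamorphic relations sidestep the need for exact expected outputs, which is one of the bottlenecks listed above, although they only cover the properties someone thinks to encode.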
Theoretical issues
• A lot of decision theory and game theory
breaks down if the agent is itself part of
the environment that it's learning about
• Reasoning correctly about powerful ML
systems might become very difficult and
lead to mistaken assumptions with
potentially dangerous consequences
• Especially difficult to model and predict
the actions of agents that can modify
themselves in some way or create other
agents
Embedding agents in the environment can lead to a host of theoretical problems
(source: MIRI Embedded Agency sequence)
Human-AI teaming
• Understanding the boundaries - often even the system designers don't really understand
where the system does or doesn't work
• Example: Researchers didn’t discover the problem of adversarial examples until well after
the systems were already in use; it took several more years to understand the causes of
the problem (and it’s still debated)
• Humans (even the designers) sometimes anthropomorphize too much and therefore use
faulty “machine theories of mind” – current ML systems do not process data and
information in the same way humans do
• Can lead to people trusting AI systems in unsafe situations
Systems engineering and best practices
• Careful design with safety / assurance issues in
mind from the start
• Getting people to incorporate the best technical
solutions and TEV&V tools
• Systems engineering perspective would likely be
very helpful, but further work is needed to adapt
systems / software engineering approaches to AI
• Training people not to use AI systems beyond what they're good for
• Being aware of the dual use nature of AI and
developing / implementing best practices to
prevent malicious use (a different issue from
what we’ve been discussing)
- Examples: deepfakes, terrorist use of drones, AI-
powered cyber attacks, use by oppressive regimes
- Possibly borrowing techniques and practices from
other dual-use technologies, such as cybersecurity
(image source)
(image source)
Assuring the Machine Learning Lifecycle
Data management
Model learning
Model verification
Model deployment
Final notes
• Some of these areas have received a significant amount of attention and research (e.g.,
adversarial examples, generalizability, safe exploration, interpretability), others not quite
as much (e.g., avoiding side effects, reward hacking, verification & validation)
• It's generally believed that if early programming languages such as C had been designed
from the ground up with security in mind, then computer security today would be in a
much stronger position
• We are mostly still in the early days of the most recent batch of powerful ML techniques
(mostly Deep Learning); we should probably build in safety / assurance and security from
the ground up
• Again, the military knows all this; if we want the military to adopt the AI technologies that
we develop here, those technologies will need to be assured and secure
Research groups outside APL (partial list)
• Technical AI Safety
- DeepMind safety research (two teams – AI Safety team, Robust & Verified Deep Learning team)
- OpenAI safety team (no particular team website – core part of their mission)
- Machine Intelligence Research Institute (MIRI)
- Stanford AI Safety research group
- Center for Human-Compatible AI (CHAI, UC Berkeley)
• Assured Autonomy
- Institute for Assured Autonomy (IAA, partnership between Johns Hopkins University and APL)
- Assuring Autonomy International Programme (University of York)
- University of Pennsylvania Assured Autonomy research group
- University of Waterloo AssuredAI project
• AI Safety Risks – Strategy, Policy, Analysis
- Future of Life Institute (MIT)
- Future of Humanity Institute (University of Oxford)
- Center for the Study of Existential Risk (CSER, University of Cambridge)
- Center for Security and Emerging Technology (CSET, Georgetown University)
• Many of these organizations are closely tied to the Effective Altruism movement
Primary reading
• Technical AI Safety
- Amodei et al, Concrete Problems in AI Safety (2016) – still probably the best technical introduction
- Alignment Newsletter – excellent coverage of related research
◦ Podcast version
◦ Database of all links from previous newsletters, arranged by topic – covers almost all major papers related to the field from the past year or two
- DeepMind’s Safety Research blog
- Informal document from Jacob Steinhardt (UC Berkeley) - overview of several current research directions
• Assured Autonomy: Ashmore et al, Assuring the Machine Learning Lifecycle (2019)
• Longer-term concerns
- Stuart Russell, Human Compatible: Artificial Intelligence and the Problem of Control (2019)
- Nick Bostrom, Superintelligence: Paths, Dangers, Strategies (2014)
◦ Excellent series of posts summarizing each chapter and providing additional notes
- [Tom Chivers, The AI Does Not Hate You: Superintelligence, Rationality and the Race to Save the World
(2019) – lighter overview of the subject from a journalist; includes a good history of the AI Safety
movement and other closely related groups]
Partial bibliography: General / Literature Reviews
• Saria et al (JHU), Tutorial on safe and reliable ML (2019); video, slides, references
• Richard Mallah (Future of Life Institute), “The Landscape of AI Safety and Beneficence
Research,” 2017
• Hernandez-Orallo et al, Surveying Safety-relevant AI Characteristics (2019)
• Rohin Shah (UC Berkeley), An overview of technical AGI alignment (podcast episode with
transcript, 2019) – part 1, part 2, related video lecture
• Everitt et al, AGI Safety literature review (2018)
• Paul Christiano, AI alignment landscape (2019 blog post)
• Andrew Critch and Stuart Russell, detailed syllabus with links from a fall 2018 AGI Safety
course at UC Berkeley
• Joel Lehman (Uber), Evolutionary Computation and AI Safety: Research Problems
Impeding Routine and Safe Real-world Application of Evolution (2019)
• Victoria Krakovna, AI safety resources list
Partial bibliography: Technical AI Safety literature
• AI Alignment Forum, including several good curated post sequences
• Paul Christiano, Directions and desiderata for AI alignment (2017 blog post)
• Rohin Shah (UC Berkeley), Value Learning sequence (2018) – gives a thorough introduction to the
problem and explains some of the most promising approaches
• Leike et al (DeepMind), Reward Modeling (2018); associated blog post
• Dylan Hadfield-Menell (UC Berkeley), Cooperative Inverse Reinforcement Learning (2016); associated
podcast episode; also see this video lecture
• Dylan Hadfield-Menell (UC Berkeley), Inverse Reward Design (2017)
• Christiano et al (OpenAI), Iterative Amplification (2018); associated blog post; Iterative Amplification
sequence on the Alignment Forum
• Irving et al (OpenAI), Value alignment via debate (2018); associated blog post, podcast episode
• Christiano et al (OpenAI, DeepMind), Deep reinforcement learning from human preferences (2017)
• Andreas Stuhlmüller (Ought), Factored Cognition (2018 blog post)
• Stuart Armstrong (MIRI / FHI), Research Agenda v0.9: Synthesizing a human's preferences into a utility
function (2019 blog post)
Partial bibliography: Assured Autonomy literature
• University of York, Assuring Autonomy Body of Knowledge (in development)
• Assuring Autonomy International Programme, list of research papers
• Sandeep Neema (DARPA), Assured Autonomy presentation (2019)
• Schwarting et al (MIT, Delft University), Planning and Decision-Making for Autonomous Vehicles (2018)
• Kuwajima et al, Open Problems in Engineering Machine Learning Systems and the Quality Model (2019)
• Calinescu et al (University of York), Socio-Cyber-Physical Systems: Models, Opportunities, Open
Challenges (2019) – focuses on the human component of human-machine teaming
• Salay et al (University of Waterloo), Using Machine Learning Safely in Automotive Software (2018)
• Czarnecki et al (University of Waterloo), Towards a Framework to Manage Perceptual Uncertainty for
Safe Automated Driving (2018)
• Calinescu et al (University of York), Engineering Trustworthy Self-Adaptive Software with Dynamic
Assurance Cases (2017)
• Lee et al (University of Waterloo), WiseMove: A Framework for Safe Deep Reinforcement Learning for
Autonomous Driving (2019)
• Garcia et al, A Comprehensive Survey on Safe Reinforcement Learning (2015)
Partial bibliography: Misc.
• Avoiding side effects
- Krakovna et al (DeepMind), Penalizing side effects using stepwise relative reachability (2019); associated blog post
- Alex Turner, Towards a new impact measure (2018 blog post)
- Achiam et al (UC Berkeley), Constrained Policy Optimization (2017)
• Testing and verification
- Defense Innovation Board, AI Principles: Recommendations on the Ethical Use of Artificial Intelligence by the Department
of Defense, Appendix IV.C (2019) – study by the MITRE Corporation on the state of AI T&E
- Kohli et al (DeepMind), Towards Robust and Verified AI: Specification Testing, Robust Training, and Formal Verification
(2019 blog post) – references several important papers on testing and validation of advanced ML techniques, and
summarizes some of DeepMind’s research in this area
- Haugh et al, The Status of Test, Evaluation, Verification, and Validation (TEV&V) of Autonomous Systems (2018)
- Hains et al, Formal methods and software engineering for DL (2019)
• Security: Xiao et al, Characterizing Attacks on Deep Reinforcement Learning (2019)
• Control: Babcock et al, Guidelines for Artificial Intelligence Containment (2017)
• Risks from emergent behavior: Jesse Clifton, Cooperation, Conflict, and Transformative Artificial
Intelligence: A Research Agenda (blog post sequence, 2019)
• Long term risks:
- AI Impacts
- Ben Cottier and Rohin Shah, Clarifying some key hypotheses in AI alignment (blog post, 2019)
Editor's Notes
1. These are debatably part of AI Safety
2. We must be able to fully specify what we want the system to do; the system must be able to robustly achieve its goals; and we need assurance that the system is doing what we want
  3. Heather Roff was primary author on the DIB report