Five safety problems and Countermeasures of the ho


It has been nearly two years since researchers from Google, Stanford, the University of California, Berkeley, and OpenAI published the paper "Concrete Problems in AI Safety", yet it remains one of the most important pieces of research in the field, covering a range of safety issues that AI developers need to understand.

In this paper, the authors discuss unintended and harmful behavior in AI systems and the strategies that should be adopted to avoid accidents. Specifically, they address five problems and their countermeasures: avoiding negative side effects, reward hacking, scalable oversight, safe exploration, and robustness to distributional shift. Each is illustrated with the example of an office cleaning robot, as follows:

1. Avoiding negative side effects

When designing the objective function of an AI system, the designer specifies the goal but not the exact steps the system should follow. This allows AI systems to come up with novel and effective strategies for achieving their goals.

But if the objective function is not well defined, the AI's ability to develop its own strategies may lead to unintended, harmful side effects. For example, suppose a robot's objective function is to move boxes from one room to another. The task seems simple, but there are many ways it can go wrong. If a vase stands in the robot's path, the robot may knock it over to complete the goal. Since the objective function says nothing about the vase, the robot does not know to avoid it. People regard this as common sense, but AI systems do not share our understanding of the world. It is not enough to express the goal as "complete task X"; the designer also needs to specify the safety criteria under which the task is to be completed.

A simple solution is to penalize the robot whenever it affects the environment, for example by knocking over a vase or scratching a wooden floor. However, this strategy may leave the robot unable to act, because every action requires some interaction with the environment (and therefore affects it). A better strategy might be to define a budget for how much the AI system is allowed to affect the environment. This helps minimize unintended damage without paralyzing the AI system. The budget strategy is also very general and can be reused across many AI applications, from cleaning and driving to financial trading and anything else an AI system might do.
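The budget idea can be sketched in a few lines. This is only an illustration, assuming that the impact of each action can be summarized as a single number; `ImpactBudget`, `choose_action`, and all the reward and impact values are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass
class ImpactBudget:
    """Tracks how much environmental impact the agent may still cause."""
    remaining: float

    def allows(self, impact_cost: float) -> bool:
        return impact_cost <= self.remaining

    def spend(self, impact_cost: float) -> None:
        if not self.allows(impact_cost):
            raise ValueError("action exceeds the remaining impact budget")
        self.remaining -= impact_cost


def choose_action(candidates, budget):
    """Pick the highest-reward action whose impact fits in the budget.

    candidates: list of (name, task_reward, impact_cost) tuples.
    Returns the chosen name, or None if no action is affordable.
    """
    affordable = [c for c in candidates if budget.allows(c[2])]
    if not affordable:
        return None
    name, _, cost = max(affordable, key=lambda c: c[1])
    budget.spend(cost)
    return name


budget = ImpactBudget(remaining=1.0)
candidates = [
    ("push box through vase", 10.0, 5.0),  # fast, but breaks the vase
    ("carry box around vase", 8.0, 0.2),   # slower, almost no impact
]
chosen = choose_action(candidates, budget)
print(chosen, budget.remaining)
```

Unlike a flat penalty on every interaction, the budget still lets the robot act; it only rules out actions whose impact is too large for what remains.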

Another method is to train the AI system to recognize harmful side effects so that it can avoid them on its own. In this case, the AI agent is trained on two tasks: the original task specified by the objective function and the task of recognizing side effects. The key idea is that two tasks may have very similar side effects even when their main goals differ, and even when they run in different environments. For example, neither a house cleaning robot nor a house painting robot should knock over a vase while working. Similarly, a cleaning robot should not damage the floor, whether it operates in a factory or in a house. The main advantage of this method is that once the AI agent has learned to avoid side effects on one task, it can carry this knowledge over when training on another.

Although it is useful to design methods that limit side effects, these strategies are not sufficient on their own. An AI system still needs extensive testing and critical evaluation before it is deployed in a real environment.

2. Avoiding reward hacking

The design of an AI system may contain loopholes that let the agent reach its goal by illegitimate means. Because an AI agent is trained to maximize its reward, it often finds unexpected loopholes and shortcuts. For example, suppose the office cleaning robot earns its reward only when it cannot see any garbage in the office. Rather than cleaning the place, the robot may find it more convenient to turn off its visual sensor, which achieves the goal but is obviously a false success. In more complex AI systems, attempts to exploit such loopholes become more prominent, because these systems have more ways of interacting with their environment, vaguer goals, and more freedom of action.

One possible way to prevent reward hacking is to set up a reward agent whose task is to judge whether the rewards given to the learning agent are valid. The reward agent ensures that the learning agent (the cleaning robot in our example) does not exploit loopholes in the system but actually achieves the desired goal. In the previous example, the human designer can train the reward agent to check the room for garbage (an easier task than cleaning the room). If the cleaning robot turns off its visual sensor and claims a high reward, the reward agent marks the reward as invalid. The designer can then review the rewards marked as invalid and make the necessary changes to the objective function to close the loophole.
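As a rough illustration of the reward agent, the sketch below assumes the checker can inspect the room independently of the cleaner's own sensor. The functions `cleaner_reward` and `audit_reward` and the garbage counts are hypothetical.

```python
def cleaner_reward(observed_garbage: int) -> float:
    """Reward as the cleaner computes it: 1.0 when it sees no garbage."""
    return 1.0 if observed_garbage == 0 else 0.0


def audit_reward(claimed: float, actual_garbage: int) -> dict:
    """Reward agent: inspect the room independently and flag mismatches."""
    expected = 1.0 if actual_garbage == 0 else 0.0
    valid = claimed == expected
    return {"reward": claimed if valid else 0.0, "valid": valid}


# Sensor switched off: the cleaner "sees" no garbage although 3 items remain.
hacked = audit_reward(cleaner_reward(observed_garbage=0), actual_garbage=3)

# Room genuinely clean: the claimed reward passes the audit.
honest = audit_reward(cleaner_reward(observed_garbage=0), actual_garbage=0)

print(hacked)
print(honest)
```

The entries marked invalid are the ones a designer would review when repairing the objective function.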

3. Scalable oversight

When an AI agent learns to perform a complex task, human oversight and feedback are more helpful than rewards from the environment alone. Rewards are usually modeled to convey how far the task has been completed, but they rarely provide adequate feedback on the safety implications of the agent's actions. Even if the agent completes the task successfully, it may not be able to infer the side effects of its behavior from the reward alone. Ideally, a human would provide fine-grained supervision and feedback every time the agent performs an action. Although this would give the agent far more information about the environment, it would demand too much human time and effort.

One promising research direction for this problem is semi-supervised learning, in which the agent is still evaluated on all of its actions (or tasks) but receives a reward for only a small sample of them. For example, the cleaning robot takes various actions to clean the room. If it does something harmful, such as damaging the floor, it receives a negative reward for that specific action. Once the task is complete, the robot is evaluated on the overall effect of all its actions (rather than on each action separately, such as picking up an item from the floor) and is rewarded according to its overall performance.
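A toy version of the semi-supervised setup is shown below. It assumes that human evaluation is the costly step, so only every fourth episode is labeled and the rest are scored by a crude nearest-neighbor estimate. The scoring rule in `human_reward` and the scratch counts are invented for illustration, and a real system would use a learned reward model instead.

```python
def human_reward(scratches: int) -> float:
    """Stand-in for a costly human evaluation of one cleaning episode."""
    return 1.0 - 0.5 * scratches


def label_subset(episodes, every):
    """Ask for a human reward on every `every`-th episode only."""
    return {i: human_reward(s) for i, s in enumerate(episodes) if i % every == 0}


def estimate_reward(scratches, episodes, labels):
    """Predict the reward of an unlabeled episode from the nearest labeled one."""
    nearest = min(labels, key=lambda i: abs(episodes[i] - scratches))
    return labels[nearest]


# Each episode is summarized by the number of floor scratches it caused.
episodes = [0, 2, 1, 0, 2, 1, 0, 2]
labels = label_subset(episodes, every=4)  # only episodes 0 and 4 are labeled

# Use the human label where one exists, otherwise fall back to the estimate.
estimates = [
    labels.get(i, estimate_reward(s, episodes, labels))
    for i, s in enumerate(episodes)
]
print(labels)
print(estimates)
```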

Another promising research direction is hierarchical reinforcement learning, which establishes a hierarchy among different learning agents. The idea can be applied to the cleaning robot as follows. A supervisor robot assigns work (for example, cleaning a specific room) to the cleaning robot and gives it feedback and rewards. The supervisor robot itself needs only a few actions: assigning a room to the cleaning robot, checking whether the room is clean, and providing feedback. It therefore does not need a lot of reward data to be trained effectively. The cleaning robot performs the more complex task of cleaning the room and receives frequent feedback from the supervisor robot. The same supervisor robot may also oversee the training of several cleaning robots, delegating tasks to each of them and providing rewards and feedback directly. Because the supervisor robot takes only a few abstract actions, it can learn from sparse rewards.
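The division of labor between supervisor and cleaners can be sketched as follows. This shows only the structure, not a learning algorithm: the classes `Cleaner` and `Supervisor`, the `skill` parameter, and the dirt values are hypothetical.

```python
class Cleaner:
    """Low-level agent: takes many primitive steps, gets frequent feedback."""

    def __init__(self, skill):
        self.skill = skill    # hypothetical: dirt units removed per step
        self.feedback = []    # one reward per assigned room

    def clean(self, dirt, max_steps=10):
        steps = 0
        while dirt > 0 and steps < max_steps:
            dirt = max(0, dirt - self.skill)
            steps += 1
        return dirt


class Supervisor:
    """High-level agent: a few abstract actions (assign, check, give feedback)."""

    def __init__(self, cleaners):
        self.cleaners = cleaners

    def run(self, rooms):
        """rooms maps room name -> dirt level. Returns one sparse reward."""
        clean_rooms = 0
        for i, dirt in enumerate(rooms.values()):
            cleaner = self.cleaners[i % len(self.cleaners)]  # assign
            left = cleaner.clean(dirt)                       # delegate
            ok = left == 0                                   # check
            cleaner.feedback.append(1.0 if ok else -1.0)     # give feedback
            clean_rooms += ok
        return clean_rooms / len(rooms)


cleaners = [Cleaner(skill=2), Cleaner(skill=1)]
supervisor = Supervisor(cleaners)
reward = supervisor.run({"office": 6, "kitchen": 30, "lobby": 4})
print(reward, [c.feedback for c in cleaners])
```

The supervisor receives a single number for the whole run, while each cleaner receives feedback for every room it handled.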

4. Safe exploration

An important part of training an AI agent is ensuring that it explores and understands its environment. Although exploring may look like a poor strategy in the short term, it can be very effective in the long run. Imagine that a cleaning robot has learned to recognize garbage. It picks up a piece of garbage, leaves the room, throws it into the garbage can outside, returns to the room, looks for another piece, and repeats. This strategy works, but another one may work better. If the agent spends time exploring its environment, it may find a smaller dustbin inside the room. Instead of going back and forth with one piece at a time, it can first collect all the garbage in the smaller dustbin and then empty it into the garbage can outside in a single trip. Unless the agent is designed to explore its environment, it will not discover such time-saving strategies.

However, while exploring, the agent may also take actions that damage itself or the environment. For example, suppose the cleaning robot sees some stains on the floor. Instead of scrubbing the stains with a mop, the agent decides to try a new strategy. It tries to scrape them off with a wire brush and damages the floor in the process. It is difficult to list every possible failure mode and hard-code the agent to protect itself against each one. One way to reduce harm is to optimize the learning agent's performance in the worst case. When designing the objective function, the designer should not assume that the agent will always operate under optimal conditions. Explicit reward signals can be added to ensure that the agent does not perform certain catastrophic actions.
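The difference between optimizing the average case and the worst case can be shown with a small sketch. The outcome values below are invented: each list holds the rewards a strategy might yield on different floors.

```python
# Hypothetical reward of each strategy on four different floor types.
outcomes = {
    "wire brush": [1.0, 1.0, 1.0, -1.0],  # great on tile, ruins wood
    "mop": [0.4, 0.4, 0.4, 0.4],          # mediocre but never harmful
}


def best_on_average(outcomes):
    """Choose the strategy with the highest mean reward."""
    return max(outcomes, key=lambda a: sum(outcomes[a]) / len(outcomes[a]))


def best_in_worst_case(outcomes):
    """Choose the strategy whose worst outcome is least bad (maximin)."""
    return max(outcomes, key=lambda a: min(outcomes[a]))


print(best_on_average(outcomes))     # wire brush
print(best_in_worst_case(outcomes))  # mop
```

Here the average-case rule accepts an occasional ruined floor, whereas the worst-case rule gives up some reward to rule it out.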

Another solution may be to confine the agent's exploration to a simulated environment or to limit how far the agent can explore. This is similar to budgeting the agent's impact in order to avoid negative side effects, except that here the budget applies to how much of the environment the agent may explore. Alternatively, AI designers can avoid the need for exploration by demonstrating the best behavior in different scenarios.
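One way to picture a limit on exploration is an epsilon-greedy rule that explores only within a list of actions considered safe, and only while an exploration budget lasts. The function `act`, the action names, and the numbers are assumptions made for this sketch.

```python
import random


def act(greedy_action, safe_actions, explore_left, epsilon, rng):
    """Epsilon-greedy step restricted to `safe_actions` and a step budget.

    Returns the chosen action and the remaining exploration budget.
    """
    if explore_left > 0 and rng.random() < epsilon:
        return rng.choice(safe_actions), explore_left - 1
    return greedy_action, explore_left


rng = random.Random(0)
safe_actions = ["mop", "sponge", "vacuum"]  # "wire brush" deliberately excluded
explore_left = 5                            # at most five exploratory steps
history = []

for _ in range(100):
    action, explore_left = act("mop", safe_actions, explore_left, 0.5, rng)
    history.append(action)

print(explore_left, sorted(set(history)))
```

Once the budget is spent, the agent falls back to its current best action for every remaining step.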

5. Robustness to distributional shift

A complex challenge in deploying AI agents in real environments is that they may encounter situations they have never experienced before. Such situations are harder to handle and may lead the agent to take harmful actions. Consider the following: the cleaning robot has been trained to clean the office while handling all of the challenges described above. But today, an employee has left a small plant in the office. Since the cleaning robot has never seen a plant before, it may decide that the plant is garbage and throw it away. Because the AI does not recognize that this is a new situation, it carries on as though nothing has changed. One promising research direction is to determine when an AI agent has encountered a new situation, so that it recognizes that it is now far more likely to make mistakes. Although this does not fully solve the problem of adapting AI systems to unforeseen environments, it helps to detect problems before mistakes occur. Another noteworthy research direction is the transfer of knowledge from familiar scenarios to new ones.
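Detecting that a situation is new can be as simple as checking how far an observation lies from the training data. The sketch below assumes a single numeric feature (object weight in kilograms) and a three-sigma threshold; both are illustrative choices, and real systems would use richer features and detectors.

```python
from statistics import mean, stdev


def fit(train_weights):
    """Summarize the training data by its mean and standard deviation."""
    return mean(train_weights), stdev(train_weights)


def is_novel(weight, mu, sigma, z_max=3.0):
    """Flag observations that lie far outside the training distribution."""
    return abs(weight - mu) > z_max * sigma


def decide(weight, mu, sigma):
    """Defer to a human instead of acting when the input looks unfamiliar."""
    return "ask a human" if is_novel(weight, mu, sigma) else "discard"


# Hypothetical weights (kg) of the garbage items seen during training.
mu, sigma = fit([0.09, 0.11, 0.10, 0.095, 0.105])

print(decide(0.102, mu, sigma))  # looks like ordinary garbage
print(decide(1.5, mu, sigma))    # the potted plant: far heavier than anything seen
```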


In short, the overall trend in AI technology is toward greater system autonomy, but as autonomy increases, so does the possibility of error. Problems related to AI safety are more likely to arise when AI systems directly control their physical and/or digital environment without human intervention, as in automated industrial processes, automated financial trading algorithms, AI-controlled social media campaigns for political parties, autonomous vehicles, cleaning robots, and so on. The challenge may be huge, but the paper "Concrete Problems in AI Safety" has made the AI community aware of the potential safety problems of advanced AI systems, as well as prevention and

Copyright © 2011 JIN SHI