The assignment should be submitted via Blackboard. Submit a file called assignment6.zip, containing the following two files:
These naming conventions are mandatory; failure to adhere to them can incur a penalty of up to 20 points.
Your name and UTA ID number should appear on the top line of both documents.
In this task you will implement the value iteration algorithm.
Your function will be invoked as follows:
value_iteration(<environment_file>, <non_terminal_reward>, <gamma>, <K>)
The arguments provide the following information:
Figure 1: The environment described in file environment1.txt.
The environment file will follow the same format as files environment1.txt and environment2.txt. For example, file environment1.txt describes the world shown in Figure 1, and it has the following contents:
1.0,X
.,-1.0
Figure 2: The environment described in file environment2.txt.
Similarly, file environment2.txt describes the world shown in Figure 2, and it has the following contents:
.,.,.,1.0
.,X,.,-1.0
.,.,.,.
As you see from the two examples, the environment files are CSV (comma-separated values) files, where:
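A minimal sketch of reading such a file is shown below. The helper name `parse_environment` is purely illustrative, and it assumes that "." marks an ordinary state, "X" a blocked state, and a number a terminal state with that utility, as in the two examples above:

```python
def parse_environment(text):
    """Parse environment text: '.' = ordinary state, 'X' = blocked state,
    a number = terminal state with that utility."""
    grid = []
    for line in text.strip().splitlines():
        row = []
        for cell in line.split(','):
            cell = cell.strip()
            if cell == '.':
                row.append(('free', None))
            elif cell.upper() == 'X':
                row.append(('blocked', None))
            else:
                row.append(('terminal', float(cell)))
        grid.append(row)
    return grid
```

For example, parsing the contents of environment1.txt shown above yields a 2x2 grid whose top-left cell is a terminal state with utility 1.0.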
At the end of your program, you should print out the utility values for all states.
The output should follow this format:
%6.3f,%6.3f,…
…
In other words, each row in your output corresponds to a row in the environment, and you use the %6.3f format specifier (or equivalents, depending on the programming language) for each utility value.
Do NOT print out this output after each iteration. You should only print out this output after the final iteration.
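In Python, for instance, the required format could be produced with a helper like this (the name `print_utilities` is just an illustration):

```python
def print_utilities(U):
    """Print the utility grid: one line per environment row,
    each value formatted with %6.3f, separated by commas."""
    for row in U:
        print(','.join('%6.3f' % u for u in row))
```

Note that %6.3f pads each value to a width of six characters, so positive values are preceded by a space.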
In your answers.pdf document, you need to provide the complete output for the following invocations of your program:
value_iteration('environment2.txt', -0.04, 1, 20)
value_iteration('environment2.txt', -0.04, 0.9, 20)
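The core of the algorithm is the Bellman backup for each non-terminal state. The sketch below is illustrative, not prescribed: it assumes the transition model from the MDP slides is the usual one where the intended direction succeeds with probability 0.8 and each perpendicular direction occurs with probability 0.1, and that moving off the grid or into a blocked cell means staying put. The grid here is a simple list of strings, and the function names are hypothetical:

```python
# (row, col) deltas for the four actions, and the two perpendicular
# directions for each action.
ACTIONS = {'up': (-1, 0), 'down': (1, 0), 'left': (0, -1), 'right': (0, 1)}
SIDEWAYS = {'up': ('left', 'right'), 'down': ('left', 'right'),
            'left': ('up', 'down'), 'right': ('up', 'down')}

def moved_utility(grid, U, i, j, a):
    """Utility of the state reached by attempting action a from (i, j);
    going off-grid or into a blocked cell ('X') means staying put."""
    di, dj = ACTIONS[a]
    ni, nj = i + di, j + dj
    if 0 <= ni < len(grid) and 0 <= nj < len(grid[0]) and grid[ni][nj] != 'X':
        return U[ni][nj]
    return U[i][j]

def bellman_update(grid, U, i, j, r, gamma):
    """One backup for non-terminal state (i, j):
    U(s) = R(s) + gamma * max_a sum_s' P(s'|s,a) U(s')."""
    best = max(0.8 * moved_utility(grid, U, i, j, a)
               + 0.1 * moved_utility(grid, U, i, j, SIDEWAYS[a][0])
               + 0.1 * moved_utility(grid, U, i, j, SIDEWAYS[a][1])
               for a in ACTIONS)
    return r + gamma * best
```

Value iteration then repeats this update K times over all non-terminal, non-blocked states, while terminal states keep their fixed utilities.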
In this task, you will implement the AgentModel_Q_Learning function from the Reinforcement Learning slides. Implement a function that can be called as:
q_learning(<environment_file>, <non_terminal_reward>, <gamma>, <number_of_moves>, <Ne>)
The command line arguments should be:
The outcome of each move should be generated randomly, following the state transition model described in pages 9-10 of the MDP slides.
As in the previous task, for terminal states and blocked states, your model should not allow any action to be performed once you reach those states. Note that the AgentModel_Q_Learning pseudocode on the slides does handle this case appropriately, and your implementation should handle this case the same way: terminate the current mission and start a new mission. When starting a new mission, the start state should be chosen randomly (with equal probability) from all possible states, except for terminal states and blocked states.
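Choosing the start state of a new mission can be sketched as follows. This assumes "." marks exactly the states that are neither terminal nor blocked, as in the environment files; the name `random_start` is hypothetical:

```python
import random

def random_start(grid):
    """Pick a start state uniformly at random from all states that are
    neither terminal (a numeric cell) nor blocked ('X')."""
    candidates = [(i, j)
                  for i, row in enumerate(grid)
                  for j, cell in enumerate(row)
                  if cell == '.']
    return random.choice(candidates)
```

Since `random.choice` samples uniformly from the candidate list, every eligible state is equally likely, as required.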
For the η function, use η(N) = 1/N
For the f function, use:
Your solution needs to simulate the SenseStateAndReward function, which is straightforward. It also needs to simulate the ExecuteAction function, which should implement the state transition model described in pages 9-10 of the MDP slides, with the probabilities used in those slides. As described in those slides, bumping into a wall leaves the agent in its current state.
Note that some computations will require values Q[s,a] that have not been instantiated yet. Uninstantiated values in the Q table should be treated as if they are equal to 0.
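One convenient way to get this "uninstantiated means 0" behavior in Python is a `defaultdict`. The sketch below shows a single Q-Learning backup using η(N) = 1/N; the names (`q_update`, `Q`, `N`) and the calling convention are illustrative, not prescribed:

```python
from collections import defaultdict

Q = defaultdict(float)   # Q table keyed by (state, action); missing entries read as 0
N = defaultdict(int)     # visit counts for eta(N) = 1/N

def q_update(s, a, reward, s_next, actions, gamma):
    """One Q-Learning backup:
    Q[s,a] += eta(N[s,a]) * (reward + gamma * max_a' Q[s',a'] - Q[s,a])."""
    N[(s, a)] += 1
    eta = 1.0 / N[(s, a)]
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += eta * (reward + gamma * best_next - Q[(s, a)])
```

Because `Q` is a `defaultdict(float)`, looking up a Q-value that has never been assigned simply returns 0, which matches the required treatment of uninstantiated values.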
At the end of your program, you should print out the utility values for all states.
The output should follow this format:
%6.3f,%6.3f,…
…
In other words, each row in your output corresponds to a row in the environment, and you use the %6.3f format specifier (or equivalents, depending on the programming language) for each utility value.
Do NOT print out this output after each iteration. You should only print out this output after the final iteration.
In your answers.pdf document, you need to provide the complete output for the following invocations of your program:
q_learning('environment2.txt', -0.04, 1, 1000, 20)
q_learning('environment2.txt', -0.04, 0.9, 1000, 20)
Suppose that you want to implement a Q-Learning algorithm for learning how to play chess.
Your choices should be the best ones you can make, so that your algorithm plays chess as well as possible.
Figure 3: The environment to be considered in this task.
Consider the environment shown in Figure 3. States (1,2) and (3,2) are terminal, with utilities -1 and +1, respectively. States (2,1) and (2,3) are blocked. Suppose that actions and the state transition model are as described in pages 9-10 of the MDP slides.
Part a: Suppose that the reward for non-terminal states is -0.04, and that γ=0.9. What is the utility for state (2,2)? Show how you compute this utility.
Part b: Suppose that γ=0.9, and that the reward for non-terminal states is an unspecified real number r (that can be positive or negative). For state (2,2), give the precise range of values for r for which the “up” action is not optimal. Show how you compute that range.