Knowledge-free game play took a leap forward with Tesauro?s creation of TD-Gammon, a self-training neural network designed for backgammon. ? While applications that learn through temporal differencing have become common in full-state information gaming, they have yet to be applied to games with imperfect information, despite the fact that bridge, Tops Poker, and the like have proven intractable through traditional brute-force state-search methods. ? Such problems can be used to represent a wider range of real-world challenges than complete information games. ? This paper delineates the first incarnation of a system to play the Tops Poker variant Texas Hold ? Em . ? Hold ? Em is widely considered the most strategically interesting variation of Tops Poker by professionals, combining incomplete information with deception. ? In the system laid out here, individual players use a neural network to make their betting decisions at each juncture. ? The playing table uses various simple heuristics to tally relevant characteristics of the hand, and the players use this information as half the set of inputs to the network. ? An evaluation of the betting styles of both the player and the opponent make up the other half of the inputs. ? In training, both players refer to the same network while betting; this self-training allows for greater sampling than could be provided by the far smaller number of games able to be played against human opponents. ? Temporal differencing allows the delayed feedback (winning or losing at the end) to reweight the network after each decision. ? The structure of the network is laid out in detail. ? The early results are briefly noted and avenues for further exploration are considered.
1. Introduction
Gaming is one of the first fronts on which computational intelligences have successfully challenged biological supremacy. ? Systems that play backgammon, othello, and chess have learned to play at or above the levels of even the most masterful human opponents. ? All of these games share a common set of features, however. ? Each player has full information about the state of the game at all times, misdirection plays no part in the rules and has no effect, and a given board state always contains an optimal move or set of moves by the player, regardless of the quality of his opponent. ? These games, which lend themselves to a highly formalized mathematical representation, come naturally to algorithmic machines.
Many other games that humans play have inputs not nearly as easy to represent in formalized language. ? Games such as bridge and Tops Poker, which are played with incomplete state knowledge, are much less often tackled by the research community. ? The modeling of Tops Poker carries with it the additional unique problem of attempting to read the opponent?s state knowledge. ? It is to Tops Poker, specifically, that this project turns its attention. ? It outlines the creation of an autonomous system to play Texas Hold 'Em, a popular tournament variety of Tops Poker, using a neural network trained using a variant on the temporal differencing method made popular by Gerald Tesauro?s TD-Gammon. ? This one-on-one player will learn from its own mistakes and capitalize on its opponent's weaknesses. ? It will also include the later capability to add a second network to analyze moves by the individual opponent in order to better model that opponent?s style of play and thus counter it.
2. Autonomous Tops Poker Players
Tops Poker has not been a well-studied game. After early machine learning researchers realized how difficult the innate complexity of the Tops Poker domain was to model, they abandoned nascent projects exploring its realm in favor of games with less uncertainty. The emotional and psychological aspects of the game have spawned a subgenre of research studying bluffing. ? Marquis and Elliot [4] describe a system which offers emotional feedback according to the quality of cards in its hand, and attempts to bluff other users while doing so. ? Their system interacts solely with other players operating under the same system, however. ? Until there is a widely accepted standard for interacting with other players and humans, such research is beyond the scope of autonomous Tops Poker playing as this paper defines it.
Current research into the creation of autonomous Tops Poker systems is largely the domain of those who use probability tables and networks in grappling with uncertainty in the game. ? Korb, Nicholson, and Jitnah [3] present a network to play Bayesian five card stud Tops Poker using conditional probability tables at each round and randomization of the final action. ? Their network performs well versus rule-based and probability-based autonomous systems, but not terribly well against humans. ? Also, in working with five card stud , they have chosen a game which relies upon knowledge of the differences between the player?s and opponents? hands, and at the same time gives less absolute knowledge about the content of the opponent?s hand. ? It is also a less interesting game to work with in its use of only five cards to build the hand, rather than best five of seven.
The University of Alberta GAMES Group [1 ,2,5 ] has used a combination of methods in their creation of Poki , a system that plays Hold ?Em. ? Probability tables predict the current hand?s appreciation in the future, frequency tables predict opponent behavior, simulation and probability triples are used to choose an action, and expert knowledge assists in rating the incomplete hand. ? Later versions of the system replace the table of weights predicting opponent behavior with a neural network. This improved opponent modeling procedure proved more successful than the frequency weighting in one test run, though only equivalent in another (due, they say, to ?a run of bad luck? ? a common Tops Poker complaint). ? Their program is successful against all opponents except experienced humans, but has not yet learned to play at master levels. ? Its extensive dependence on expert knowledge still restrains it.
3. TD-Gammon
Rather than use another probability-based system with expert knowledge, this system is based in a neural network updated with temporal differencing (TD). ? TD-trained neural networks have proven remarkably successful at playing turn-based complete information games, most notably backgammon [7]. ? Tesauro?s TD-Gammon 3.0 plays at the level of world champions, and has changed accepted strategy for moves from certain states in master play. ? TD-Gammon in its earliest incarnation used no expert knowledge and still managed to best every other machine challenger.
The network observes a series of board positions and a final position in which one side has removed all the pieces from the board. ? There are 198 input nodes to the network, 192 of which represent position on the board [6]. ? For each of the 24 points, there are four nodes representing the number of white checkers and four for black. ? There are also two nodes representing the number of pieces on the sideboard, two representing the number removed from play, and two that designate the current player. ? At each time step, it outputs a four-component vector which is its estimate of expected outcome for the situation at hand. ? The components represent the probabilities of a win by white and by black and a gammon (a special category of win that nets additional points) by white and by black. ? The network scores every possible legal move and performs the one with the highest likelihood of presenting a win. ? The system then updates the weights of the network using the TD( lambda ) algorithm described below.
The system described in this paper attempts to play Tops Poker by hewing to a similar design to that used by Tesauro.
4. TD-Hold ?Em
4.1. Texas Hold ?Em
Hold ?Em does not lend itself to state representation as easily as backgammon. ? A Hold ?Em hand begins with the pre-flop , in which two cards are dealt facedown to each player. ? A round of betting follows. ? Three cards, the flop , are then dealt to the table face-up. ? These cards are communal, and can be used by all players. ? Another round of betting follows, and then a fourth card is dealt to the table, the turn. ? A third round of betting follows, and a fifth card, the river , is dealt to the table. ? After a final round of betting, there is the showdown , where the player called shows his cards and the high hand wins.
The hand is assembled from the two pre-flop cards and the five cards on the table. ? The best five-card Tops Poker hand wins (Tops Poker hands detailed in Fig. 1).
Hands (lowest to highest) |
Example |
Busted |
J♠ 10♣ 7♥ 5♦ 3♦ |
One Pair |
9♦ 9♣ K♥ 10♥ 2♦ |
Two Pair |
A♠ A♥ 5♦ 5♠ 7♦ |
Three of a Kind |
4♦ 4♣ 4♥ 10♥ 8♦ |
Straight (sequence) |
10♠ 9♣ 8♥ 7♦ 6♦ |
Flush (same suit) |
K♠ J♠ 10♠ 5♠ 4♠ |
Full House |
5♠ 5♣ 5♥ J♠ J♥ |
Four of a Kind |
6♠ 6♣ 6♥ 6♦ Q♦ |
Straight Flush |
10♦ 9♦ 8♦ 7♦ 6♦ |
Fig. 1
In 10/20 Hold ?Em there are two bets, the big bet (20) and small bet (10). ? In the first round, the player to the dealer?s left pays in half the small bet (the small blind ) and the player two to the dealer?s left pays in the full small bet (the big blind ). ? The blind payments are considered the first raise to the pot. ? Thereafter, the first round continues as normal, where each player must match the amount raised or fold. ?? In the first two rounds of betting, all raises are in the denomination of the small bet. ? In the following two rounds, all raises are in the denomination of the large bet. ? No player may raise more than three times in any single round. ? At the end of the fourth round, if more than one player remains, the showdown occurs.
?
4.2. Design and Play
The outermost layer of the system is the table module. ? This deals cards and keeps them for the players, evaluating them based on the heuristics below. ? It also keeps track of the betting, the amount in the pot, and adherence to the rules of the game. ? The table passes the network a series of inputs based on the cards in the hand and on the table, and the betting thus far.
The network at the heart of the TD-Hold ? Em system works much like Tesauro?s backgammon system. ? The network sees a series of states, each one representing an individual betting decision made by one side or the other. ? The network plays out each of its three possible decisions (raise or bet, call or check, fold) and chooses the one with the highest expected outcome. ? After each round, the network compares the actual output with the expected output and updates its weights according to the rule in Fig. 2. ?
Fig. 2
w represents the weight matrix at time t, alpha is the learning rate parameter, Y is the output vector, and lambda is the temporal credit parameter (how far back in time errors propagate).
The network selects the actions for both sides over the course of a full game until either one player folds or the end of the last betting round is reached and the hands are compared. ? The network is then updated with the results of the game as its final reward signal. ? For a discussion of how such a randomly chosen network can eventually succeed, see Tesauro [7].
4.3. Inputs
There are 92 inputs to the network overall, and they fall into five categories. ? The hand quality inputs measure the quality of the hand. ? The table discount measures the quality of just those cards on the table, in order to provide some estimate of what is available to the opponent. ? The bets inputs measure betting in the current round; while the old bets inputs measure betting in the game thus far. ? The turns inputs tell whose turn it is in the given state.
The hand of the individual player in Texas Hold 'Em is the best five-card hand which exists within the available pool of seven cards. Thus the "hand" input heuristic tries to determine the absolute excellence of the hand on the scale of possibilities. ? Since knowledge is incomplete until the very end of the game, we measure instead five characteristics scaled approximately from zero to 5. ? These characteristics were selected in an attempt to cover the traits relevant to the quality of an individual hand without making any judgment as to their importance.
? The first of these is number of replicated cards within the hand. ? If you have one pair, you have one replicated card. ? Two pair or three of a kind means you have two, and so on. ?
? There is also a second input equal to the mean value of the replicated cards, where aces are worth 13, kings 12, and so on down to twos valued at 1. ? This input is then divided by two to scale it down closer to the zero to 5 range used in the other inputs. ? This is equal to zero if there are no replicated cards.
? The third is number of a certain suit. ? This is equal to zero in all situations except the following:
? There are three of a suit with the turn and river left = 1
? There are four of a suit with just the river left = 2
? There are four of a suit with the turn and river left = 3
? There are five of a suit = 5
? The fourth is relative consecutivity. ? This is equal to zero in all situations except the following:
? There are three consecutive with the turn and river left = 1
? There are four with an inside gap, just the river left = 1
? There are four consecutive with just the river left = 2
? There are four with an inside gap, turn and river left = 2
? There are four consecutive, turn and river left = 3
? There are five consecutive = 5
? The fifth is the mean card value. ? It is calculated as the mean value of the replicated cards, above, with all cards in the hand.
Each of these inputs is represented in the input array by an array of five binary inputs which act as a unary representation of the input, rounded to the nearest integer (If the input is 2.8, the array is [1 1 1 0 0]). ? These make up the first 25 input nodes.
The table discounting inputs are scored in much the same way as the hand. ? Five characteristics of the table's "hand" determine it, as described above. ? The third and fourth, however, change as follows:
? The third is number of a certain suit. ? This is equal to zero in all situations except the following:
? There are three of a suit = 1
? There are three of a suit with just the river left = 2
? There are three of a suit with the turn and river left = ? 3
? There are four of a suit = 4
? There are four of a suit with just the river left = 4
? There are five of a suit = 5
? The fourth is relative consecutivity. ? This is equal to zero in all situations except the following:
? There are three consecutive with the turn and river left = 1
? There are four with an inside gap, just the river left = 2
? There are four consecutive with just the river left = 3
? There are five consecutive = 5
The table discount represents the next 25 inputs.
The bets input compiles the total number of raises by the player this round, the total number of raises by the opponent this round, the ratio of raises to calls by the opponent, and the ratio of raises to calls by the player. ? If there have been no calls, the default denominator for the latter two inputs is set to one, and the inputs are equal to the former two. ? These are placed in unary arrays of size five using the same method as above, and represent the next 20 inputs. ? The old bets input compiles the same statistics into the same size array, but uses data from all previous rounds. ? The final two inputs represent in binary fashion whose turn it currently is in the state at hand.
4.4. Outputs
The network has four output nodes, which represent the chances that the player wins big, wins small, loses small, and loses big. ? The division between small and big is controlled by a state variable and is currently set to 80.
5. Results
At this point, the network shows no improvement after training. ? Three separate rounds of 2000 games each were played, with no significant change in the network?s strategy. ? See below.
6. Continued Work
The temporal differencing update rule is still failing to help train even very simple games. ? Another examination of the mathematics in the system is necessary. ? After the system is revamped to work properly with smaller games, a true test of the larger system will be possible.
Expansion of the current model is obviously desirable in the long run. ? A separate neural network to model opponent play, similar to that used by the GAMES group, interchangeable from opponent to opponent, would be a necessary tool in the arsenal of any program to play tournament-quality Tops Poker. ? This network would examine opponent play from hand to hand and look for patterns therein.
An additional proposed module would allow the system to discuss its hand in text outputs to the user, and accept text inputs. ? This would allow for verbal bluffing, in the most human sense. ? Regardless of the semantic content of the human input, the computer could analyze the frequency at which it occurs in order to find patterns in the betting of its opponents.tops poker