Eval tuning

 Posts: 1701
 Joined: Wed Apr 14, 2004 16:04
 Contact:
Re: Eval tuning
Very interesting experiments, Michel!
I am just curious how much of a better fit the DragonML eval is compared to Scan. So if you would run the Scan eval() on your 900M training set, how well does it predict the game scores compared to your own 50Mparameter eval()?
Not sure if you do loglikelihood or least squares, so the metric is a bit tricky. But e.g. if you store the full optimization trace: after how many iterations would the Scan eval() be as good a fit as the DragonML one, and how many iterations did the optimization for DragonML version go on after that hypothetical point to full convergence?
I am just curious how much of a better fit the DragonML eval is compared to Scan. So if you would run the Scan eval() on your 900M training set, how well does it predict the game scores compared to your own 50Mparameter eval()?
Not sure if you do loglikelihood or least squares, so the metric is a bit tricky. But e.g. if you store the full optimization trace: after how many iterations would the Scan eval() be as good a fit as the DragonML one, and how many iterations did the optimization for DragonML version go on after that hypothetical point to full convergence?
Re: Eval tuning
You can't really do that, because the training set is tightly coupled to the engine and its search algorithm itself. I am not quite sure what you mean by 'optimization trace'.Rein Halbersma wrote:Very interesting experiments, Michel!
I am just curious how much of a better fit the DragonML eval is compared to Scan. So if you would run the Scan eval() on your 900M training set, how well does it predict the game scores compared to your own 50Mparameter eval()?
Not sure if you do loglikelihood or least squares, so the metric is a bit tricky. But e.g. if you store the full optimization trace: after how many iterations would the Scan eval() be as good a fit as the DragonML one, and how many iterations did the optimization for DragonML version go on after that hypothetical point to full convergence?
When you increase the training set, at some point you will reach a limit, where adding more will not improve the eval function anymore. What you can do at this point, is to increase the number of features in your eval function. The eval accuracy will go up, but then it's speed will go down.
I think this is a major difference between scan and dragon. Scan has few features, but runs fast. Dragon has many features (including global mobility, breakthrough tables), but it runs slower. There is always a tradeoff.

 Posts: 1701
 Joined: Wed Apr 14, 2004 16:04
 Contact:
Re: Eval tuning
I meant the output that numerical optimizers usual give, so for each iteration in their solver they print out the objective function (sum of squares, loglikelihood, whatever), the gradient and/or the step size.MichelG wrote:You can't really do that, because the training set is tightly coupled to the engine and its search algorithm itself. I am not quite sure what you mean by 'optimization trace'.Rein Halbersma wrote:Very interesting experiments, Michel!
I am just curious how much of a better fit the DragonML eval is compared to Scan. So if you would run the Scan eval() on your 900M training set, how well does it predict the game scores compared to your own 50Mparameter eval()?
Not sure if you do loglikelihood or least squares, so the metric is a bit tricky. But e.g. if you store the full optimization trace: after how many iterations would the Scan eval() be as good a fit as the DragonML one, and how many iterations did the optimization for DragonML version go on after that hypothetical point to full convergence?
If you would use the Scan eval() and feed it your training positions, and evaluate your optimization criterion, how high would that be? And after how many iterations does the DragonML eval() reach that level of prediction? And how much does DragonML improve on that when it reaches convergence. So I'm interested in the relative predictive power for Scan and DragonML evals.
Re: Eval tuning
i think i understand what you mean. But you can't compare 2 eval functions this way. Scan en dragon are trained on different datasets. For example, dragon trains on 'balanced' positions only. Running the scan eval on the dragon dataset would give it poor results, as would running dragon on the scan dataset.Rein Halbersma wrote: I meant the output that numerical optimizers usual give, so for each iteration in their solver they print out the objective function (sum of squares, loglikelihood, whatever), the gradient and/or the step size.
If you would use the Scan eval() and feed it your training positions, and evaluate your optimization criterion, how high would that be? And after how many iterations does the DragonML eval() reach that level of prediction? And how much does DragonML improve on that when it reaches convergence. So I'm interested in the relative predictive power for Scan and DragonML evals.
Apart from that, the internal iterations of the solver don't have any real meaning. At the end of the process, you end up with the answer of the linear algebra problem you posed. How it got there, and whether it used internal iterations is irrelevant in my opinion.
I think a better question is to look at the correlation coefficient between the eval function of a position, and the outcome of the game that results from that.
But then there is no objective means of determining the gameoutcome of a position. As long we don't have a 40piece database, the game outcome depends on by which program, with which settings the game is produced.
I don't have the numbers available, but i know the cc will be a small number.
In the end, the only objective test is to have programs play against eachother

 Posts: 816
 Joined: Sat Apr 28, 2007 14:53
 Real name: Ed Gilbert
 Location: Morristown, NJ USA
 Contact:
Kingsrow with auto tuned eval
I put a link to kingsrow beta test version 1.57a at the kingsrow download web page. This version uses an eval function that was created automatically using logistic regression. Against the classic kingsrow it scores 0.608 in engine matches. Against dragon it scores essentially equal.
Right now the eval uses patterns in rectangular 4x4 board regions, plus some nonpattern material features. My next step is to see if I can get an improvement by using 5x5 board patterns.
The eval was created using a training set of 0.97M games, 72.2M positions. Training games are a combination of classic kr vs classic kr, classic kr vs autotuned kr, and auto kr vs auto kr. I found that using only classic vs classic games resulted in an eval with rather weak performance. I also tried a large number of games scan vs kingsrow, but using these games for training did not give good results either. I used 5move and 6move start positions in an attempt to get good coverage of the board patterns and avoid duplicate games. But I still removed 80K games because they were too similar to other games. I think it might be useful to add some randomness to the eval during training games, but I didn't try it.
Every time I increased the number of training positions, I got an increase in the strength of the resulting eval. With 72M training positions, I imagine I have reached the point of diminishing returns, but I don't really know. Regression takes only a few hours to converge, but it takes a while to play all the training games. I used dxp to play the games. But I don't like the way the time controls that are hardwired into dxp encourage using most of the time in the earlier part of the games, so I created a new dxp message type DXP_GAMEREQ_FISCHER. All the training games were played using either 0.1sec or 0.12sec per move time increment.
I'm using gradient descent to optimize the weights. I'm also adding L2 regularization to the gradients to try to prevent some of the weights from getting too large. This was a problem with some of the patterns that appear in only a small number of training positions. Regularization is added only to the pattern gradients, not to the material and other nonpattern features.
I started off using a crossentropy cost function, because this is supposed to be the preferred method for logistic regression. But I never got good results with it. The evals were not as good as the classic kingsrow eval. When I tried using a mean squared error cost function I got better results. I still don't know why crossentropy didn't work well for me. The optimization seemed to converge ok but the weights just didn't play well.
 Ed
Right now the eval uses patterns in rectangular 4x4 board regions, plus some nonpattern material features. My next step is to see if I can get an improvement by using 5x5 board patterns.
The eval was created using a training set of 0.97M games, 72.2M positions. Training games are a combination of classic kr vs classic kr, classic kr vs autotuned kr, and auto kr vs auto kr. I found that using only classic vs classic games resulted in an eval with rather weak performance. I also tried a large number of games scan vs kingsrow, but using these games for training did not give good results either. I used 5move and 6move start positions in an attempt to get good coverage of the board patterns and avoid duplicate games. But I still removed 80K games because they were too similar to other games. I think it might be useful to add some randomness to the eval during training games, but I didn't try it.
Every time I increased the number of training positions, I got an increase in the strength of the resulting eval. With 72M training positions, I imagine I have reached the point of diminishing returns, but I don't really know. Regression takes only a few hours to converge, but it takes a while to play all the training games. I used dxp to play the games. But I don't like the way the time controls that are hardwired into dxp encourage using most of the time in the earlier part of the games, so I created a new dxp message type DXP_GAMEREQ_FISCHER. All the training games were played using either 0.1sec or 0.12sec per move time increment.
I'm using gradient descent to optimize the weights. I'm also adding L2 regularization to the gradients to try to prevent some of the weights from getting too large. This was a problem with some of the patterns that appear in only a small number of training positions. Regularization is added only to the pattern gradients, not to the material and other nonpattern features.
I started off using a crossentropy cost function, because this is supposed to be the preferred method for logistic regression. But I never got good results with it. The evals were not as good as the classic kingsrow eval. When I tried using a mean squared error cost function I got better results. I still don't know why crossentropy didn't work well for me. The optimization seemed to converge ok but the weights just didn't play well.
 Ed
Last edited by Ed Gilbert on Tue Dec 29, 2015 16:42, edited 1 time in total.

 Posts: 987
 Joined: Thu Jun 20, 2013 17:16
 Real name: Krzysztof Grzelak
Re: Eval tuning
Thank you for the information and the new versions of the program Ed.
Re: Eval tuning
Ed, I also just started in my holiday period to code some logistic regression stuff.
So far I only have 100K games , which I generated with a 4 ply search and material only evaluation.
The only thing used, next to the material only evaluation, was a 7P DB.
To get some randomness I used a random shuffle of the root moves, and in the end nodes of the search I also added a small random score (between 2 and 2).
From the position set , I only used the positions with equal material, no kings, and no capture or capture thread.
I also did not differentiate between game phases (yet).
I also started with the Scan approach, which is 16 regions, with 4x4 board squares (= 8 board positions).
So 16 * 6561 = 104976 parameters.
Now I'm working on the gradient descent (learning rate is 0.1), but so far convergence is very poor, so I might do something wrong.
After 10.000 iterations the cost function reduced from 0.6931 to 0.6891.
I did not include a regularization term yet.
( Question what is the initial cost you see, and after how many iterations do you see convergence, what is the cost level you reach normally?).
The cost function I use is sigma ( Result * log(sigmoid(position))  ( 1 result) * log( 1 sigmoid(position)))
In the end I divide the cost function by the number of positions.
The initial feature vector is only with zeros.
Result is 0 for a lost position, 0.5 draw, and 1.0 win (all perspective white).
So the initial loss for every position (lost, draw, and win) is log(0.5), so average loss also log(0.5), which is the aforementioned 0.6931......
Bert
So far I only have 100K games , which I generated with a 4 ply search and material only evaluation.
The only thing used, next to the material only evaluation, was a 7P DB.
To get some randomness I used a random shuffle of the root moves, and in the end nodes of the search I also added a small random score (between 2 and 2).
From the position set , I only used the positions with equal material, no kings, and no capture or capture thread.
I also did not differentiate between game phases (yet).
I also started with the Scan approach, which is 16 regions, with 4x4 board squares (= 8 board positions).
So 16 * 6561 = 104976 parameters.
Now I'm working on the gradient descent (learning rate is 0.1), but so far convergence is very poor, so I might do something wrong.
After 10.000 iterations the cost function reduced from 0.6931 to 0.6891.
I did not include a regularization term yet.
( Question what is the initial cost you see, and after how many iterations do you see convergence, what is the cost level you reach normally?).
The cost function I use is sigma ( Result * log(sigmoid(position))  ( 1 result) * log( 1 sigmoid(position)))
In the end I divide the cost function by the number of positions.
The initial feature vector is only with zeros.
Result is 0 for a lost position, 0.5 draw, and 1.0 win (all perspective white).
So the initial loss for every position (lost, draw, and win) is log(0.5), so average loss also log(0.5), which is the aforementioned 0.6931......
Bert

 Posts: 816
 Joined: Sat Apr 28, 2007 14:53
 Real name: Ed Gilbert
 Location: Morristown, NJ USA
 Contact:
Re: Eval tuning
I think it's probably good to start as you did with only material in the evaluation. Using the classic kr, the eval is very much biased for certain types of formations, and it affects the quality of the training samples.So far I only have 100K games , which I generated with a 4 ply search and material only evaluation.
The only thing used, next to the material only evaluation, was a 7P DB.
To get some randomness I used a random shuffle of the root moves, and in the end nodes of the search I also added a small random score (between 2 and 2).
I used 6pc dbs for all matches. I removed positions that have <= 6 pieces. But even before that, I removed any games where more than the first 35% of the moves were identical to another training game.
Following the Texel tuning method, I did a qsearch of each game position, and used the pv leaf node pos from the qsearch as the training positions. I removed identical qsearch positions that were from the same game. I also removed qpositions that have > 38 pieces.
I autotuned the weights of men, and kings, so I had to use positions with nonzero attributes of these features. 22% of the training positions have a men imbalance, and 6% have a king imbalance.From the position set , I only used the positions with equal material, no kings, and no capture or capture thread.
I used the same scheme as the chess guys, so I have a set of weights for beginning phase and another set for the end phase. The eval uses phase interpolation between both sets, similar to scan. My definition of phase is coarser than scan's, which I think goes from 0 to 300. Mine goes from 0 to 80. It's a function of the number of men on the board and which half of the board they occupy.I also did not differentiate between game phases (yet).
But half of those regions are just mirror images of the other half, so you only really need weights for 8 regions. In other words, the same set of weights for the rectangular region of squares 1,2,16,17 can be used for the region of squares 34,35,49,50.I also started with the Scan approach, which is 16 regions, with 4x4 board squares (= 8 board positions).
So 16 * 6561 = 104976 parameters.
You're using crossentropy cost function, which has different magnitudes than the mean squared error that I'm using, so you really can't compare. For number of iterations, I'm doing about 9 iterations of minibatches, then another 20 or more full iterations. Each minibatch is 1/50 of an iteration, so 450 minibatches before switching to full iterations. Using minibatches is a nice time saver.( Question what is the initial cost you see, and after how many iterations do you see convergence, what is the cost level you reach normally?).
I'm doing the same, except all is from black's perspective, in keeping with my preference for English checkers conventions :)The initial feature vector is only with zeros.
Result is 0 for a lost position, 0.5 draw, and 1.0 win (all perspective white).
 Ed
Re: Eval tuning
As most of you know I am really interested in a program that would support the Killerrule. If I have understood past conversations correctly, then this would require 1) modification of move generator, 2) new evaluation function and 3) new endgame databases.
I believe that changing the move generator never was the big issue. Would it be right to say that with several people now working on eval learning that the second hurdle has also become a lot less difficult to overcome?
Am I right in thinking that with eval learning in place it would now take days/weeks instead of months to create a good quality eval specifically for Killer?
I believe that changing the move generator never was the big issue. Would it be right to say that with several people now working on eval learning that the second hurdle has also become a lot less difficult to overcome?
Am I right in thinking that with eval learning in place it would now take days/weeks instead of months to create a good quality eval specifically for Killer?

 Posts: 1701
 Joined: Wed Apr 14, 2004 16:04
 Contact:
Re: Eval tuning
first hypothesis: just plug a regular draughts eval into a Killer engine with Killer databases and you are getting already a very good Killer engineWalter wrote:As most of you know I am really interested in a program that would support the Killerrule. If I have understood past conversations correctly, then this would require 1) modification of move generator, 2) new evaluation function and 3) new endgame databases.
I believe that changing the move generator never was the big issue. Would it be right to say that with several people now working on eval learning that the second hurdle has also become a lot less difficult to overcome?
Am I right in thinking that with eval learning in place it would now take days/weeks instead of months to create a good quality eval specifically for Killer?
second hypothesis: a specifically tuned Killer eval is not just better at Killer, but also at regular draughts (because the enormously high drawing percentage of regular draughts does not give enough information to the fitting procedure). This would also apply to humans: study Killer middle games to become better at regular draughts.

 Posts: 987
 Joined: Thu Jun 20, 2013 17:16
 Real name: Krzysztof Grzelak
Re: Eval tuning
Sorry for asking. And where can I get the program Killer.
Re: Eval tuning
I tried to modify Scan today. You can download the result with below link. I give absolutely no guarantees that it works properly. You can let it play against itself, e.g. with commandsKrzysztof Grzelak wrote:Sorry for asking. And where can I get the program Killer.
time 1
2
I tried it a few times and it seems to have a tendency for white to win. Could be a bug in my modification or could be that in Killer the first one to move has an advantage.
http://www.killerdraughts.org/killerscan.zip
Re: Eval tuning
Your first hypothesis might be right. Today I changed the move generator of Scan to introduce the Killerrule. This seems to work and I am playing a DXP match using the Damage version that can use Scan as engine (thanks Bert!!). After watching some of the games I think indeed that Killer databases are more important than a Killereval. With neither of these available the search alone does not seem to go towards Killerwins. The potential wins are outside the search depth and neither the eval nor the database (I only kept the 2 pieces DB which I am assuming still applies to Killer) is able guide the search in maximizing the potential. I am guessing the a 6 KillerDB would provide more 'direction' than a Killereval. Unfortunately, Scan does not include the code for generating databases so I cannot explore this further.Rein Halbersma wrote: first hypothesis: just plug a regular draughts eval into a Killer engine with Killer databases and you are getting already a very good Killer engine
second hypothesis: a specifically tuned Killer eval is not just better at Killer, but also at regular draughts (because the enormously high drawing percentage of regular draughts does not give enough information to the fitting procedure). This would also apply to humans: study Killer middle games to become better at regular draughts.
Your second hypothesis would be my best guess as well. I am sure that one day we will find out
Re: Eval tuning
I played two DXP matches with the KillerScan version. In about 50% of the games one of the Scan engines wins. With identical search, identical eval, no databases and identical resources it would seem fair to conclude that even though the eval score is close to 0.00 there actually is an interesting positional advantage for one side. Learning from this would indeed probably make the program stronger in regular draughts as well. It could result in having the better position more often but against a strong opponent it would still all result in a draw of course.Rein Halbersma wrote: first hypothesis: just plug a regular draughts eval into a Killer engine with Killer databases and you are getting already a very good Killer engine
second hypothesis: a specifically tuned Killer eval is not just better at Killer, but also at regular draughts (because the enormously high drawing percentage of regular draughts does not give enough information to the fitting procedure). This would also apply to humans: study Killer middle games to become better at regular draughts.
 Attachments

 KillerScan 5min 80 moves.pdn
 (24.77 KiB) Downloaded 150 times

 Posts: 1701
 Joined: Wed Apr 14, 2004 16:04
 Contact:
Re: Eval tuning
The old (~1998) version of the Dragon program generated the Dam 2.2/Moby Dam/Scan compatible databases. The source is available from http://hjetten.home.xs4all.nl/DragonDra ... .Win32.zip You could modify the move generator there as well and make 26 pc dbs for Killer. These can be plugged in to both Moby Dam and Scan. It would be interesting to modify Moby Dam as well and see if a Killer match between Moby Dam / Scan gives a bigger score difference than a regular draughts match.Walter wrote:I played two DXP matches with the KillerScan version. In about 50% of the games one of the Scan engines wins. With identical search, identical eval, no databases and identical resources it would seem fair to conclude that even though the eval score is close to 0.00 there actually is an interesting positional advantage for one side. Learning from this would indeed probably make the program stronger in regular draughts as well. It could result in having the better position more often but against a strong opponent it would still all result in a draw of course.Rein Halbersma wrote: first hypothesis: just plug a regular draughts eval into a Killer engine with Killer databases and you are getting already a very good Killer engine
second hypothesis: a specifically tuned Killer eval is not just better at Killer, but also at regular draughts (because the enormously high drawing percentage of regular draughts does not give enough information to the fitting procedure). This would also apply to humans: study Killer middle games to become better at regular draughts.