Internet engine matches

Discussion about development of draughts in the time of computer and Internet.
Post Reply
Fabien Letouzey
Posts: 299
Joined: Tue Jul 07, 2015 07:48
Real name: Fabien Letouzey

Re: Internet engine matches

Post by Fabien Letouzey » Fri Aug 07, 2015 19:56

Rein Halbersma wrote:Wouldn't your bottom 4 patterns pick up potential breakthroughs?
That's my assumption, but so does Dragon's. Yet Michel reported a large gain from the table, although it was many years ago. I guess he uses it in search in a different way.
Btw, did you ever study particular pattern coefficients for plausibility?
Assuming you mean having a look at the weights "manually": no I don't do that, because it's very difficult to interpret them due to correlation. For example established features could get a negative weight.
Perhaps your qsearch is also sufficiently different from others to be a contributing factor?
It gains at most 10 Elo in hyper-bullet self-play games. I would guess it's not even measurable in normal games.

BTW I also gain less than 10 Elo from bitbases(!). Many people seem to focus on them as a big computer strength so I'm confused ...

BertTuyt
Posts: 1477
Joined: Wed Sep 01, 2004 19:42

Re: Internet engine matches

Post by BertTuyt » Fri Aug 07, 2015 20:00

Fabien, did you ever measure the strength and depth reduction when you switch off probcut?

Bert

MichelG
Posts: 244
Joined: Sun Dec 28, 2003 20:24
Contact:

Re: Internet engine matches

Post by MichelG » Fri Aug 07, 2015 20:23

Fabien Letouzey wrote: Thanks Ed and Michel!

While I'm not complaining by any means, those results lead to a puzzle for me. Dragon Draughts already has the most accurate evaluation (through years of effort) and it's also gaining a lot from a breakthrough table that Scan doesn't have. I don't believe that my (very naive) ProbCut implementation can compensate for that if at all, so where are the missing Elo points coming from?

You don't have to answer right now BTW :)

Fabien.
It is on the top of my list to investigate that.

A couple of weeks ago i exchanged the dragon eval to that of scan 1.0 (with some restrictions), and concluded that dragon's original evaluation function was a lot better. Then i went on vacation, and Krzysztof's results indicated that scan played not as good as kingsrow and i left it alone. But now it is clear that scan is playing a class higher than dragon and kingsrow.

To investigate, i will again try dragon search combined with scan's eval, and put some more effort into that. If that combination wins massively, than it is clear that scan's eval is the cause, otherwise scan's search algorithm are the cause. Scan's smooth eval function may contribute to its good search as well.

It may also be that dragon's search algorithms are not particularly good or buggy, but that this is compensated by it's eval. I know that if i look at the search depth, dragon consistently searches shallower then kingsrow or damage. In any case i will try to find out.

Dragon's still contains some 40 parameters that are hand-tuned and a specialized 25 field breaktrough table. The hand-tuned parameters have been bugging me for a long time, but i never got around to getting rid of them.

It is now my plan to replace the hand-tuned parameters by machine learned ones.

In principle the breakthrough table could be 'absorbed' by ml-patterns as well; but attempts to do so did not gain strength. Maybe this is because the breakthrough table contains patterns of size 25 compared to the 16 of the patterns.

@rein: i don't think you can learn much from the individual patterns. It is the whole of them combined and the interaction between them that lead to the eval value.

Fabien Letouzey
Posts: 299
Joined: Tue Jul 07, 2015 07:48
Real name: Fabien Letouzey

Re: Internet engine matches

Post by Fabien Letouzey » Fri Aug 07, 2015 20:27

BertTuyt wrote:During the tournament, Fabien informed me that at one point Scan was searching deeper in comparison with Damage. This was remarkable, as Damage used 8 cores, at 4 GHz, and normally is searching much deeper as opponents (and sometimes throwing babies away with the pruning bathwater). Damage uses a combination of MCP, FHR and LMR.
I view Probcut as a kind of singular extension, except that it displays the maximum search depth instead of minimum (there are other differences too, of course). So the iteration number is very optimistic, and that's why I added the "average leaf depth" which I assumed would give more useful information.
So I guess that maybe also search depth is a major factor, maybe Ed and/or Michel observed the same.
Looking at the code, and I might have a better look, at least the probcut implementation looks interesting (although as Fabien explained the original statistiscs part is missing). Next to that I need to better understand the move sorting and LMR implementation. As Rein mentioned also the Q-search could be an explanation.
But for now, my bet would be the combination product, move sort and LMR.
Regarding pruning I'm confident that I gain not more than 100 Elo from it (compared to many hundreds in chess) at least in my testing conditions. Plus Dragon is not searching full-width, so the difference will be less (if at all). It would be easy to deactivate ProbCut in Scan and compare; LMR alone did well too (but slightly less than ProbCut alone).

In the end, it's possible that the answer is the usual "the Devil is in the details". But for me Scan is fairly simple and there are not many places details can hide into ...
Fabien, did you ever measure the strength and depth reduction when you switch off probcut?
Not directly, because LMR gained less. I tested ProbCut vs. full-width (say +80), LMR vs. full-width (+50) and ProbCut + LMR vs. ProbCut. (maybe +20). What you're asking would be ProbCut + LMR vs. LMR. So let's say +40. Those were not the final settings, though the results should be close.

Krzysztof Grzelak
Posts: 969
Joined: Thu Jun 20, 2013 17:16
Real name: Krzysztof Grzelak

Re: Internet engine matches

Post by Krzysztof Grzelak » Fri Aug 07, 2015 21:35

As for those matches with the program Kingsrow it is only Ed made me realize that I have a bad set program Scan. For which we sincerely apologize author of the program Scan - Fabiena Letouzey.

Ed Gilbert
Posts: 807
Joined: Sat Apr 28, 2007 14:53
Real name: Ed Gilbert
Location: Morristown, NJ USA
Contact:

Re: Internet engine matches

Post by Ed Gilbert » Fri Aug 07, 2015 22:40

I don't believe that my (very naive) ProbCut implementation can compensate for that if at all, so where are the missing Elo points coming from?
A very good question, one that many of us are scratching our heads to find the answer. In the case of kingsrow, after watching many match games, and running some other experiments, I think its the eval. Kingsrow does not use an auto tuned eval. In many of kingsrow's losses, scan is reporting high scores that kingrow is not, for many moves. It's quite remarkable to me that mostly just a set of patterns of 8 or 9 squares can see so much. In particular, I see no weakness in scan's handling of outpost squares (white 24, black 27). Many programs (kingsrow included) have had weaknesses in evaluating a man on these squares, and would lose games until a complicated eval was added that takes into consideration all of the attacking and defending squares behind the outpost. A pattern of 8 or 9 squares doesn't come close to seeing the whole picture. But somehow it works.

Kingsrow uses lmr and a pruning mechanism that is different than mpc. I tried replacing its pruning mechanism with scan's mpc, and didn't see any appreciable difference in depth searched. My lmr parameters are less aggressive than scan's, so I tried scan's. With this change kingsrow's nominal search depth was deeper, but average leaf depth didn't really change much. In general scan's nominal search depth is deeper than kingsrow, but average leaf depths are a lot closer. I then ran an engine match between this new test version of kingsrow using scan's lmr and mpc against a baseline kingsrow. The baseline code was better.

Kingsrow's search is MTD-f. It's very possible that settings that work well in PVS work differently in MTD.

I browsed the scan sources a bit and was curious about a couple of things, maybe you could comment on them. I didn't see any use of internal iterative deepening to get a good move for move ordering when there is no hashtable move. I'm sure I tested IID and found an improvement in node counts, and I thought it was kind of a standard technique in chess programs. Did you try it and find it didn't work well in scan?

Also, if I'm reading your history table code correctly, your changes in history table values for good moves and bad moves do not seem to have a depth component. I thought most chess programs use much bigger increments for deep interior nodes than for leaf nodes. I know I found this was effective in kingsrow.

-- Ed

Fabien Letouzey
Posts: 299
Joined: Tue Jul 07, 2015 07:48
Real name: Fabien Letouzey

Re: Internet engine matches

Post by Fabien Letouzey » Sat Aug 08, 2015 10:01

Krzysztof Grzelak wrote:As for those matches with the program Kingsrow it is only Ed made me realize that I have a bad set program Scan. For which we sincerely apologise author of the program Scan - Fabiena Letouzey.
No worries Krzysztof!

Fabien Letouzey
Posts: 299
Joined: Tue Jul 07, 2015 07:48
Real name: Fabien Letouzey

Re: Internet engine matches

Post by Fabien Letouzey » Sat Aug 08, 2015 10:09

Ed Gilbert wrote:A very good question, one that many of us are scratching our heads to find the answer. In the case of kingsrow, after watching many match games, and running some other experiments, I think its the eval. Kingsrow does not use an auto tuned eval. In many of kingsrow's losses, scan is reporting high scores that kingrow is not, for many moves. It's quite remarkable to me that mostly just a set of patterns of 8 or 9 squares can see so much. In particular, I see no weakness in scan's handling of outpost squares (white 24, black 27). Many programs (kingsrow included) have had weaknesses in evaluating a man on these squares, and would lose games until a complicated eval was added that takes into consideration all of the attacking and defending squares behind the outpost. A pattern of 8 or 9 squares doesn't come close to seeing the whole picture. But somehow it works.
Using patterns was a kind of a bet for me as I was unaware of Dragon already having them: do draughts have a similar structure as Othello? The reasoning was that because pieces move so slowly (compared to chess), local features might suffice. The answer appears to be "yes", though I have an arguably funny anecdote about this. The Scan learning algorithm says that it extracted about 20% of the data "knowledge". By comparison, a similar algorithm in Othello could extract 90% of the exact score (not just WLD)! So my first reaction when I saw the 20% (actually at the beginning it was less) is that the project was a failure and I nearly abandoned it (LOL). That was linear vs. logistic regressions though and the games were generated differently.

One puzzle piece is that patterns might be able to guess content outside their reach through correlation (and game phase). I think that rule constraints in Othello and to a lesser extent in draughts make this possible. Even without guessing, predicting an "average outcome" is probably much better than ignoring a feature entirely.

For complicated features, my wishful thinking is that patterns recognise the easy cases and let search (through ProbCut) focus on the unclear ones. In this view statistical evaluation gains a few plies everywhere, not just for outposts or other specific evaluation terms.

During the tournament I was uneasy with outposts. For instance in the TD-King game, I was glad when it was exchanged. That might not have been justified though.

Now regarding King's Row. The fact that it isn't using statistical evaluation makes its overall performance all the more impressive! I guess a lot of tuning can compensate for that, at the cost of making future changes more difficult because of local optimality.
Kingsrow uses lmr and a pruning mechanism that is different than mpc. I tried replacing its pruning mechanism with scan's mpc, and didn't see any appreciable difference in depth searched. My lmr parameters are less aggressive than scan's, so I tried scan's. With this change kingsrow's nominal search depth was deeper, but average leaf depth didn't really change much. In general scan's nominal search depth is deeper than kingsrow, but average leaf depths are a lot closer. I then ran an engine match between this new test version of kingsrow using scan's lmr and mpc against a baseline kingsrow. The baseline code was better.
It's only a guess but I would say that ProbCut relies on eval properties (such as correlation between shallow and deep searches) whereas LMR depends more on move ordering. So in your case, it would be easier to experiment with the latter (if I'm correct).
I browsed the scan sources a bit and was curious about a couple of things, maybe you could comment on them. I didn't see any use of internal iterative deepening to get a good move for move ordering when there is no hashtable move. I'm sure I tested IID and found an improvement in node counts, and I thought it was kind of a standard technique in chess programs. Did you try it and find it didn't work well in scan?
I could not measure an impact during Senpai's development and therefore put IID as "low priority" on my list for Scan. I haven't tried it. Furthermore I didn't write any tools to compare tree size (or TTD).
Also, if I'm reading your history table code correctly, your changes in history table values for good moves and bad moves do not seem to have a depth component. I thought most chess programs use much bigger increments for deep interior nodes than for leaf nodes. I know I found this was effective in kings row.
There are several non-pragmatic reasons. One is that, for me, history is only for low depth where the TT doesn't help much. Another, though I'm not exploiting it, is that I want my history scores to have a global interpretation: in this case a probability estimate. I think in Senpai what kind of update I was using didn't seem to matter much. I should have been more careful for Scan though because in chess we have a lot of candidates before we call history for help.

Ed Gilbert
Posts: 807
Joined: Sat Apr 28, 2007 14:53
Real name: Ed Gilbert
Location: Morristown, NJ USA
Contact:

Re: Internet engine matches

Post by Ed Gilbert » Sat Aug 08, 2015 14:24

Fabien, it's certainly fortunate that you didn't abandon the project early when you first saw the low extraction number!

As you noted, the endgame dbs don't change the elo very much. Like Michel, I started another engine match of 3-move ballots and using the 8pc db. The results so far are +2, -32, =239, so only a small improvement over the 6pc db. I'll post the final results when I get back from a short vacation on Wednesday or Thursday.

-- Ed

Rein Halbersma
Posts: 1700
Joined: Wed Apr 14, 2004 16:04
Contact:

Re: Internet engine matches

Post by Rein Halbersma » Sat Aug 08, 2015 16:40

Fabien Letouzey wrote: Using patterns was a kind of a bet for me as I was unaware of Dragon already having them: do draughts have a similar structure as Othello? The reasoning was that because pieces move so slowly (compared to chess), local features might suffice. The answer appears to be "yes", though I have an arguably funny anecdote about this. The Scan learning algorithm says that it extracted about 20% of the data "knowledge". By comparison, a similar algorithm in Othello could extract 90% of the exact score (not just WLD)! So my first reaction when I saw the 20% (actually at the beginning it was less) is that the project was a failure and I nearly abandoned it (LOL). That was linear vs. logistic regressions though and the games were generated differently.
What did you mean with 20% of the knowledge? Do you mean that the model predicts 20% of the variance in the game outcomes? My intuition would be that because wins/losses are so rare in draughts-selfplay that the residual variance of a model that predicts just draws would be quite small. So the eval has to try to explain that small piece of remaining knowledge. I wonder if that is the case and that the total remaining variance in draughts is not that much different from Othello?

Fabien Letouzey
Posts: 299
Joined: Tue Jul 07, 2015 07:48
Real name: Fabien Letouzey

Re: Internet engine matches

Post by Fabien Letouzey » Sat Aug 08, 2015 17:11

Rein Halbersma wrote:What did you mean with 20% of the knowledge? Do you mean that the model predicts 20% of the variance in the game outcomes? My intuition would be that because wins/losses are so rare in draughts-selfplay that the residual variance of a model that predicts just draws would be quite small. So the eval has to try to explain that small piece of remaining knowledge. I wonder if that is the case and that the total remaining variance in draughts is not that much different from Othello?
I was going to use the word "variance" but technically I'm using cross-entropy with logistic regression. In Othello, linear regression is appropriate because the game results can be viewed as "continuous" (though warping them can help).

In draughts I used material-only games with root-move shuffling (recall I never had an eval in the first place), so I'm not sure that there were so many draws. I can measure right now: it's (tada) 50%. I guess the original batch had fewer because search was not as good.

I also think that the error function is designed for win/loss only and doesn't give a perfect score when predicting draws accurately, although this is old stuff I remember from Othello. At that time I added a corrective term just for display but some things are better kept simple (and well established).

For the last question, I don't know. Othello looks pretty chaotic, especially in the endgame, so the difference between 90% and (I think it was at the beginning of Scan) 10-15% looked pretty striking. But it's probably a sum of independent factors among which, I am sure, game complexity.

Krzysztof Grzelak
Posts: 969
Joined: Thu Jun 20, 2013 17:16
Real name: Krzysztof Grzelak

Re: Internet engine matches

Post by Krzysztof Grzelak » Sat Aug 08, 2015 18:54

Match KINGSROW - SCAN (2-move ballots)

Kingsrow 1.56 vs. Scan 2.0 2 wins, 38 losses, 118 draws, 0 unknowns

Kingsrow

Opening Book = Best Moves
HashTable Size = 32 MB
DB cache Size = 4000 MB 6P
Threads = 2
Pondering = Off
Time = 1 Min / 100 Moves

Scan

Opening Book = book-margin = 2
DB cache Size = 2000 MB, 6P
Threads = 2
Pondering= Off
Time = 1 Min / 100 Moves

Match played on a computer with the equipment.


Processor - Intel Core i3 350M 2.27 GHz
Hard disc - HDD WD Scorpio Blue 1 TB
Memory of frames - 8 GB DDR3 1333
System - Windows 7 Home Premium 64 bit Service Pack 1 PL
Attachments
dxpgames.pdn
(177.19 KiB) Downloaded 148 times

Krzysztof Grzelak
Posts: 969
Joined: Thu Jun 20, 2013 17:16
Real name: Krzysztof Grzelak

Re: Internet engine matches

Post by Krzysztof Grzelak » Mon Aug 10, 2015 13:00

Match KINGSROW - SCAN (2-move ballots)

Kingsrow 1.56 vs. Scan 2.0 4 wins, 17 losses, 137 draws, 0 unknowns

Kingsrow

Opening Book = Best Moves
HashTable Size = 128 MB
DB cache Size = 4000 MB 6P
Threads = 2
Pondering = Off
Time = 5 Min / 100 Moves

Scan

Opening Book = book-margin = 2
DB cache Size = 2000 MB, 6P
Threads = 2
Pondering= Off
Time = 5 Min / 100 Moves

Match played on a computer with the equipment.

Processor - Intel Core i3 350M 2.27 GHz
Hard disc - HDD WD Scorpio Blue 1 TB
Memory of frames - 8 GB DDR3 1333
System - Windows 7 Home Premium 64 bit Service Pack 1 PL
Attachments
dxpgames.pdn
(177.19 KiB) Downloaded 149 times

Ed Gilbert
Posts: 807
Joined: Sat Apr 28, 2007 14:53
Real name: Ed Gilbert
Location: Morristown, NJ USA
Contact:

Re: Internet engine matches

Post by Ed Gilbert » Mon Aug 10, 2015 17:39

Krzysztof Grzelak wrote:Scan

Opening Book = book-margin = 2
DB cache Size = 2000 MB, 6P
Threads = 2
Pondering= Off
Time = 5 Min / 100 Moves
Krzyszysztof, it is probably a good idea to also show the "tt-size" used for scan, given how we have seen that an extreme value here can affect the results. Also, I'm not sure what is meant by "DB cache Size". I did not see a setting for scan's cache size in Fabien's documentation.

-- Ed

Krzysztof Grzelak
Posts: 969
Joined: Thu Jun 20, 2013 17:16
Real name: Krzysztof Grzelak

Re: Internet engine matches

Post by Krzysztof Grzelak » Mon Aug 10, 2015 19:14

Maybe exactly the way it should write.

Scan

Opening Book = book-margin = 2
End game = 6 Pieces
TT-size = 23
Threads = 2
Pondering= Off
Time = 5 Min / 100 Moves

Post Reply