Internet engine matches

 Posts: 1635
 Joined: Wed Apr 14, 2004 16:04
 Contact:
Re: Internet engine matches
Here's another one: it's easy and my program finds it within 1 second (and less than 0.5 million nodes)
Re: Internet engine matches
Rein Halbersma wrote: "Here's another one: it's easy and my program finds it within 1 second (and less than 0.5 million nodes)"
Yes Rein, Damy also needs less than 1 second. The kind of combination here is very similar to the previous one, isn't it?
BTW I easily recognize this position; I proposed it myself on this forum some time ago.
Gérard

Re: Internet engine matches
TAILLE wrote: "Yes Rein, Damy also needs less than 1 second. The kind of combination here is very similar to the previous one, isn't it? BTW I easily recognize this position; I proposed it myself on this forum some time ago."
I did not remember it, but after you showed the other position by Sijbrands, this one and the previous one popped up in my head again. This one is from GM Scholma IIRC, and it was published almost 20 years ago in Sijbrands's magazine "Dammen".
The last 2 positions are much easier (single/double sacrifice + a move that makes a simple double threat), but the Sijbrands diagram is much harder because there it is a single sacrifice + a move with a single threat, and after each reply there is a completely new and deep combination.
Re: Internet engine matches
Rein Halbersma wrote: "Gerard, another challenge for your new algorithm: how long does it take to find the winning move?"
TAILLE wrote: "Rein, this problem seems far easier. I do not have fractions of a second in my time measurement, but Damy solves the problem (34-30, 50-44) in less than 2 seconds. Do you have a similar measurement?"
Rein Halbersma wrote: "My program solves this position after a search of 13 ply and a tree of 7.5 million nodes (= separate calls to the search() function) in 7 seconds. My effective branching factor is around 3.38 for this position. What are some numbers for your search?"
Oops, I am not able to really answer your questions because we obviously have different definitions for the depth of a tree and for a node/leaf. Essentially, this is the consequence of several factors:
1) The extension/reduction/pruning mechanisms make a single value for the depth of the tree completely irrelevant. In the above position the sequence 34-30 35x24 represents 2 plies, but for Damy the depth is only one ply because the second ply is forced. On the other hand Damy, and I guess it is the same for your program (BTW what is the name of your program?), of course applies various reductions during the search.
2) Even for a leaf of the tree we cannot have the same definition. For example, the Damy eval function executes some (recursive) micro-searches in order to discover a breakthrough or a weak outpost (I know perfectly well that some programmers prefer to build eval tables. I also tried this approach but eventually changed my mind). These micro-searches are very important in my implementation because they allow Damy to discover a winning strategy while keeping the depth of the main tree as low as possible.
Anyway, I do not want to evade your question.
For Damy the minimum number of plies needed to solve this problem is 6:
34-30 (1st ply) 35x24* 50-44 (2nd ply) xxxx (3rd ply) 44-40 (4th ply) 45x34* 32-27 or 32-28 (5th ply) x 43-39 (6th ply) x x
Due to my reduction mechanism Damy is unable to discover this sequence with an initial depth of 6. I have to wait until depth = 9 to see Damy discover this winning sequence, in less than 2 seconds. You can then easily conclude that some branches were reduced by at least 3 plies!
Could you also explain what exactly you mean by 13 plies in your implementation?
Gérard
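For reference, the 3.38 figure Rein quotes is consistent with the simplest definition of effective branching factor, nodes^(1/depth). Whether his program uses exactly this definition is an assumption; the sketch below only checks the arithmetic from the two numbers in his post (7.5 million nodes, 13 ply). The function name is made up for illustration.

```python
def effective_branching_factor(nodes: float, depth_ply: int) -> float:
    """Return b such that b ** depth_ply == nodes (the plain N^(1/d) definition).

    Assumption: 'depth' is the nominal iteration depth reported by the engine.
    """
    return nodes ** (1.0 / depth_ply)


# Numbers taken from Rein's post: 7.5 million nodes at 13 ply.
ebf = effective_branching_factor(7.5e6, 13)
print(f"effective branching factor: {ebf:.2f}")  # prints 3.38
```

Because the result depends directly on what each engine counts as a ply and as a node, the same position can give quite different values for two programs, which is the point Gérard makes above.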
Re: Internet engine matches
"Hi, after quite a long time I have just managed to reach a fairly stable new search algorithm for Damy. Can you tell me how much time your program needs to solve the following Sijbrands composition? White to move: +1"
As my machine was constantly busy, I was not able to post the Damage results for the positions posted.
So here is the first one.
Note: timing is based on a 2.93 GHz i940, with a 1-core search.
After 0.02 sec Damage finds the right move (7 ply search).
After 0.94 sec Damage also has the right score (14 ply search).
Bert
Re: Internet engine matches
Rein Halbersma wrote: "Gerard, another challenge for your new algorithm: how long does it take to find the winning move?"
And here is the 1st Rein challenge:
Damage finds the right move and score after 0.08 sec (ply 10).
Bert
Re: Internet engine matches
Rein Halbersma wrote: "Here's another one: it's easy and my program finds it within 1 second (and less than 0.5 million nodes)"
And the 2nd challenge of Rein:
Damage finds the right move after 0.02 sec (ply 8, around 30K nodes).
Bert
Re: Internet engine matches
I didn't post for some time because I was playing engine matches.
After 2 won matches, I started to test IID (internal iterative deepening), which was not a success.
Then I restarted with the old source (at least that was what I thought) to get a feeling for the statistics.
But for an unknown reason Damage was no longer able to win a 158-game match.
The other matches also showed only small differences.
See the 8 engine matches below.
I don't have a good explanation yet.
Based on code comparison, I could not find any clue so far.
What I also can't imagine is that the first 2 results were statistical fluctuations, or that Kingsrow sometimes has a bad day (or days), or that something weird can happen during Kingsrow's initialization that results in lower strength. Maybe Ed has more experience.
Anyway, the difference is not so dramatic, but I'm puzzled.
Date        W   L    D  U    P     P%
16sep2012   7   3  148  0  162  51.3%
28sep2012  14   7  137  0  165  52.2%
1Oct2012    4  11  143  0  151  47.8%
10oct2012   2   8  148  0  152  48.1%
13oct2012   5  11  141  1  151  48.1%
16oct2012   4  11  143  0  151  47.8%
19Oct2012   3   9  146  0  152  48.1%
23Oct2012   5   9  143  1  153  48.7%
Bert

Re: Internet engine matches
BertTuyt wrote: "But for an unknown reason Damage was not able to win a 158 games match anymore [...] I don't have a good explanation yet?"
Try running the first 2 matches through BayesElo and compute the rating difference, confidence interval and likelihood of superiority. Then repeat with all matches. You'll be surprised how big the rating uncertainty is from short (~300 games) matches with differences in the 10-15 ELO range.
UPDATE: just a quick calculation to confirm that. Based on the first 2 matches alone, Damage scored +12 ELO with an error margin of +/- 6 ELO (1.99 sigma result). That meant that Damage was 97.7% likely to be the stronger engine. Still, that leaves a 2.3% chance that Kingsrow was stronger, so it's a reasonable but not completely convincing show of superiority from Damage. But in the remaining matches, Kingsrow scored +14 ELO with an error margin of +/- 3.5 ELO (4.00 sigma result), and the likelihood that Kingsrow is superior is almost 100% (less than 1 in 30,000 chance that Damage is stronger).
You can also compute how likely it is that you win a 158-game match, given that Kingsrow in reality is 14 ELO stronger over 910 games. If I'm not mistaken, that probability was about 1 in a thousand (3.1 sigma result). So you were indeed very lucky to win the first 2 matches (if they were played with identical versions).
Moral: statistically speaking, a 300-game match is better than a 9-game tournament, but luck can still influence the result. In particular, you need to test over a lot more games before you can conclude with, say, 99% confidence that a positive match score shows that your program is superior. This is of course well known in the chess engine community.
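The figures in the UPDATE can be roughly reproduced without BayesElo. The sketch below is not BayesElo's model; it assumes a plain normal approximation on the match score, with the win/loss/draw totals summed from Bert's table (first two matches: 21 wins, 10 losses, 285 draws; the six later matches: 23, 59, 864, leaving out the games in the U column). The function name `elo_summary` is made up for illustration.

```python
import math


def elo_summary(wins: int, losses: int, draws: int):
    """Return (elo_diff, elo_std_error, likelihood_of_superiority).

    Assumption: games are independent and the score is roughly normal;
    this is a simple approximation, not the BayesElo algorithm.
    """
    n = wins + losses + draws
    score = (wins + 0.5 * draws) / n
    # Per-game variance of the result (1, 0.5 or 0) around the mean score.
    var = (wins * (1 - score) ** 2
           + losses * score ** 2
           + draws * (0.5 - score) ** 2) / n
    se_score = math.sqrt(var / n)
    elo = 400 * math.log10(score / (1 - score))
    # Delta method: d(elo)/d(score) = 400 / (ln(10) * score * (1 - score)).
    se_elo = se_score * 400 / (math.log(10) * score * (1 - score))
    # Probability that the true Elo difference is above zero.
    los = 0.5 * (1 + math.erf(elo / se_elo / math.sqrt(2)))
    return elo, se_elo, los


first_two = elo_summary(wins=21, losses=10, draws=285)
later_six = elo_summary(wins=23, losses=59, draws=864)
for label, (elo, se, los) in (("first 2", first_two), ("later 6", later_six)):
    print(f"{label}: {elo:+.1f} Elo, std error {se:.1f}, LOS {los:.4f}")
```

Under these assumptions the first two matches come out at roughly +12 Elo with a standard error of about 6, and the later six at roughly -13 Elo with a standard error of a little over 3, close to the numbers quoted above.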
Re: Internet engine matches
Statistics is hard
In fact, if you see any result anywhere that says statement X was proven with 95% (2 sigma) chance, then statement X is probably not true. Sadly this happens in a lot of fields, even in cancer research, because of the misuse of statistics.
Yesterday I called my mom:
me: mom, something remarkable happened!
mom: tell me!
me: all coins in my pocket have 2 heads!
mom: why do you think that?
me: I flipped 3 of the coins, and all 3 coins were heads! If the coins had both heads and tails, there would be an 87.5% chance of turning up at least one tail.
mom: so you are only 87.5% sure that all coins in your pocket are heads?
me: remarkable, isn't it?
mom: why don't you flip another coin and be more sure?
me: well, that's a lot of effort. And I am already very confident. 87.5% is a lot!
mom: would you have called me if your 3 flips were all tails?
me: of course, then I would be 87.5% sure that all my coins were tails
mom: so there is a 1 in 4 chance that you would have called me?
me: I guess
mom: then there is a 1 in 4 chance that you would have found something remarkable, isn't there? You can be only 75% sure that you found something.
me: I guess
mom: would you have called when it was 2 tails and 1 head?
me: no
mom: every time you call me about your coins, later it turns out to be just regular coins.
me: I just wanted to hear your voice...
Rein's reasoning that there is only a very small chance of such a fluke is the right answer to the wrong question. Try answering this one:
How big is the chance that a match of games, which you end at a self-chosen point, gives a result that is remarkable?
I won't do the math, but it's going to be fairly big.
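The math is left undone above, but a small Monte Carlo run gives a feel for the effect. The sketch below is only an illustration under stated assumptions: two engines of exactly equal strength, a 5% win and 5% loss rate per game (so about 90% draws, loosely matching Bert's table), a "remarkable" result defined as 2 sigma on the win/loss difference, and a tester who looks after every game from game 30 on. All names and parameters are hypothetical.

```python
import random


def false_alarm_rates(trials=2000, max_games=300, min_games=30,
                      p_win=0.05, p_loss=0.05, z_limit=2.0, seed=1):
    """Simulate equal engines and return (fixed_rate, peeking_rate).

    fixed_rate:   fraction of matches that look remarkable after max_games.
    peeking_rate: fraction that looked remarkable at ANY point from
                  min_games on, i.e. where a self-chosen stop could claim it.
    """
    rng = random.Random(seed)
    fixed_hits = peeking_hits = 0
    for _ in range(trials):
        wins = losses = 0
        flagged_early = False
        for game in range(1, max_games + 1):
            r = rng.random()
            if r < p_win:
                wins += 1
            elif r < p_win + p_loss:
                losses += 1
            decisive = wins + losses
            # z-score of the win/loss difference under "equal strength".
            z = abs(wins - losses) / decisive ** 0.5 if decisive else 0.0
            if game >= min_games and z >= z_limit:
                flagged_early = True
        fixed_hits += z >= z_limit       # z is the value after the last game
        peeking_hits += flagged_early
    return fixed_hits / trials, peeking_hits / trials


fixed, peeking = false_alarm_rates()
print(f"remarkable at the fixed end point: {fixed:.1%}")
print(f"remarkable at some self-chosen point: {peeking:.1%}")
```

With the stopping point chosen after the fact, the false-alarm rate should come out noticeably higher than with a match length fixed in advance, even though the engines are identical.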

Re: Internet engine matches
MichelG wrote: "Statistics is hard. In fact, if you see any result anywhere that says statement X was proven with 95% (2 sigma) chance, then statement X is probably not true. Sadly this happens in a lot of fields, even in cancer research, because of the misuse of statistics. Rein's reasoning that there is only a very small chance of such a fluke is the right answer to the wrong question. Try answering this one: how big is the chance that a match of games, which you end at a self-chosen point, gives a result that is remarkable? I won't do the math, but it's going to be fairly big."
Michel,
What exactly are you trying to say? How would you propose to test for engine improvements? Do you see any role for statistics there?
Rein
Re: Internet engine matches
Rein Halbersma wrote: "Michel, what exactly are you trying to say? How would you propose to test for engine improvements? Do you see any role for statistics there?"
The point is that a statistical result that is significant at the 95% level does not mean that there is a 95% chance that something is true. Looking at Bert's table, for instance, I don't think you can conclude that the version from matches 1 & 2 outperformed the version in the later matches.
The thing to learn is that if you want to know whether a change to your code actually works, make sure to play enough games to get a high statistical confidence level (e.g. 3 or 4 sigma, or 99.9%).
158 games is just not enough to 'prove' anything.

 Posts: 771
 Joined: Sat Apr 28, 2007 14:53
 Real name: Ed Gilbert
 Location: Morristown, NJ USA
 Contact:
Re: Internet engine matches
The number of games needed to be confident of superiority depends on the relative strengths of the programs. 10 or 20 games is enough in the case of a severe mismatch. 158 games seems to be enough for kingsrow vs flits or truus. But when I test a new version of kingsrow vs a baseline, I use 7904 games (eight 3-move matches), and sometimes even that is not enough and I repeat it a few times.
 Ed
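How quickly the required number of games grows as the strength gap shrinks can be estimated with a back-of-the-envelope formula. This is only a rough sketch under assumptions not stated in the thread: a 90% draw rate (loosely based on Bert's table), independent games, a normal approximation, and the standard logistic Elo curve. The function name and defaults are made up for illustration.

```python
import math


def games_needed(elo_diff: float, sigmas: float = 3.0,
                 draw_rate: float = 0.9) -> int:
    """Rough number of games before an Elo edge sits `sigmas` standard
    errors away from a 50% score. All defaults are assumptions."""
    if elo_diff <= 0:
        raise ValueError("elo_diff must be positive")
    expected = 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))
    edge = expected - 0.5
    # Near a 50% score only decisive games add variance (0.25 each).
    sd_per_game = math.sqrt((1.0 - draw_rate) * 0.25)
    return math.ceil((sigmas * sd_per_game / edge) ** 2)


for gap in (5, 15, 50, 200):
    print(f"{gap:>3} Elo gap: about {games_needed(gap)} games")
```

The count scales with the inverse square of the score edge, so halving the gap needs about four times as many games. That is why a severe mismatch shows up in a handful of games while a small improvement over a baseline can need thousands. For large gaps the fixed draw rate is a poor assumption, so those numbers are only indicative.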
Re: Internet engine matches
Here is the updated match table, with 3 additional matches.
I still have not found the previous magic win button, but some minor modifications seem to work.
The last 3 matches seem to show an ELO difference below 10 (if my calculation is valid, it was around 7).
Date        W   L    D  U    P     P%
16sep2012   7   3  148  0  162  51.3%
28sep2012  14   7  137  0  165  52.2%
1Oct2012    4  11  143  0  151  47.8%
10oct2012   2   8  148  0  152  48.1%
13oct2012   5  11  141  1  151  48.1%
16oct2012   4  11  143  0  151  47.8%
19Oct2012   3   9  146  0  152  48.1%
23Oct2012   5   9  143  1  153  48.7%
25Oct2012   2   5  150  1  154  49.0%
28Oct2012   5   8  145  0  155  49.1%
30Oct2012   3   5  150  0  156  49.4%
Bert
Attachments:
dxpgames Oct2012 v9.pdn (158.42 KiB)
dxpgames Oct2012 v8.pdn (157.85 KiB)
dxpgames Oct2012 v7.pdn (156.06 KiB)

Re: Internet engine matches
BertTuyt wrote: "After 2 won matches, I started to test IID, which was not a success. Then I restarted with the old source (at least that was what I thought) to get a feeling for statistics."
Bert,
Do you have your program under version control? It should not be hard to get back the exact version that won the 2 matches.
Rein