“Excellent” (with the finger tapping)
@bart, I wanted to play with your simulator, but results are not correct with always only one section. Is there anything special to do to make it work or is your repository currently in a bad state?
@mav, you have entirely rewritten the test in Go, that’s very impressive.
I had originally developed the simulation for a few different purposes before the ageing changes came out. I have some big plans for the simulation codebase. Hopefully those results will come to fruition early next year. So far ageing has taken a good chunk of time away from the original intentions (but in the best possible way).
@tfa, Yes - that’s because it simulates an older set of rules that would cause such behaviour. We have changed the rules since then, but there wasn’t time to update the simulator. I’ll probably get to that at some point, but for now we have some tasks with higher priority. Sorry it’s not as useful as it could be!
What part of the OP were you referring to? The “Split” section, “Static group and quorum sizes”, somewhere else?
It’s in the Block States > Dead(PeerId) > Trigger for voting:
“we take the Hash of the event H. Then if H % 2^age == 0 for any peer (sorted by age ascending) in our section, we relocate this node to the neighbour that has the lowest number of peers.”
The bold text indicates younger nodes are preferred for relocation. But rfc0045 says prefer older nodes, so there is a conflict.
I think I have found the problems in the algorithm:
-
The sections that have a shorter prefix should be chosen in priority as target of relocations, even if they have more nodes. This is the main problem and it is only one line to modify.
-
There is a mysterious hard coded rule that suppress node ageing when the event concerns a non-adult node in a section with a prefix length > 4 that doesn’t do any good.
With these modifications the simulation is beginning to fly (with 75 sections created). It is not yet perfect because there are still a lot of young node rejections (70%), but the situation is unblocked.
I also think that relocations implementing a change of owning sections during a split or a merge shouldn’t trigger other relocations in cascade to avoid destabilizing newly created sections. This means that the Gone and Live events generated by a split or merge operations shouldn’t count for node ageing. The result is almost the same: 71 sections but there are less merge events (280 instead of 377) which demonstrates a more stable network. This one is debatable.
I have just uploaded my modifications in my fork, but I haven’t analyzed all the consequences. If you want to study them.
The bold text indicates younger nodes are preferred for relocation. But rfc0045 says prefer older nodes, so there is a conflict.
Thanks for the clarification. Instinctively, I would think that you would want to relocate an elder rather than a young node to an unhealthy neighbouring section. The reason for this is that the if a neighbour is already in trouble, you don’t want to send an untrusted adversary its way. This seems like a preferred option as long as the section from which the chosen elder is leaving has enough elders to remain healthy. I may have my thinking backward on this, but this is the sentiment that first comes to mind…
I would think that you would want to relocate an elder rather than a young node to an unhealthy neighbouring section. The reason for this is that the if a neighbour is already in trouble, you don’t want to send an untrusted adversary its way.
Agreed. I would maybe rephrase ‘untrusted adversary’ to a ‘lesser trusted adversary’ but otherwise I think the reasoning is ok.
Also there is this reason:
The rfc calls for an “exponentially increasing period between such relocations” which youngest-preferred does not achieve
It’s much harder than exponentially difficult.
The sections that have a shorter prefix should be chosen in priority as target of relocations, even if they have more nodes.
I didn’t explain why this is needed: what we observe currently during the simulations are successive splits that create a completely unbalanced network, for example () => 0 & 1, 0 => 00 & 01, 01 => 010 & 011, 010 => 0100 & 0101, 0101 => 01010 & 01011. Here there are 6 sections: 1, 00, 011, 0100, 01010 and 01011 with distinct prefix lengths instead of similar ones. At one time the shorter prefix needs a merge which triggers a generalized merge of all the sections that recreates the section with zero length prefix ().
And then the cycle begins again. This creates a oscillating phenomenon from 1 section to a few ones.
By selecting the target section for relocation with the shorter prefix, we insure that a section with a long prefix won’t be split and so an unbalanced network is not created.
Of course, if there are several sections with a short prefix then the smaller section must be selected among them.
Nice catch, Thierry. Avoid, at all costs, a cascade merging, the greatest fear, seems the most reasonable option.
We can have both if the destination section generated from your proposal (either 1 or 2) changes the relocated address to another one belonging to a neighbouring section of the destination section more in need of a new node.
And adding a step in the random relocation, forcing the new node to find a key in the neighbouring section with shorter prefix, could be the best of both worlds.
Hey @tfa, thanks for this! @bart is away for a couple of days, so I’ll be looking into this in the meantime. Let me get back to you once I have something.
Hey @tfa,
Your modifications certainly made things better, but it still isn’t enough. For example, we are still seeing sections of ~1000 nodes, which is way too much (ideal section size is between 20 - 30 nodes). We are currently trying different approach and will update as soon as we have some results.
Yes, the algorithm must be improved because sections are still too big and there are too many node rejections, but I don’t observe as big sections as you do. With default settings (no arguments given) I get 71 sections with 154 nodes on average (min size is 33 and max is 431).
This is the result with my fork, but you have made some modifications since the fork, I will try to merge them when I have time, to see how things evolve.
My random seed is: AGE_SEED="[1070056145, 2737779209, 2756500248, 3832683655]"
Complete result is:
Prefix | Count |
---|---|
0000000 | 67 |
0000001 | 102 |
000001 | 201 |
000010 | 238 |
000011 | 222 |
000100 | 140 |
0001010 | 92 |
0001011 | 78 |
0001100 | 79 |
0001101 | 120 |
000111 | 224 |
001000 | 184 |
001001 | 247 |
001010 | 136 |
001011 | 153 |
0011000 | 55 |
0011001 | 121 |
001101 | 204 |
001110 | 153 |
0011110 | 120 |
0011111 | 99 |
010000 | 243 |
010001 | 212 |
010010 | 119 |
010011 | 120 |
010100 | 149 |
010101 | 177 |
010110 | 127 |
010111 | 126 |
011000 | 134 |
011001 | 169 |
011010 | 123 |
0110110 | 42 |
0110111 | 66 |
011100 | 132 |
0111010 | 33 |
0111011 | 88 |
01111 | 306 |
1000000 | 43 |
1000001 | 64 |
100001 | 137 |
1000100 | 96 |
1000101 | 136 |
100011 | 256 |
100100 | 164 |
1001010 | 122 |
1001011 | 132 |
100110 | 138 |
100111 | 146 |
101000 | 135 |
101001 | 138 |
10101 | 428 |
101100 | 136 |
101101 | 137 |
101110 | 215 |
101111 | 198 |
110000 | 168 |
110001 | 230 |
11001 | 431 |
110100 | 136 |
110101 | 136 |
110110 | 145 |
110111 | 204 |
111000 | 132 |
111001 | 136 |
111010 | 169 |
111011 | 219 |
111100 | 135 |
111101 | 134 |
111110 | 177 |
111111 | 213 |
Average | 154,75 |
I have rebased my code on Maidsafe new version and I got similar results. The number of sections is now only 65, which is not very far from previuos result (71). Maybe the difference is due to the new drop probability distribution, which I suppose is a better model of outside world (not under vaults control).
I also noticed that the age penalty for a rejoining node is now less severe (minus 1 instead of divided by 2). Maybe this is a good thing for people who need to restart their vault.
After that I made an experiment and added 1 to the number of trailing zeros so that a relocation is done at each event and not just every other event. The results are better (but still not perfect):
-
57843 rejections (instead of 70874)
-
102 sections (instead of 65)
Then I replaced boolean norejectyoung parameter by a numerical one that indicates the number of allowed young peers in a section. Modifying this parameter can generate slightly better results:
Max Young | Rejections | Sections |
---|---|---|
1 | 57843 | 102 |
2 | 56780 | 119 |
3 | 56242 | 104 |
4 | 56002 | 83 |
We can observe that the optimum is to allow 2 young nodes (age = 1) in a section.
I have updated my repository with this code and the specific command generate this configuration is:
AGE_SEED="[1070056145, 2737779209, 2756500248, 3832683655]" ./target/debug/ageing_sim --max_young=2
Complete results for this configuration are:
Prefix | Count |
---|---|
0000000 | 129 |
0000001 | 260 |
0000010 | 193 |
0000011 | 286 |
0000100 | 139 |
0000101 | 269 |
000011 | 562 |
0001000 | 131 |
0001001 | 244 |
0001010 | 120 |
0001011 | 196 |
00011000 | 19 |
00011001 | 70 |
0001101 | 117 |
0001110 | 130 |
0001111 | 144 |
0010000 | 141 |
0010001 | 212 |
001001 | 468 |
0010100 | 120 |
0010101 | 245 |
0010110 | 226 |
0010111 | 236 |
0011000 | 178 |
0011001 | 203 |
0011010 | 155 |
0011011 | 168 |
001110 | 437 |
001111 | 562 |
01000000 | 78 |
01000001 | 112 |
0100001 | 244 |
0100010 | 151 |
0100011 | 244 |
010010 | 392 |
01001100 | 74 |
01001101 | 121 |
0100111 | 192 |
0101000 | 142 |
0101001 | 219 |
0101010 | 187 |
0101011 | 179 |
0101100 | 172 |
0101101 | 231 |
0101110 | 97 |
0101111 | 112 |
0110000 | 122 |
0110001 | 144 |
0110010 | 170 |
0110011 | 173 |
0110100 | 197 |
0110101 | 187 |
011011 | 440 |
011100 | 438 |
0111010 | 94 |
0111011 | 94 |
0111100 | 149 |
0111101 | 191 |
0111110 | 139 |
0111111 | 154 |
1000000 | 132 |
1000001 | 201 |
100001 | 560 |
10001000 | 56 |
10001001 | 77 |
1000101 | 173 |
1000110 | 130 |
1000111 | 136 |
100100 | 420 |
1001010 | 138 |
1001011 | 192 |
1001100 | 131 |
1001101 | 170 |
1001110 | 77 |
1001111 | 114 |
1010000 | 141 |
1010001 | 217 |
1010010 | 175 |
1010011 | 280 |
1010100 | 61 |
1010101 | 124 |
1010110 | 120 |
1010111 | 166 |
1011000 | 131 |
1011001 | 158 |
101101 | 467 |
1011100 | 192 |
1011101 | 332 |
101111 | 495 |
110000 | 392 |
1100010 | 168 |
1100011 | 192 |
1100100 | 177 |
1100101 | 163 |
11001100 | 13 |
11001101 | 48 |
1100111 | 69 |
1101000 | 131 |
1101001 | 160 |
110101 | 401 |
1101100 | 144 |
1101101 | 174 |
110111 | 468 |
111000 | 444 |
1110010 | 205 |
1110011 | 225 |
1110100 | 189 |
11101010 | 110 |
11101011 | 151 |
1110110 | 154 |
1110111 | 228 |
1111000 | 142 |
1111001 | 148 |
11110100 | 65 |
11110101 | 133 |
1111011 | 193 |
111110 | 438 |
1111110 | 206 |
1111111 | 292 |
Average | 200 |
Cheers for the testing, just a small point.
I suspect age start at 4 is actually here we want to be. 1-3 is just work for the network, but 4 means a node does some work and the network does minimum.
No problem, just ask .
I tested with init_age = 4 and different values for max_young parameters. The results were not good.
Then I realized that, in my modification to insure a relocation is done for every event, I must add init_age instead of simply 1. With this modification the results are much better:
-
Rejections: 44357
-
Sections: 2073
-
Average section size: 19.33
But in return the simulation is now much longer to execute (about 53 mn), so I have only tested with max_young=1. I will test other values later. But I think that a great step has already been made.
Good numbers and excellent work.
If possible, can list the nodes by final age?
Some more numbers:
I have tested with different values for max_young parameter and the results are very good:
Max Young | Sections | Rejections |
---|---|---|
∞ | 2360 | 0 |
1 | 2073 | 44357 |
2 | 3642 | 31845 |
3 | 4042 | 25093 |
4 | 4122 | 21000 |
5 | 4108 | 17795 |
6 | 4125 | 15389 |
7 | 4098 | 13608 |
8 | 4088 | 11936 |
9 | 4080 | 10745 |
10 | 4084 | 9651 |
There is large plateau with more than 4000 sections. This means we don’t need to be very precise on the number of young nodes allowed in a section, though we cannot completely remove the control, as demonstrated by the first line.
Then I have specifically analyzed the results with max_young=6 (meaning that 6 young nodes are allowed in a section). Here are different metrics computed from the number of nodes in each section, group by prefix length:
Prefix len | Count | Average | Min | Max |
---|---|---|---|---|
11 | 55 | 27.62 | 15 | 48 |
12 | 3902 | 16.58 | 10 | 32 |
13 | 168 | 14.10 | 10 | 22 |
All | 4125 | 16.62 | 10 | 48 |
Count is the number of sections having the prefix length (for example there are 55 sections with prefix length 11), and average, min, max are aggregates values computed on the number of nodes of these sections (for example the average size of these 55 sections is 27.62 nodes).
Lastly, for @digipl, the nodes by final age in this configuration:
Age | Count |
---|---|
1 | 0 |
2 | 0 |
3 | 0 |
4 | 14422 |
5 | 17373 |
6 | 16861 |
7 | 9551 |
8 | 5422 |
9 | 2702 |
10 | 1429 |
11 | 626 |
12 | 149 |
13 | 28 |
14 | 3 |
15 | 1 |
Then I have specifically analyzed the results with max_young=6
Excellent work again
Just wondering what the plateau situation is like in comparison. Is it all length 12? Is there some “control system” mechanism that drives the optional prefix length to be 12 and sections eventually become mostly prefix length 12.
I think it might be beneficial to see that analysis for all above max_young == 6 and at least for max_young == 10