Data Chains - Deeper Dive

rob · December 26, 2017, 11:26pm

“Excellent” (with the finger tapping)

tfa · December 27, 2017, 11:04pm

@bart, I wanted to play with your simulator, but results are not correct with always only one section. Is there anything special to do to make it work or is your repository currently in a bad state?

@mav, you have entirely rewritten the test in Go, that’s very impressive.

mav · December 28, 2017, 2:28am

I had originally developed the simulation for a few different purposes before the ageing changes came out. I have some big plans for the simulation codebase. Hopefully those results will come to fruition early next year. So far ageing has taken a good chunk of time away from the original intentions (but in the best possible way).

bart · December 28, 2017, 8:02am

@tfa, Yes - that’s because it simulates an older set of rules that would cause such behaviour. We have changed the rules since then, but there wasn’t time to update the simulator. I’ll probably get to that at some point, but for now we have some tasks with higher priority. Sorry it’s not as useful as it could be!

jlpell · January 16, 2018, 10:23pm

What part of the OP were you referring to? The “Split” section, “Static group and quorum sizes”, somewhere else?

mav · January 16, 2018, 10:56pm

It’s in the Block States > Dead(PeerId) > Trigger for voting:

“we take the Hash of the event H. Then if H % 2^age == 0 for any peer (sorted by age ascending) in our section, we relocate this node to the neighbour that has the lowest number of peers.”

The bold text indicates younger nodes are preferred for relocation. But rfc0045 says prefer older nodes, so there is a conflict.

tfa · January 17, 2018, 12:07am

I think I have found the problems in the algorithm:

The sections that have a shorter prefix should be chosen in priority as target of relocations, even if they have more nodes. This is the main problem and it is only one line to modify.
There is a mysterious hard coded rule that suppress node ageing when the event concerns a non-adult node in a section with a prefix length > 4 that doesn’t do any good.

With these modifications the simulation is beginning to fly (with 75 sections created). It is not yet perfect because there are still a lot of young node rejections (70%), but the situation is unblocked.

I also think that relocations implementing a change of owning sections during a split or a merge shouldn’t trigger other relocations in cascade to avoid destabilizing newly created sections. This means that the Gone and Live events generated by a split or merge operations shouldn’t count for node ageing. The result is almost the same: 71 sections but there are less merge events (280 instead of 377) which demonstrates a more stable network. This one is debatable.

I have just uploaded my modifications in my fork, but I haven’t analyzed all the consequences. If you want to study them.

jlpell · January 17, 2018, 12:19am

Thanks for the clarification. Instinctively, I would think that you would want to relocate an elder rather than a young node to an unhealthy neighbouring section. The reason for this is that the if a neighbour is already in trouble, you don’t want to send an untrusted adversary its way. This seems like a preferred option as long as the section from which the chosen elder is leaving has enough elders to remain healthy. I may have my thinking backward on this, but this is the sentiment that first comes to mind…

mav · January 17, 2018, 3:40am

Agreed. I would maybe rephrase ‘untrusted adversary’ to a ‘lesser trusted adversary’ but otherwise I think the reasoning is ok.

Also there is this reason:

tfa · January 17, 2018, 6:48am

I didn’t explain why this is needed: what we observe currently during the simulations are successive splits that create a completely unbalanced network, for example () => 0 & 1, 0 => 00 & 01, 01 => 010 & 011, 010 => 0100 & 0101, 0101 => 01010 & 01011. Here there are 6 sections: 1, 00, 011, 0100, 01010 and 01011 with distinct prefix lengths instead of similar ones. At one time the shorter prefix needs a merge which triggers a generalized merge of all the sections that recreates the section with zero length prefix ().

And then the cycle begins again. This creates a oscillating phenomenon from 1 section to a few ones.

By selecting the target section for relocation with the shorter prefix, we insure that a section with a long prefix won’t be split and so an unbalanced network is not created.

Of course, if there are several sections with a short prefix then the smaller section must be selected among them.

digipl · January 17, 2018, 4:05pm

Nice catch, Thierry. Avoid, at all costs, a cascade merging, the greatest fear, seems the most reasonable option.

And adding a step in the random relocation, forcing the new node to find a key in the neighbouring section with shorter prefix, could be the best of both worlds.

madadam · January 18, 2018, 12:08pm

Hey @tfa, thanks for this! @bart is away for a couple of days, so I’ll be looking into this in the meantime. Let me get back to you once I have something.

madadam · January 19, 2018, 11:09am

Hey @tfa,

Your modifications certainly made things better, but it still isn’t enough. For example, we are still seeing sections of ~1000 nodes, which is way too much (ideal section size is between 20 - 30 nodes). We are currently trying different approach and will update as soon as we have some results.

tfa · January 19, 2018, 1:10pm

Yes, the algorithm must be improved because sections are still too big and there are too many node rejections, but I don’t observe as big sections as you do. With default settings (no arguments given) I get 71 sections with 154 nodes on average (min size is 33 and max is 431).

This is the result with my fork, but you have made some modifications since the fork, I will try to merge them when I have time, to see how things evolve.

My random seed is: AGE_SEED="[1070056145, 2737779209, 2756500248, 3832683655]"

Complete result is:

Prefix	Count
0000000	67
0000001	102
000001	201
000010	238
000011	222
000100	140
0001010	92
0001011	78
0001100	79
0001101	120
000111	224
001000	184
001001	247
001010	136
001011	153
0011000	55
0011001	121
001101	204
001110	153
0011110	120
0011111	99
010000	243
010001	212
010010	119
010011	120
010100	149
010101	177
010110	127
010111	126
011000	134
011001	169
011010	123
0110110	42
0110111	66
011100	132
0111010	33
0111011	88
01111	306
1000000	43
1000001	64
100001	137
1000100	96
1000101	136
100011	256
100100	164
1001010	122
1001011	132
100110	138
100111	146
101000	135
101001	138
10101	428
101100	136
101101	137
101110	215
101111	198
110000	168
110001	230
11001	431
110100	136
110101	136
110110	145
110111	204
111000	132
111001	136
111010	169
111011	219
111100	135
111101	134
111110	177
111111	213
Average	154,75

tfa · January 19, 2018, 10:26pm

I have rebased my code on Maidsafe new version and I got similar results. The number of sections is now only 65, which is not very far from previuos result (71). Maybe the difference is due to the new drop probability distribution, which I suppose is a better model of outside world (not under vaults control).

I also noticed that the age penalty for a rejoining node is now less severe (minus 1 instead of divided by 2). Maybe this is a good thing for people who need to restart their vault.

After that I made an experiment and added 1 to the number of trailing zeros so that a relocation is done at each event and not just every other event. The results are better (but still not perfect):

57843 rejections (instead of 70874)
102 sections (instead of 65)

Then I replaced boolean norejectyoung parameter by a numerical one that indicates the number of allowed young peers in a section. Modifying this parameter can generate slightly better results:

Max Young	Rejections	Sections
1	57843	102
2	56780	119
3	56242	104
4	56002	83

We can observe that the optimum is to allow 2 young nodes (age = 1) in a section.

I have updated my repository with this code and the specific command generate this configuration is:

AGE_SEED="[1070056145, 2737779209, 2756500248, 3832683655]" ./target/debug/ageing_sim --max_young=2

Complete results for this configuration are:

Prefix	Count
0000000	129
0000001	260
0000010	193
0000011	286
0000100	139
0000101	269
000011	562
0001000	131
0001001	244
0001010	120
0001011	196
00011000	19
00011001	70
0001101	117
0001110	130
0001111	144
0010000	141
0010001	212
001001	468
0010100	120
0010101	245
0010110	226
0010111	236
0011000	178
0011001	203
0011010	155
0011011	168
001110	437
001111	562
01000000	78
01000001	112
0100001	244
0100010	151
0100011	244
010010	392
01001100	74
01001101	121
0100111	192
0101000	142
0101001	219
0101010	187
0101011	179
0101100	172
0101101	231
0101110	97
0101111	112
0110000	122
0110001	144
0110010	170
0110011	173
0110100	197
0110101	187
011011	440
011100	438
0111010	94
0111011	94
0111100	149
0111101	191
0111110	139
0111111	154
1000000	132
1000001	201
100001	560
10001000	56
10001001	77
1000101	173
1000110	130
1000111	136
100100	420
1001010	138
1001011	192
1001100	131
1001101	170
1001110	77
1001111	114
1010000	141
1010001	217
1010010	175
1010011	280
1010100	61
1010101	124
1010110	120
1010111	166
1011000	131
1011001	158
101101	467
1011100	192
1011101	332
101111	495
110000	392
1100010	168
1100011	192
1100100	177
1100101	163
11001100	13
11001101	48
1100111	69
1101000	131
1101001	160
110101	401
1101100	144
1101101	174
110111	468
111000	444
1110010	205
1110011	225
1110100	189
11101010	110
11101011	151
1110110	154
1110111	228
1111000	142
1111001	148
11110100	65
11110101	133
1111011	193
111110	438
1111110	206
1111111	292
Average	200

dirvine · January 19, 2018, 10:34pm

Cheers for the testing, just a small point.
I suspect age start at 4 is actually here we want to be. 1-3 is just work for the network, but 4 means a node does some work and the network does minimum.

tfa · January 20, 2018, 12:01am

No problem, just ask .

I tested with init_age = 4 and different values for max_young parameters. The results were not good.

Then I realized that, in my modification to insure a relocation is done for every event, I must add init_age instead of simply 1. With this modification the results are much better:

Rejections: 44357
Sections: 2073
Average section size: 19.33

But in return the simulation is now much longer to execute (about 53 mn), so I have only tested with max_young=1. I will test other values later. But I think that a great step has already been made.

digipl · January 20, 2018, 12:50pm

Good numbers and excellent work.

If possible, can list the nodes by final age?

tfa · January 21, 2018, 12:28am

Some more numbers:

I have tested with different values for max_young parameter and the results are very good:

Max Young	Sections	Rejections
∞	2360	0
1	2073	44357
2	3642	31845
3	4042	25093
4	4122	21000
5	4108	17795
6	4125	15389
7	4098	13608
8	4088	11936
9	4080	10745
10	4084	9651

There is large plateau with more than 4000 sections. This means we don’t need to be very precise on the number of young nodes allowed in a section, though we cannot completely remove the control, as demonstrated by the first line.

Then I have specifically analyzed the results with max_young=6 (meaning that 6 young nodes are allowed in a section). Here are different metrics computed from the number of nodes in each section, group by prefix length:

Prefix len	Count	Average	Min	Max
11	55	27.62	15	48
12	3902	16.58	10	32
13	168	14.10	10	22
All	4125	16.62	10	48

Count is the number of sections having the prefix length (for example there are 55 sections with prefix length 11), and average, min, max are aggregates values computed on the number of nodes of these sections (for example the average size of these 55 sections is 27.62 nodes).

Lastly, for @digipl, the nodes by final age in this configuration:

Age	Count
1	0
2	0
3	0
4	14422
5	17373
6	16861
7	9551
8	5422
9	2702
10	1429
11	626
12	149
13	28
14	3
15	1

rob · January 21, 2018, 12:45am

Excellent work again

Just wondering what the plateau situation is like in comparison. Is it all length 12? Is there some “control system” mechanism that drives the optional prefix length to be 12 and sections eventually become mostly prefix length 12.

I think it might be beneficial to see that analysis for all above max_young == 6 and at least for max_young == 10