Data Chains - Deeper Dive

tfa · January 28, 2018, 7:36pm

There are more churns in simulation 2 simply because the network is bigger, and it is bigger because there are less nodes that cannot connect. So, no downside in this area.

The number of churns is even smaller if we look at the figures when simulation 2 had a similar number of nodes:

Metrics	Simulation 1 at step 890000	Simulation 2 at step 260000
Adds	800814	234114
Drops	62523	18053
Rejoins	26663	7833
Relocations	503815	376067
Rejections	578686	56978
Churns	1912122	1021403
Sections	8258	8358
Section nodes	161447	163394
Left nodes	26287	3997
Rejection rate	72%	27%

Simulation 2 is much better than simulation 1, except in the number of distinct prefix lengths which varies from 3 to 4: the values are better (I mean prefix lengths are greater) but they are slightly more dispersed than in simulation 1.

So, I improved the program to make sure that when a node is relocated to a section it is placed in the section half having less adults. The simulation is running right now. Results are expected in a few hours.

tfa · January 28, 2018, 9:33pm

Result with the new algorithm to balance the sections surpasses all expectations: there is only one prefix length, except during the period of transition from one prefix length to the next one where there are 2.

As a consequence, the number of section is now a stepped curve equaling exactly a power of 2 between each transition (in green):

The rejection rate has large oscillations with minimum values at transitions:

The number of nodes is slightly better:

But the real gain of the new algorithm is the homogeneity of the network.

digipl · January 29, 2018, 10:08am

Nice work Thierry, as always.

The question I ask myself is whether it is possible that the network can get more than half a million Vaults in a relatively short time. Which leads me to insist that, beyond the search for technical solutions, socio-economic factors are critical and must be taken into consideration.

What is the expected number of Vaults at the beginning and what growth is expected?

Do we want a Big Bang type of growth or a slower one?

What type of distribution is going to be chosen? Linear, exponential, logarithmic …? And to what extent will this affect the number of Vaults and their growth?

Who and why would someone spend resources attacking the network?

etc etc…

Yesterday a play with the Adam simulation. Those are the result until 12000000 Iterations with standard parameters. Strange behaviour, huge rejections and from 6000000 iterations there are no change in the metrics.

Iterations: 12000000, nodes: 1183, sections: 68, merges: 78, splits: 1217, relocations: 1143 rejections: 510853029 }, AgeDist { min: 4, max: 1079, avg: 258.13 }, SectionSizeDist { min: 9, max: 27, avg: 17.40 }, PrefixLenDist { min: 1, max: 8, avg: 7.12 }, MaxPrefixLenDiff: 7

And this result with relocation-strategy youngest first and 50000000 iterations

iterations: 50000000, nodes: 1606, sections: 94, merges: 104, splits: 1725, relocations: 2710 rejections: 378513740 }, AgeDist { min: 4, max: 1247, avg: 284.62 }, SectionSizeDist { min: 9, max: 32, avg: 17.09 }, PrefixLenDist { min: 1, max: 9, avg: 7.62 }, MaxPrefixLenDiff: 8

tfa · January 29, 2018, 1:07pm

I don’t think we can choose the kind of growth because it will depend on farmers willingness to provide vaults and keep them running.

We can help the growth by:

Providing incentives for farming (this is what is intended with safecoin rewards)
Reducing barriers to farmers wanting to participate (by that I mean minimizing the rejection rate)

But apart these elements, the growth is largely outside our control.

That’s strange. I started with @bart code and I didn’t studied @madadam code nor did I test it. So, I cannot tell where the problem is, but these results are not the expected ones. Maybe the code isn’t ready yet. It could be also a parameterization problem.

That could be the parameter that causes the problem. @bart’s code implements oldest first strategy. It is hard coded, and I left it as it is, because @mav made a convincing case that it is the right choice (youngest first is double discrimination for relocation of old nodes).

rob · February 5, 2018, 5:23am

5 posts were split to a new topic: Network Health Metrics

tfa · February 4, 2018, 8:58pm

This weekend I have added 2 small modifications:

I have implemented my idea to replace relocation to the neighbouring weakest section by a 2 steps relocation: first to a distant random section and then to the weakest neighbour of this section. This idea was expressed in this post.
Second one is not an improvement but a simplification: I have used the random() function to compute, well … the random section. But in reality vaults cannot use this function because the result will be different between each vault. The randomness must be computed from a common data, typically the hash of an agreed state.

For example, current routing code relocates a joining node to the hash of its name concatenated with the ids of its 2 closest nodes. I don’t think that this algorithm will be kept, but I don’t need to know what the new algorithm will be. The important characteristic for the simulation is that it is a random address, how it will be computed in the real network doesn’t matter.

The same is true for the age of the node to relocate: In @bart code It was computed from the hash of the event, but when we look at @madadam repo, it has been changed to another thing. I make the same argument here: we don’t need to know the final implementation, we only need to compute a random age.

So, I have replaced this usage of hash function by the random function again. Then the hash has been completely removed, because it was not being used anywhere else.

Results are very similar but are slightly less stable (very slightly). I observed two instances with 3 distinct prefix lengths (for example at iteration 310000):

Prefix len	Count	Average	Min	Max	Standard dev
12	1	17.00	17	17	None
13	247	21.62	11	25	3.35
14	15886	12.43	8	16	1.20
All	16134	12.57	8	25	1.69

Also, there is sometimes a merge that happens during a stable period with a power of 2 number of sections (for example at iterations 390000, 400000, 410000):

Prefix len	Count	Average	Min	Max	Standard dev
14	16384	15.38	10	20	1.11
All	16384	15.38	10	20	1.11

Prefix len	Count	Average	Min	Max	Standard dev
13	1	19.00	19	19	None
14	16382	15.77	12	20	1.16
All	16383	15.77	12	20	1.16

Prefix len	Count	Average	Min	Max	Standard dev
14	16384	16.16	12	21	1.18
All	16384	16.16	12	21	1.18

Intuitively, I would have thought that relocating nodes at distance is more secure, but these results are not proving this. Though the difference is tenuous: one section over 16000 and I don’t show the curves because they are indistinguishable from previous ones.

Maybe I should add a parameter to allow comparison with more tests. Something like RelocationScope with 2 values (Local and Distant).

mav · February 5, 2018, 4:45am

I think this mainly applies to attack scenarios like the google attack where particular sections are flooded. This is where the relocation distance matters more.

digipl · February 5, 2018, 9:22am

Exactly. The idea of distant random relocation is not better metrics but avoid possible attacks taking advantage of the relocation in the neighbourhood. An attacker could focus their efforts on a few sections, in whose vicinity has several nodes, and, via DDOS, eliminate certain competitors or even forcing merges. This would allow to be in positions of advantage to gain control of a group.

This type of attack has similarities to games like Go where machine learning via artificial neural network, like AlphaGo, are proving to be extremely effective.

tfa · February 11, 2018, 9:39pm

This weekend I greatly optimized network simulation. For example, for 900000 iterations the duration went from 5h48 to 1mn20.

Here are the optimizations, with new duration of 100000 iterations displayed at each step (initially 2mn52):

Reverse fields order in Prefix structure so that entries in btree maps are ordered by bits value instead of len value. This allows to get the section containing a node, by finding the first entry less than its name instead of iterating over all the entries.
=> 2mn48
To get the neighbouring sections, make a series of targeted tests instead of testing all possible sections. The algorithm is the following: for each position in prefix:
- flip corresponding bit from the prefix and test it
- if test is unsuccessful, shorten the prefix and test again
- shorten until success or until modified bit is reached
Note: Testing longer prefix is of no use because corresponding section are healthier and so, we don’t want to relocate nodes to them.
=> 2mn44
Cache sections drop weights. Recompute it only when a section is split or merged and update it when a node is added or removed. Use this cache to compute total network drop weight.
=> 2mn
Use this cache to compute random node to drop
=> 1mn29
captured output at end of simulation instead of every event.
=> 3s

Additionally, I have:

corrected the problem that only neighbouring sections were considered to find an unhealthy section. I was forgetting that the distant section itself could be unhealthy too. The result is slightly improved (3000 more nodes on a total of 580000 after 900000 iterations).
removed inc_age parameter because I don’t see why age of nodes would be incremented at section merge or split (besides only those in the first half were incremented in case of split).
added a parameter to specify if node relocations are local or distant
displayed the number of distinct prefix lengths. A homogeneous network should have at most 2 distinct prefix lengths.

With these optimizations simulations with 10000000 iterations become possible (they take 2h52mn on my PC). Here are the results with max_young=4, init_age=4, relocation_rate=aggressive, relocation_scope=local:

With relocation_scope=distant the results are similar except for the discrepancies indicated last week, with appearance of 3 distinct prefix lengths:

Sections with 3 distinct prefix lengths appear in 129 iteration with sometimes several sections having the shorter prefix, up to 6 sections like at iteration 9670000:

Prefix len	Count	Average	Min	Max	Standard dev
17	6	16.83	14	19	1.94
18	2560	19.01	9	25	4.86
19	519144	12.06	8	16	1.16
All	521710	12.09	8	25	1.30

I think the explanation is the following: with random relocations applied on thousands of sections it happens that some of them lose a lot of nodes. With local relocations, this will create an excess of nodes in the sections around such a weak section, so there is a high probability that these nodes will come back to their initial section. With distant relocation the nodes are sent far away and the probability that a node is relocated to the neighbourhood is low, or more exactly standard without any consideration that there is a weak section here. When this weak section has less than 8 nodes, it will merge and create a new section with a shorter prefix length, hence the appearance of 3 distinct prefix lengths.

Distant relocations are more secure because they prevent section targeting, but this instability is harmful. So, to try to solve both problem I have added a third value to relocation_scope: mixed value which indicates a 50%/50% probability between local and distant relocation.

Simulation with relocation_scope=mixed shows that the instability has not totally disappeared, but the results are better: there are only 9 iterations with 3 distinct prefix lengths and each time there is only one section with the shorter prefix.

Then I wanted to find the right mix of local and distant relocations. So, I have renamed this parameter and modified it to take a numeric value for the distant relocation probability. I have got the following results with various values for distant relocation probability:

Distant relocation probability	Iterations with 3 distinct prefix lengths
100%	129
70%	96
50%	9
30%	29
0%	0

The result for 30% is disappointing because it means we still get 3 distinct prefix lengths with a low rate of distant relocations, which is both unsecure and unstable.

Then I modified the program, to not relocate a node when relocation would trigger a section merge (meaning its size is GROUP_SIZE before relocation). That was not enough so I changed the condition to size <= GROUP_SIZE + 1 (meaning that I also don’t to want to risk a merge if a node drops the section afterwards). There is still one iteration with 3 distinct prefix lengths at 100% distant relocation probability, but the end is near.

Last try was with the same condition but with 80% distant relocation probability, and at last there are at most 2 distinct prefix lengths all along the simulation. With 80% probability of distant relocation each time a node is relocated, targeting a section remains impossible.

So, finally I have got a network that is both stable and secure.

Another interesting feature of this algorithm is that the minimum number of nodes in a section is higher (9 instead of 8):

dirvine · February 14, 2018, 3:52pm

The value here is beyond any expectation @tfa Thank you very much indeed.

Qi_Ma · February 14, 2018, 4:44pm

not sure, but this maybe an overkill. In the production code, there is not such knowledge of the whole network.
Then trying to figure out which section is the least healthy one will be difficult and at least involves certain amount of messages to be exchanged.

tfa · February 14, 2018, 6:06pm

There is no need to know the whole network. The distant section is chosen at random. If this section has a neighbour that is in worse shape than itself then the node is relocated to this neighbour. This selection is done locally by the distant section.

I think there is something similar that is already implemented in current routing code for relocation of joining nodes (in handle_expect_candidate).

Qi_Ma · February 15, 2018, 1:45am

Yes, you are right.
As long as the least unhealthy section is only among the neighbours, not among the whole network, there is no need to have the knowledge of whole network.

There might be multiple hops of relocation, and each hop incurs a group consensus to be achieved. But I think it is acceptable.

tfa · February 15, 2018, 6:46am

For now there is only one hop in the simulation program.

A recursive approach is an improvement I should consider (looking at the selected neighbour’s neighbours until there isn’t any worse one).

Qi_Ma · February 15, 2018, 8:40am

Just want to confirm that this says total of 580000 nodes, but the diagrams below showing total of 520000 sections and 25 nodes per section which means a total of 13m nodes.
which one is correct?

tfa · February 15, 2018, 9:13am

For 900000 iterations, the figures are on the left side of the diagrams, so they are hard to estimate and I have the real figures only available at home.

At this iteration there are approximately 18 nodes per section on average and exactly 32768 sections (the nearest power of 2), which makes grossly 590000 nodes.

Qi_Ma · February 15, 2018, 10:06am

@tfa: ah, never mind, I mis-read the iterations in the diagram.
The numbers are ok.

madadam · February 16, 2018, 11:14am

Hey @tfa,

Great work! Don’t know if you are aware, but we’ve been working for some time on a new simulation (hosted at https://github.com/maidsafe/datachains_sim). The main difference from the old one is how Joins and Drops are calculated. In the old sim, there is at most one Join and one Drop event per iteration. In the new sim, the number is proportional to the number of sections in the network. This creates much higher pressure on the network (likely higher than the real network will be experiencing) which should in theory uncover potential problems sooner. The flip side is that the new sim is quite a bit slower than the old one.

We are currently looking into the changes you made to see how we can potentially apply them to the new sim. We’ll report back with out results as soon as we have some. Please let us know if you’re interested in updating your version of the simulator to the rules we’re using in the new sim and we’ll try to help with it.

tfa · February 16, 2018, 8:04pm

To take into account that network activity is proportional to its size, can’t we just suppose that one iteration corresponds to a fraction of a unit of time and this fraction decreases as the network grows (because the number of events increases per unit of time). Time is a growing function of iterations, but the growth is decreasing as time passes, something like log functions.

No flip side in fact: the iterations are slower with new version, but more iterations are needed in the old version to simulate the same real time duration.

Thanks for the offer. But not for now, I am working on a new metric more precise than number of distinct prefix lengths to express network homogeneity. Possibly later.

tfa · February 18, 2018, 6:45pm

I have been thinking about sections with distinct prefix lengths: What is the problem when there are 3 distinct values?

The simple answer is that a section with the shorter prefix manages an XOR range 4 times larger than another one with the longer prefix. This means it will store 4 more times data on average.

But the complete answer is more complex because the shorter prefix section is possibly issued from a merge (meaning it has more vaults to manage the larger range), whereas the longer one can be issued from a split (meaning it has less vaults but manages a smaller range).

So, the important metric is not the prefix length range alone and the number of vaults in the sections must be incorporated to it.

For this purpose I have added a new metric called “density gap” which is the ratio of highest section density over lowest section density where density is:

number of vaults in the section / section size.

and section size is defined in following table (where Lmax is the greatest prefix lengths):

Prefix length	Section size
Lmax	1
Lmax-1	2
Lmax-2	4
…	…
Lmax-n	2^n

When there is only one prefix length, the density gap is simply the vault count of the most populated section divided by the vault count of the least populated one. When there are several distinct prefix lengths the section size divisor takes into account the relative size of the section in the XOR space.

The density gap is a number greater than 1 and the lower the value is, the more homogeneous the network is.

Its range is similar to the number of distinct prefix lengths and so the 2 curves can be displayed in the same chart, with the same ordinate scale.

I have also added a new parameter (relocation_margin) to control more precisely inhibition of relocations from a small section. Last week, the formula I came up with was: adult count <= GROUP_SIZE + 1. I have changed it to: adult count <= GROUP_SIZE + relocation_margin, which defines the following behaviors:

relocation_margin=-1: no inhibition, this is the original behavior from @bart’s code
relocation_margin=0: relocation is inhibited if it would trigger a section merge
relocation_margin=1: relocation is inhibited if a subsequent node drop would trigger a section merge (same behavior as last week version)
relocation_margin=2: relocation is inhibited if 2 subsequent node drops would trigger a section merge
…

I have launched many simulations, all with 10_000_000 iterations, but with various values for the parameters. Here are the results for the 2 main metrics:

number of nodes at the end of simulation (which must be maximized)
maximum value of density gap all along the simulation (which must be minimized)

Log File	Nodes	Max Gap
raggressivei4y4m0d100.txt	6536206	4.27
raggressivei4y4m0d80.txt	6537338	4.53
raggressivei4y4m1d100.txt	6595668	4.25
raggressivei4y4m1d80.txt	6583353	3.56
raggressivei4y4m2d100.txt	6327650	3.40
raggressivei4y4m2d80.txt	6308415	3.20
raggressivei4y4m3d100.txt	1911206	4.25
raggressivei4y4m3d80.txt	953994	5.07
rstandardi4y4m1d100.txt	3047053	3.20
rstandardi4y4m1d80.txt	3043159	3.08
rstandardi4y2m1d100.txt	2646656	3.00
rstandardi4y2m1d80.txt	2639692	3.00
raggressivei4y4m1d60.txt	6583749	3.78
raggressivei4y4m2d60.txt	6292031	3.20

In all cases max gap is greater than 3, which means that sometimes there are nodes that manage an area in XOR space at least 3 times greater than some other nodes. Dispersion of max gap is very narrow so we can favor final number of nodes.

A good compromise seems to be raggressivei4y4m1d80 which corresponds to the following parameters:
init_age: 4, split_strategy: Complete, max_young: 4, iterations: 10000000, summary_intervals: 10000, growth: (90, 7), structure_output_file: None, drop_dist: Exponential, relocation_rate: Aggressive, distant_relocation_probability: 0.8, relocation_margin: 1

Here are the corresponding diagrams:

As I said in my previous post, time is not linear, but is more and more streched as the number of iterations grows, so the horizontal axis should be modified with a logarithmic scale if we want to show results in function of time instead of iterations. For example the strait line representing the total number of nodes is replaced by an exponential curve. Of course this is true only while the number of node creation is proportional to the network size, which cannot hold indefinitely.

Strangely, Excel doesn’t allow selection of logarithmic scale on x axis of a line chart. The workaround to get the chart was the following:

transform the line chart to a XY (scatter) chart
select logarithmic scale on x axis
transform the chart back to a line chart

I have created a github repo that contains all my log files and Excel files. The Excel files were manually edited, but a simple utility (extract.sh) allows extraction of columns from the log files.

This will conclude my tests on simulations. The parameters I have selected (-raggressive -i4 -y4 -m1 -d80) favor network growth by keeping rejection rate largely under 33%. Density gap is a bit high (3.56) but anyway it cannot go under 3. A 80% distant relocations percentage is secure enough, lowers density gap from 4.25 to 3.56 and has minimal impact on the final number of nodes (6583353 instead of 6595668).