Merge problem still present


#1

The merge problem reported in the other forum is still present:

I don’t think the problem was caused by the crust issue, because I still observe it in my local network tests (with routing 0.30.0). Indeed, merge operations take place very rarely.

To reproduce the problem, you just have to create about 22 nodes (until the first split occurs) and then remove them one by one. The merge never occurs.

Please find here the log of such an experiment, in which a node is added every 10 s and then one is killed every minute.

This is a simplified test involving short prefixes (0 and 1), but the problem also happens with longer prefixes.


#2

Cheers for pinging this through; it certainly helps :thumbsup: In this case, though, I don’t think the issue is the same. While both issues have impacted merges, that one was a lot more general and affected more RPCs, whereas this seems to be a problem specific to the merge process, introduced by the latest cleanup.

Just to confirm, could you please try any one of the following, repeat the process you detailed, and see if the merge triggers?

Option A: Only kill nodes every 100 seconds rather than every 60. (This probably takes the longest to try out, since you have to wait that time, so maybe skip it unless editing the code and rebuilding would take longer.)

Option B: In routing’s src/peer_manager.rs, try updating the function should_merge to:

/// Returns whether we should initiate a merge.
pub fn should_merge(&mut self) -> bool {
    self.routing_table.should_merge()
}

Option C: In routing’s src/states/node.rs, update the following line in the function dropped_peer:

if try_reconnect && peer.valid() && self.is_approved {

to

if false && try_reconnect && peer.valid() && self.is_approved {

While none of these is the ideal solution, they should at least showcase the issue better: if any of these options makes the merge trigger, the cause would be reconnecting nodes preventing the merge from being initiated. Properly excluding such nodes is a backlog task; until that is actioned, we are relying on the reconnect expiry timer of > 90 s.
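To make the suspected mechanism concrete, here is a toy model under stated assumptions: the struct, field names and threshold below are illustrative only, not routing’s actual types. The idea is that a section should merge once it shrinks below the minimum section size, but if dropped peers that are expected to reconnect are still being counted, the apparent size never drops below the threshold:

```rust
// Toy model, not the actual routing code. All names here are assumptions
// made for illustration.
struct Section {
    connected: usize,       // peers actually present in the routing table
    reconnecting: usize,    // dropped peers we still expect to come back
    min_section_size: usize,
}

impl Section {
    // Suspected buggy behaviour: reconnecting peers keep the section
    // looking big enough, so the merge is never triggered.
    fn should_merge_counting_reconnects(&self) -> bool {
        self.connected + self.reconnecting < self.min_section_size
    }

    // What Option B approximates: consult only the routing table's
    // actual membership when deciding whether to initiate a merge.
    fn should_merge(&self) -> bool {
        self.connected < self.min_section_size
    }
}
```

For example, with 7 connected peers, 2 reconnecting peers and a minimum section size of 8, the first variant says no merge is needed while the second correctly asks for one.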


#3

Option B succeeded (a merge occurred). This is the only option I tried.

I did a simple test: empty prefix => split at 22 nodes => prefixes 0 & 1 => merge at 16 nodes => empty prefix.
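That lifecycle can be sketched as a tiny state machine. The thresholds below are simply the numbers observed in this test (split at 22 nodes, merge at 16), not routing’s actual rules, which are derived from parameters such as the minimum section size:

```rust
// Sketch of the observed lifecycle: one section with the empty prefix
// splits into sections 0 and 1, which later merge back. The thresholds
// are the numbers observed in this particular test, not routing's rules.
#[derive(Debug, PartialEq)]
enum Topology {
    Single, // one section with the empty prefix
    Split,  // two sections, prefixes 0 and 1
}

fn topology_after(nodes: usize, current: Topology) -> Topology {
    match current {
        Topology::Single if nodes >= 22 => Topology::Split,
        Topology::Split if nodes <= 16 => Topology::Single,
        other => other, // hysteresis: no change between 17 and 21 nodes
    }
}
```

The gap between the two thresholds acts as hysteresis: a network hovering around the split point does not oscillate between splitting and merging.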

Later I will do a more complex test with longer prefixes.

Edit: I did this longer test with 165 nodes. This generated a series of splits leading to the following sections:

  • 000
  • 001
  • 010
  • 0110
  • 0111
  • 100
  • 101
  • 110
  • 111

Then I deleted the nodes one by one, which successfully triggered the merge operations, leading finally to the section with empty prefix.
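The cascade of merges follows simple prefix arithmetic: sibling sections (prefixes differing only in their last bit) merge into their common parent. A minimal sketch of that arithmetic, independent of the real routing types (in reality a merge is only triggered when a section shrinks below the minimum size; this just shows how the prefixes collapse):

```rust
// Returns the sibling of a binary prefix (same prefix with the last bit
// flipped), or None for the empty prefix, which has no sibling.
fn sibling(p: &str) -> Option<String> {
    let cut = p.len().checked_sub(1)?;
    let (head, last) = p.split_at(cut);
    Some(format!("{}{}", head, if last == "0" { "1" } else { "0" }))
}

// Repeatedly merges sibling pairs into their parent prefix until no
// further merge is possible.
fn merge_all(mut sections: Vec<String>) -> Vec<String> {
    loop {
        sections.sort();
        let mut next: Vec<String> = Vec::new();
        let mut merged = false;
        let mut i = 0;
        while i < sections.len() {
            if i + 1 < sections.len()
                && sibling(&sections[i]).as_deref() == Some(sections[i + 1].as_str())
            {
                // Merge the sibling pair into their shared parent prefix.
                next.push(sections[i][..sections[i].len() - 1].to_string());
                i += 2;
                merged = true;
            } else {
                next.push(sections[i].clone());
                i += 1;
            }
        }
        sections = next;
        if !merged {
            return sections;
        }
    }
}
```

Applied to the nine sections listed above, this collapses them pass by pass (000 and 001 into 00, 0110 and 0111 into 011, and so on) until only the empty prefix remains.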

Not everything is perfect: just after a merge, some nodes are missing from some routing tables. But the situation rapidly returns to normal.

Note: If you are interested, the log file of this test is stored here on GitHub.


#4

:thumbsup:

That is actually expected and should be fine: pre-merge, incompatible disjoint sections would not be connected, and they only establish connections once they realise they are required to, when the merge completes and their prefixes become compatible. So eventual consistency should ensure they end up re-establishing their connections. If they failed to do so, that would of course be an issue too, but in this case it seems to have held up as expected.


#5

Today’s PR in routing seems to have corrected the merge problem: I observed that merge operations triggered successfully while progressively shutting down a mid-size local network (52 nodes in 4 sections).