Initial vault unable to rejoin network


#1

I’ve been playing around with a private network for a while. It’s fun, but I haven’t been able to get past one somewhat critical issue: the vault that starts the network is unable to rejoin (without the -f flag) if it ever disconnects from the network.

  1. I have a fairly simple config: network size set to 5 nodes, 5 VPSs running the exact same vault code/config (latest release), and everything looks about right.
  2. I launch one of the vaults with the -f (first) flag, then start the other 4 nodes.
  3. I can stop/restart the last 4 nodes in turn; they reconnect and everything seems fine.
  4. If I ever stop/disconnect the origin node, I receive errors (lost connection / no active children left) when trying to reconnect it (without the -f flag).

Starting the vault back up again results in:
I 18-12-15 10:45:12.926335 Created chunk store at /tmp/safe_vault_chunk_store.iE7sPMV1f6sy with capacity of 2147483648 bytes.
I 18-12-15 10:45:13.935510 Bootstrapping(92196a…) Lost connection to proxy PublicId(name: aa3332…).
I 18-12-15 10:45:14.940915 Bootstrapping(92196a…) Lost connection to proxy PublicId(name: c93281…).
I 18-12-15 10:45:15.941582 Bootstrapping(92196a…) Lost connection to proxy PublicId(name: 8902fa…).
I 18-12-15 10:45:16.947767 Bootstrapping(92196a…) Lost connection to proxy PublicId(name: 5e076d…).
E 18-12-15 10:45:17.954003 Bootstrapper has no active children left - bootstrap has failed
I 18-12-15 10:45:17.954727 Bootstrapping(92196a…) Failed to bootstrap. Terminating.

It seems that despite having the same config, the remaining nodes refuse connections from the origin node. Is there a max_children setting somewhere I haven’t seen, or something else causing the lost connections?


#2

Additional info:
If I disconnect another of the secondary nodes and attempt to reconnect it, it is also unable to rejoin. So basically, if my first node ever disconnects, the rest of the network is doomed.

I 18-12-15 12:06:05.914660 Created chunk store at /tmp/safe_vault_chunk_store.IvWyBeFJm87K with capacity of 2147483648 bytes.
I 18-12-15 12:06:06.921472 Bootstrapping(0b7d35…) Connection failed: The chosen proxy node has too few connections to peers.
I 18-12-15 12:06:06.922328 Bootstrapping(0b7d35…) Lost connection to proxy PublicId(name: aa3332…).
I 18-12-15 12:06:07.928702 Bootstrapping(0b7d35…) Lost connection to proxy PublicId(name: 5e076d…).
I 18-12-15 12:06:08.927927 Bootstrapping(0b7d35…) Lost connection to proxy PublicId(name: c93281…).
E 18-12-15 12:06:09.930050 Bootstrapper has no active children left - bootstrap has failed
I 18-12-15 12:06:09.930728 Bootstrapping(0b7d35…) Failed to bootstrap. Terminating.


#3

I came across the same problem. It’s as if the other nodes remember the IP address of the genesis node and later don’t accept a regular node at that address.

Edit: I just saw your second post. So this problem seems to be worse than I thought. I will try to reproduce it.


#4

Could you share your crust config file, please? I just want to see whether you have any explicit hard_coded_contacts listed. There could be two things at play here.

  1. If your hard_coded_contacts list is empty, then the nodes rely on service discovery to bootstrap to the network. The default discovery port is likely captured by the first node, which starts first and then facilitates the bootstrap of the other 2–5 nodes. When you kill the first node and start it as a regular node, it therefore can’t find a beacon to bootstrap against.

  2. Also, as the error from the second message shows: Connection failed: The chosen proxy node has too few connections to peers. This is triggered by https://github.com/maidsafe/routing/blob/master/src/states/node.rs#L1699-L1715. Until min_sec_size nodes exist, the only node that allows even other nodes to bootstrap is the first node, so taking it offline is basically game over: nobody can join without it until the network reaches that threshold.

You certainly don’t want to restart the first node with the -f flag, of course; that would just create a new network rather than join the existing one.

So if you have the routing config file overridden with min_sec_size: 5, then maybe start 6 nodes normally, kill the first node, and restart it as a normal node. Provided this restarting node has the hard_coded_contacts info of the other nodes set up, it should be able to join. (This really isn’t a restart: the network sees it as just another new node, but at this point it can bootstrap via a non-first node as well.)
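The gating rule described above can be sketched roughly like this. This is a hypothetical illustration, not the actual routing source; the `Proxy`, `is_first`, `peer_count`, and `min_sec_size` names are invented for the example, and the real check lives in the node.rs lines linked earlier.

```rust
// Hypothetical sketch of the bootstrap gating rule: a node acting as a
// bootstrap proxy only accepts newcomers once it has at least
// `min_sec_size` peer connections, unless it is the first (genesis) node.
struct Proxy {
    is_first: bool,
    peer_count: usize,
    min_sec_size: usize,
}

impl Proxy {
    fn accepts_bootstrap(&self) -> bool {
        self.is_first || self.peer_count >= self.min_sec_size
    }
}

fn main() {
    // The genesis node accepts bootstrap requests immediately.
    let genesis = Proxy { is_first: true, peer_count: 0, min_sec_size: 5 };
    assert!(genesis.accepts_bootstrap());

    // A regular node with only 4 peers rejects them, producing
    // "The chosen proxy node has too few connections to peers."
    let small = Proxy { is_first: false, peer_count: 4, min_sec_size: 5 };
    assert!(!small.accepts_bootstrap());

    // Once min_sec_size is reached, any node can act as a proxy,
    // which is why starting 6 nodes lets the first one rejoin.
    let full = Proxy { is_first: false, peer_count: 5, min_sec_size: 5 };
    assert!(full.accepts_bootstrap());
}
```

This also explains why killing the genesis node in a 5-node network dooms the rest: every surviving node has fewer than min_sec_size peers, so none of them will serve as a proxy.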


#5

Good input.
@tfa That sounds like what’s happening; I was imagining it was something like that.
@Viv

  1. I do have hard coded contacts (one for each node), so that’s not it, though I did have some uncertainties about ports.
  2. I think you hit the nail on the head. I have 5 nodes, so when I kill the genesis node the network drops below min_sec_size. I’ll try what you explained (I was considering doing something similar for other reasons anyway) and see how that goes.

FYI: My config:
{
  "hard_coded_contacts": [
    "ip:5483",
    "ip2:5483",
    "ip3:5483",
    "ip4:5483",
    "ip5:5483"
  ],
  "tcp_acceptor_port": 5483,
  "force_acceptor_port_in_ext_ep": false,
  "service_discovery_port": 5484,
  "bootstrap_cache_name": null,
  "whitelisted_node_ips": null,
  "whitelisted_client_ips": null,
  "network_name": "private test network",
  "dev": {
    "disable_external_reachability_requirement": true
  }
}

Regarding ports, since I can’t find any docs that explain them, could somebody clarify these for me?
tcp_acceptor_port: (for accepting incoming connections, or something like that?)
service_discovery_port: (for discovering/connecting to the vault service?)
I’m sort of guessing there. I set the tcp_acceptor_port to the same value I have in the bootstrap settings, though it didn’t seem necessary. In the end I did get things working, but I’d like to understand it a bit more.


#6

You get a gold star today!

I’ll have to set up a couple more redundant nodes (and test whether there are still issues with the browser not responding when more than min_sec_size nodes are up, an issue I saw a while ago), but… it works. I started one node with -f, launched the other 5, killed the first and restarted it without -f, and was able to cycle through the network killing one node at a time without problems. Until I killed 2 nodes at once and ran into the same issue because I dropped below min_sec_size, oops (and then I did it all over again).

Looks like I’m in business though, appreciate the help!

I’ll do some more testing. I’m thinking I’ll run about 10 nodes (that should create 2 sections if my min size is 5, right…?) and should be able to kill several at a time without problems. Maybe I’ll even write my own chaos-monkey-type script…


#7

It’s not so simple. There is a reserve factor to avoid a merge if a node leaves one of the 2 sections just after the split, because that section would fall below the 5-node limit. If I remember correctly this factor is 3 nodes, which means that each of the 2 new sections must have at least 5 + 3 = 8 nodes to allow the split.

This makes a minimum of 16 nodes to trigger the split. It is a minimum because if the two halves of the section about to split are not balanced, more nodes are needed until the smaller half reaches 8 nodes.

So you need to temporarily create at least 16 nodes to trigger the split. After that you can delete some nodes to end up with 2 sections of 5 nodes each.
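The arithmetic above can be written out as a tiny sketch. The function name and the buffer value of 3 are taken from the explanation in this post, not from the routing source, so treat them as assumptions:

```rust
// Balanced-split arithmetic from the post above: each half of a
// splitting section must hold at least min_sec_size + buffer nodes,
// so a balanced split needs twice that many nodes in total.
fn split_threshold(min_sec_size: usize, buffer: usize) -> usize {
    2 * (min_sec_size + buffer)
}

fn main() {
    // With min_sec_size = 5 and a reserve factor of 3:
    // each new section needs 5 + 3 = 8 nodes,
    // so the splitting section needs at least 16.
    assert_eq!(split_threshold(5, 3), 16);
    println!("{}", split_threshold(5, 3)); // prints 16
}
```

So the 10-node plan from the previous post would stay a single section; it takes roughly 16 nodes (more if the two halves are unbalanced) before the first split happens.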