RFC 44 – Relay Nodes

frabrunelle · September 13, 2016, 3:44pm

Discussion topic for RFC 44 – Relay Nodes

AndreasF · September 13, 2016, 3:51pm

My main concern with this RFC is that it will add a lot of complexity (a whole new type of node, with its own rules for interactions with the other nodes and clients) compared to e.g. allowing the node to join but making it relocate (once, or periodically?) after a certain amount of work, and only allowing it to vote after the relocation.

That would seem simpler to me, and it would also require work before getting a vote and a permanent name. I also don’t quite understand how relay nodes would benefit the network.

If it’s only connected to one group, it simply adds an intermediate hop between the client and the network. The group’s nodes still need to relay the client’s requests and route the responses back via the relay node, instead of directly to the client. If it’s connected to nodes from different groups, it adds a lot of additional connections to the network’s nodes and would effectively be wired similarly to a routing node itself, so why not make it one?

To be able to better estimate a relay node’s use and costs for the network, it would help if the RFC contained more detail about exactly which nodes the relay is supposed to connect to and which and how many relays a client will use.

And finally, we’ll have to play with the following numbers until it’s secure:

If an attacker has the means to run A average nodes at a time and we consider a network with G groups that is churning without growing or shrinking significantly, it would take the attacker G / A times the time to get a new node into the group of their choice, compared to how long it takes for an average user to join the network. It would therefore take them QUORUM * G / A times that long to actually take over the group.

E.g. in a network with G = 10000 groups and quorum size 5, an attacker with A = 100 nodes would take 500 times as long. We probably want that timespan to be at least several years, so every user would have to run their relay node for several days until they are allowed to become a vault?
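To make the arithmetic above concrete, here is a minimal sketch of the estimate. The function names and the three-year target are assumptions chosen only for illustration; the factor QUORUM * G / A is the one from the post.

```python
def attack_slowdown(groups: int, attacker_nodes: int, quorum: int) -> float:
    """Factor by which taking over one group is slower than one ordinary join."""
    return quorum * groups / attacker_nodes


def min_join_days(target_attack_days: float, groups: int,
                  attacker_nodes: int, quorum: int) -> float:
    """Average join time (in days) needed so a takeover lasts target_attack_days."""
    return target_attack_days / attack_slowdown(groups, attacker_nodes, quorum)


if __name__ == "__main__":
    # Numbers from the example: G = 10000, A = 100, quorum size 5.
    print(attack_slowdown(groups=10_000, attacker_nodes=100, quorum=5))  # 500.0
    # Hypothetical target: the takeover should need at least three years.
    print(min_join_days(3 * 365, 10_000, 100, 5))  # 2.19
```

Under these assumptions, a three-year target corresponds to a bit over two days as a relay node for every joining user, which is where "several days" comes from.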

I imagine that periodic relocation (with exponentially increasing pauses) would fix that vulnerability, because the attacker wouldn’t be able to keep the nodes inside the targeted group, even once they managed to join.

dirvine · September 13, 2016, 4:15pm

Nice feedback @AndreasF

I think this type of node already exists, but is hidden in conditional statements in code. If it were a concrete type then it would become easier to reason about and hopefully to code. I see the same with bootstrap nodes actually. IMHO these should be concrete and use the type system more correctly.

This is, in reality, what happens here: the node relocates from Y to Z after it does work.

The RFC states, but perhaps not clearly enough, that when a node is promoted to ManagedNode it should not have full rights or equal weight. This still requires the node to grow in trust, and I think data chains provide some neat tools to allow this to be considered further.

Agree, these do only connect to one group (so far) and the group will relay responses back through these nodes, but not refresh all messages; the group member is only a hop in the network.

The reason is to ensure only nodes that are capable even get on the first rung of the ladder, i.e. they must be able to route and deliver messages back to clients, otherwise they get ignored and cannot build the required count to promote.

Agree this should perhaps be added. I think testing will help, but initially these nodes should not exceed 50% of group size. Clients will connect to their group (X) and be told of these relays only. They then use only these to communicate with the network. A client should use all relays in the group for its own benefit: more parallel requests and much less chance of losing its network connection.

On losing a relay, a client should query the relays it is still connected to in order to find new relays. This can be improved so that every relay informs every client of new relays.
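A minimal sketch of that client behaviour might look as follows. The `Client` class, the `on_relay_lost` method and the `known_relays_of` callback are all hypothetical names standing in for whatever the RFC ends up specifying; the callback replaces a real network query.

```python
class Client:
    """Hypothetical client that talks to the network only through relays."""

    def __init__(self, relays):
        # The client uses every relay its group told it about.
        self.relays = set(relays)

    def on_relay_lost(self, lost, known_relays_of):
        """Drop `lost`, then ask every remaining relay which relays it knows.

        `known_relays_of` stands in for a network query: given a relay
        name, it returns the set of relays that relay currently knows.
        """
        self.relays.discard(lost)
        discovered = set()
        for relay in self.relays:
            discovered |= known_relays_of(relay)
        discovered.discard(lost)  # never re-add the relay that just failed
        self.relays |= discovered
```

The suggested improvement, where relays push news of new relays to clients, would replace the polling loop with a handler on the client side.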

I think this misses a potentially important point: the attacker’s nodes will have to perform work to get to Z and that will have a cost. The cost needs to be calculated, I agree, but it’s not free. I think these numbers probably need a bit more clarification, but we must consider that the relays will be spread through the network and won’t promote until the group farms a safecoin (well, mints one actually).

Perhaps it’s best we get the RFC solid and with more clarity to then calculate these numbers?

rob · September 14, 2016, 1:12am

I see a problem here if you don’t allow vaults to be vaults within a short time. Many people do not leave their computer (and thus their spare resources) on overnight. I would hope that a vault can be a vault very quickly and be storing chunks / serving chunks.

I thought the problem to be solved was nodes’ other functions and not the function of vault storage.

AndreasF · September 14, 2016, 10:48am

I think this type of node already exists, but is hidden in conditional statements in code.

But the tasks that the new node type would remove from the existing nodes are just relaying to and from clients. The tasks that it would add are:

Relaying to and from relay nodes.
Referring clients to relay nodes.
Evaluating relay nodes.

That still looks to me like it would neither remove complexity from the node implementation, nor offload work from the nodes.

This still requires the node growing in trust

The danger I see here, in the absence of periodic relocation, is that once an attacker has a lot of nodes in a group, they can just keep them and run them for a while (possibly months), until they are trusted enough to control the group.

I think this misses a potentially important point: the attacker’s nodes will have to perform work to get to Z and that will have a cost.

No, that’s exactly what I mean by running “A average nodes”: They do perform the same work as any other user before they get to group Z and potentially drop their node and start over. I think I took all of that into account in my estimate.

AndreasF · September 14, 2016, 10:49am

I thought the problem to be solved was nodes’ other functions and not the function of vault storage.

I agree, that’s another argument for allowing the node to fully join: The security aspect is just about the vote for group consensus, not about allowing it to store data, etc. So a node that doesn’t stay online for a long time could still perform all tasks of a full node and earn actual Safecoin, just without having a vote.

dirvine · September 14, 2016, 2:12pm

I see this as a huge workload though, it causes normal nodes right now to overload and be attacked that way.

Routing messages should be a routine issue for nodes; synchronising and accumulating out-of-band messages, though, is work that should be avoided.

Agree on the evaluation of relay nodes, but we will have to evaluate all nodes soon. The evaluation here is a simple one: sync and agree on put responses, not all messages. That way we use the paid messages to evaluate relays.

When the node gets into a group, I think there are several further steps in the rank algorithm/process, and periodic relocation on every safecoin farmed by the individual node could prove very beneficial. I see this as a separate RFC though. This one focusses on getting into a group in the first place, i.e. step 1.

I am missing the point of this action, can you elaborate here please.

AndreasF · September 14, 2016, 2:56pm

Right, the client spam attack is a good point.
(But there are probably also other ways this could be prevented. E.g. a node could just refuse to accept a client and refer it to another node.)

As long as bandwidth is the network’s bottleneck and every full message is sent via several hops, routing messages is probably the bulk of a node’s work.

I also don’t understand how accumulation could be avoided; but for that, too, it would help me a lot if the RFC were more concrete regarding which nodes the relay node needs to connect to.

I’m just trying to estimate how many resources and how much time an attacker would need to invest to gain control over a group. So I assume that an attacker has the means to run A average nodes and that the network has G groups and quorum Q. If the attacker wants to target a group, this RFC will slow down their attack considerably, of course: they have to run a relay node as long as the average user has to, until they are assigned their permanent network name. They have to make G such attempts on average until they land one of their nodes in the target group. Since they run A nodes in parallel, the time it takes them to get Q nodes into a targeted group is Q * G / A times as long as it takes the average user to get their node promoted to a full node.
My concern is that Q * G / A might turn out to be too small a factor, because in the end, getting your legitimate node onto the network should be easy while taking over a group should be practically impossible.
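The Q * G / A factor can be sanity-checked with a small simulation. This is only a sketch under simplifying assumptions that are not in the RFC: every promotion lands a node in a uniformly random group, each attempt takes exactly one average join time (a "round"), nodes that reach the target group stay there, and unsuccessful nodes are dropped and restarted. The function names are made up for the example.

```python
import random


def rounds_to_takeover(groups, attacker_nodes, quorum, rng):
    """Count join rounds until `quorum` attacker nodes have landed in group 0."""
    landed = 0
    rounds = 0
    while landed < quorum:
        rounds += 1
        # Each of the attacker's parallel nodes is promoted into a random group.
        for _ in range(attacker_nodes):
            if rng.randrange(groups) == 0:
                landed += 1
    return rounds


def mean_rounds(groups, attacker_nodes, quorum, trials=1000, seed=1):
    """Average number of rounds over `trials` independent simulated attacks."""
    rng = random.Random(seed)
    total = sum(rounds_to_takeover(groups, attacker_nodes, quorum, rng)
                for _ in range(trials))
    return total / trials


if __name__ == "__main__":
    # Small toy network: G = 200, A = 10, Q = 5, so Q * G / A = 100.
    print(mean_rounds(groups=200, attacker_nodes=10, quorum=5))
```

With these toy parameters the simulated mean should come out close to 100 rounds, i.e. about Q * G / A average join times, which is the factor that would have to be large enough on a real network.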

dirvine · September 14, 2016, 3:06pm

I do not think the relay is involved in group-to-group hop messages? Perhaps that’s a miscommunication?

A relay is a single address in the network; although there are several, a message is routed by only one of these. They should not stress the network but remove the stress of nodes providing relay for clients. If we imagine a client in parallel putting up thousands of chunks, the relays pick these up and send them to their destination as a single message (ofc we slice it up etc.), but not as a group message.

Perhaps I am seeing this differently and not communicating it though.

I think, though, that here you assume the attacker’s nodes all outperform all normal nodes on the network? That perhaps makes it easier to reason about, but maybe it is not as realistic?

Interesting area to dive further into though. So perhaps consider this adds a delay D.

It also should keep low performing (incapable) nodes out of the routing tables. This is a big win IMO.

Then we can focus on D, but I think that is not where the security comes from completely at all. As a node gets into the routing tables then Rank has to be considered, i.e. relocate on safecoin etc.

AndreasF · September 14, 2016, 3:18pm

Ah, sorry, I wasn’t referring to relay nodes at all there. I just meant that it’s the bulk of work of a regular node.

Makes sense.
And if the relay node doesn’t route them all via the same node that would better distribute the load.

No, just that they can run A average nodes. After all, a regular user also doesn’t have to outperform everyone else just to get their relay node promoted to a node.

Ah, okay.
(I had misread the Motivation section as this being an alternative to periodic relocation.)

dirvine · September 14, 2016, 3:22pm

Ah, they do. Sorry, this should be clearer in the RFC.

So there are <50% relays per group (say 3). When the group mints a safecoin (simulated by just a delay in early tests) then the top performing relay based on (for the moment) number of PutResponses received is promoted. Then the count resets for all relays.

So slow nodes should never get promoted over better performing nodes. Clients stop using nodes that perform slowly, so there will be no PutResponses with that relay identified as the relay. Therefore only the best performing relays are promoted and the count restarts.
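As a rough illustration of that promotion rule, a group could keep a tally like the one below. The `RelayScoreboard` class and its method names are hypothetical, and the alphabetical tie-break is an assumption added only to make the sketch deterministic; the thread does not say how ties are resolved.

```python
from typing import Dict, Optional


class RelayScoreboard:
    """Hypothetical per-group tally of PutResponses seen via each relay."""

    def __init__(self, relays):
        self.counts: Dict[str, int] = {name: 0 for name in relays}

    def record_put_response(self, relay: str) -> None:
        """Credit `relay` with one PutResponse; unknown relays are ignored."""
        if relay in self.counts:
            self.counts[relay] += 1

    def on_safecoin_minted(self) -> Optional[str]:
        """Promote the top relay, then reset every remaining count."""
        if not self.counts:
            return None
        # Assumption: ties go to the alphabetically first relay name.
        best = max(sorted(self.counts), key=lambda name: self.counts[name])
        if self.counts[best] == 0:
            return None  # nobody relayed a PutResponse, so nobody is promoted
        del self.counts[best]
        for name in self.counts:
            self.counts[name] = 0
        return best
```

A relay that clients have abandoned never accumulates a count, so it cannot win a promotion round in this model.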

AndreasF · September 14, 2016, 3:30pm

Makes sense.
But I don’t think it changes my estimate by a lot, just what I mean by “average”: I assume that the A nodes the attacker can run are fast enough that after a few attempts they manage to get promoted. They are average among the successfully joining nodes: the time it takes them to be promoted is the average time it takes a legitimate user to do so.

And the groups in which they compete are different from the group they get a chance to join after being promoted, right? So the <50% limit on relay nodes doesn’t protect the target group at all, as they are not joining the target group as relays but as full nodes.

dirvine · September 14, 2016, 3:55pm

Yes group Z is the hash of the safecoin minted, so “random”

Not sure what you mean here? The target group would be Z, I think. So this should further increase the probability only nodes with enough resources get through to Z. When they get to Z they are routing table nodes, but this does not mean they have a vote at all?

dirvine · September 14, 2016, 3:59pm

Maybe we are confusing secure with 100% security or similar. I do not see any single part of the network providing that, but each part should increase security. Maybe it’s a naming issue with “secure”: it could be “further secure” or “protect against low-performing nodes” or similar?

This RFC aims to ensure that a node gets to Z securely and by that with a secured Id and proven resource capability. Maybe there is a better title there that leads the conversation to a narrower, more defined role?

AndreasF · September 14, 2016, 4:14pm

No, that’s fine. I think it’s more the Motivation section that I had read as claiming that it alone would prevent that attack.

Yes, exactly. That’s why this RFC does increase the time needed for an attack, as in my estimate.

OK, with that and further forced relocation, the attack doesn’t work anymore, of course.
(Depending on the details of the future Periodic Relocation feature.)

dirvine · September 14, 2016, 4:16pm

OK, then this is probably the key point to reinforce, plus some more work on cleaning up specifically what the clients should do/expect?

AndreasF · September 14, 2016, 4:19pm

And maybe which (how many, in which groups) connections to full nodes a relay node needs to establish. That’s a point I’m still unclear about.

dirvine · September 15, 2016, 12:14am

Do you mind reviewing @AndreasF, ofc shout if you think this needs more.

AndreasF · September 15, 2016, 8:23am

Thank you for the clarification!

I’m still not sure that the benefits are worth the added complexity, though. The group still has to relay every message from the client, and relay back every response (and accumulate it as a group, if we implement the group hops), so the only way in which this helps the group would be that the relay node can send each message from the client to a different member.

dirvine · September 15, 2016, 8:38am

But also, the group itself should not have to weed out the incapable nodes by some other algorithm here. The least capable nodes should be automatically not promoted.

However, even from this perspective it assumes many nodes waiting to join. So if there was only one node and it managed at least 1 PutResponse then it could get promoted, but that is probably highly unlikely.

I do not see why the groups would accumulate anything for the relay node/client; that can all just be pushed back to the client, I think.