Spam Prevention

Get spam (No registered account required):

  • A malicious client can do valid GETs of large chunks (the largest allowed for a data type by the network). This can be done from unregistered accounts.
  • It can also do invalid GETs. This is not as catastrophic as the previous case, but it can still flood the network with many GET requests.

Mutation spam (Requires an Account):

  • The motivation here is that failed mutations are not charged. So a malicious client can keep sending mutation requests for MAX-capacity chunks to an address space it knows will be rejected (due to, say, data-already-exists, etc.). With the MutableData paradigm this request will be worked upon by the MaidManagers, go to the DataManagers, produce a failure response back to the MaidManagers and finally back to the client - a lot of work for the network. So without spending any account balance, the client will have flooded the network with huge mutations.
  • Even after the account balance is spent, the client can still flood the MaidManagers with PUTs (this time it can choose ImmutableData or MutableData; it doesn’t matter as the account balance has run out anyway). This is not as catastrophic as the previous case, as it will be rejected at the MaidManager stage, but it can still be disruptive.

Solutions:

  • The proxy node is the SPOC (single point of contact) for the client, and since only we are running vaults in the next few test networks, it might be the best place to prevent DoSing of the network by letting the proxy bear the brunt of the checks and filters.
  • The proxy node holds a struct SpamCheck { di: LruCache<DataIdentifier>, failures: LruCache<()> }
  • These are kept per client as clients: HashMap<ClientPubKey, SpamCheck>
  • LruCache<DataIdentifier> has capacity = C0 and expiry = E0 min, and accounts for GETs only. So you cannot repeat a GET for the same DataIdentifier within E0 min, and you cannot do more than C0 GETs in total within any given E0-min interval.
  • LruCache<()> is basically a leaky bucket with capacity = C1 and expiry = E1 min, and accounts for failures in both GETs and mutations. If it fills up, the connection to the client is severed (see the sketch below).
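Below is a minimal, self-contained sketch of what such a per-client check could look like, using only the Rust standard library (a real implementation would probably reuse an LRU cache crate). The type aliases, constants and the exact throttle/disconnect policy are illustrative assumptions, not a finalised design.

```rust
// Sketch of the per-client spam checks described above, std-only.
// All names, constants and the exact policy are assumptions.

use std::collections::{HashMap, VecDeque};
use std::time::{Duration, Instant};

type DataIdentifier = [u8; 32]; // placeholder for the real DataIdentifier
type ClientPubKey = [u8; 32];   // placeholder for the real client public key

const C0: usize = 100;                         // max GETs per window
const E0: Duration = Duration::from_secs(600); // GET window (E0 min)
const C1: usize = 20;                          // max failures per window
const E1: Duration = Duration::from_secs(600); // failure window (E1 min)

#[derive(Default)]
struct SpamCheck {
    gets: VecDeque<(Instant, DataIdentifier)>, // recent GETs, oldest first
    failures: VecDeque<Instant>,               // recent failures ("leaky bucket")
}

enum Verdict {
    Allow,
    Throttle,   // drop / delay this request
    Disconnect, // sever the connection to the client
}

impl SpamCheck {
    /// Drop entries that have fallen outside their window.
    fn prune(&mut self, now: Instant) {
        while self.gets.front().map_or(false, |(t, _)| now - *t > E0) {
            let _ = self.gets.pop_front();
        }
        while self.failures.front().map_or(false, |t| now - *t > E1) {
            let _ = self.failures.pop_front();
        }
    }

    /// Called for every GET the client sends through this proxy.
    fn on_get(&mut self, di: DataIdentifier) -> Verdict {
        let now = Instant::now();
        self.prune(now);
        // Same DataIdentifier repeated within E0, or more than C0 GETs in E0.
        if self.gets.iter().any(|(_, d)| *d == di) || self.gets.len() >= C0 {
            return Verdict::Throttle;
        }
        self.gets.push_back((now, di));
        Verdict::Allow
    }

    /// Called whenever a GET or mutation from this client fails.
    fn on_failure(&mut self) -> Verdict {
        let now = Instant::now();
        self.prune(now);
        self.failures.push_back(now);
        if self.failures.len() >= C1 {
            Verdict::Disconnect
        } else {
            Verdict::Allow
        }
    }
}

// The proxy keeps one SpamCheck per connected client:
type Clients = HashMap<ClientPubKey, SpamCheck>;
```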

Other thoughts

We need to test and finalise the values for these variables (C0, C1, E0, etc.), of course. We also need to see how we can ban such clients from joining again and/or levy other, more aggressive penalties.


And this fails against a DDoS, since no single SPOC sees rates above typical levels.

Then the attacker simply requests different DataIdentifiers. Increment the address of the data being requested each time and this timer is meaningless (if I understand its purpose correctly).

So is this for faulty GETs only?

If not, then is this going to affect me when I want to retrieve my 200 to 1000 GByte star data file using a 1 Gbit/s link (or a 10 Gbit/s link in a few years)?


Another thought is that GETs done without an account are for public data only, and you could use this to apply some limits. That way the ordinary user would not be affected but a spammer/attacker would be.

Then GETs done with an account can be for private & public data, and a small history can be kept. If spamming activity occurs often, the limits are tightened, and, like your leaky bucket, over “time” (measured in network events) the account can be cleared of the spammy flag if it does not exhibit any more spammy activity. This should not affect the ordinary user who views their movie collection, etc., even if they were flagged as spammy.
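A rough sketch of what such a fading, event-based history might look like is below; the score, decay rate and threshold are purely illustrative assumptions, not an agreed design.

```rust
// Sketch of a per-account "spammy" history measured in network events
// rather than wall-clock time. Names and constants are assumptions.

/// Per-account record kept by the nodes that manage the account.
struct AccountHistory {
    /// Accumulated "spamminess" score; decays as network events pass.
    spam_score: f64,
    /// Network event counter value when the score was last updated.
    last_event: u64,
}

const DECAY_PER_EVENT: f64 = 0.001; // fraction of the score shed per network event
const FLAG_THRESHOLD: f64 = 100.0;  // above this the account is treated as spammy

impl AccountHistory {
    /// Decay the score by the number of network events that have passed.
    fn decay(&mut self, current_event: u64) {
        let elapsed = current_event.saturating_sub(self.last_event) as f64;
        self.spam_score *= (1.0 - DECAY_PER_EVENT).powf(elapsed);
        self.last_event = current_event;
    }

    /// Record one spammy incident (e.g. a burst of rejected requests).
    fn record_incident(&mut self, current_event: u64, weight: f64) {
        self.decay(current_event);
        self.spam_score += weight;
    }

    /// Whether the account should currently be handled with tighter limits.
    fn is_flagged(&self, current_event: u64) -> bool {
        let elapsed = current_event.saturating_sub(self.last_event) as f64;
        self.spam_score * (1.0 - DECAY_PER_EVENT).powf(elapsed) > FLAG_THRESHOLD
    }
}
```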

Using an account requires a cost to create the account, etc., and this has an effect on DDoS ability, even if only a minor one. But it does allow the account to be tightened down if detected, so an ongoing attack from an account will eventually cause the account to be slowed down a lot, making any spammy activity ineffectual.

Remember that spammy activity can be caused by the next big “social media meme” APP that is either badly written or deliberately malicious and remains running even after the user thinks it’s finished. It then becomes a “social media” attack. The point is that even a badly written (accidentally or deliberately) APP could cause spammy activity, so any measures can also help to protect the user.


You have quoted the next sentence too, which answers your concern about meaninglessness. Why break this up and look at it in isolation when there are 2 effects at play?

Yes, this may be of concern eventually, but I still think it might be OK for starters until we evolve more sophisticated spam-pattern analysis. There can (and probably will) be more discussions to improve on this.

I don’t understand what you mean. The Network has no idea which data is cipher-text and which is plain-text (take ImmutableData for instance), so there is no distinction between private and public data, if that’s what you are hinting at. Also, can you please explain what the limits are and where to apply them (e.g. the initial write-up currently says at the proxy, so where would you suggest putting the limit and what would it be)?


You’re right, I misread it as 2 separate counters. Now I see it’s a time period with 2 separate triggering events.

People already have 1 Gbps links at home, and many business connections are this and more, without the contention ratios of home connections. So it can be an issue today, let alone when the network goes live and is large enough for large data stores.

In my opinion we need the discussions now on how to do this without overly affecting users.

[quote=“ustulation, post:3, topic:583”]
I don’t understand what you mean,
[/quote]

The idea was that we have 2 different types of usage. One is without accounts, and that can only access public data storage (obviously). The other is with accounts, which also have private storage and the ability to mutate/store data.

ALSO, I thought that a computer accessing the network would have multiple relay nodes to help anonymise usage patterns and also to prevent a bad actor from knowing too much about a computer accessing the network. I guess this isn’t finalised and we work with the one relay node for now.

In effect we treat the 2 types of usage differently, since a computer accessing the network without an account can only do GET spamming, and we can be a little more restrictive on its performance if it goes above reasonable limits. We can also get more restrictive if the activity continues. The network can also change these values to account for the network load: if heavily loaded, reduce the limits; if not, the limits can be less restrictive. This only applies to the current session of that computer while not logged into an account.

For a computer logged into an account, we can apply the limits to the particular account. This also covers the situation where the computer is using 2 accounts, say downloading the star data file on one and doing other stuff on the other. For an account we can keep some history. A spammer will likely be using a set of accounts (for DDoS) at different times (different attacks), so the network could flag these accounts as spammy and be more restrictive on them. This forces the attacker to spend more to get new accounts and reduces the spammer’s effect.

There is no distinction between the data, but a distinction between the types of usage. No account means treating the computer/session separately from previous times. With an account we can treat the account over “time” (the passing of network events), or in other words keep a history for the account. This history should fade out if the spammy activity was long ago (in network-event terms).

This needs to be discussed / worked out. I wanted to see what you thought of the concept of widening the scope a little.


Also, when I wrote about the network setting the limits, another thought I had was whether we could apply some limits to usage depending on network load. That way heavy users cannot swamp out small users when network load gets heavy (load based on nearby groups). Instead of everyone getting 40% of unloaded performance, the light users get 75% and heavy users get, say, 30% because of the limits.

This would be achieved by limits on maximum throughput (rate). And if done right, it could also act as the limit on spammy usage, since spammers would arguably be the users wanting the most throughput.
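As a discussion aid, here is a tiny sketch of how such load-dependent caps could work, following the 75% / 30% example above; how “heavy” users are identified and how group load is measured are open questions, and all numbers are assumptions.

```rust
// Illustrative load-dependent throughput caps. The light/heavy classification,
// the load measure and every constant below are assumptions for discussion only.

/// Fraction of a client's unloaded throughput allowance it may use,
/// given the load (0.0 = idle, 1.0 = saturated) of the nearby group.
fn allowed_fraction(group_load: f64, is_heavy_user: bool) -> f64 {
    if group_load < 0.5 {
        // Plenty of headroom: no artificial limit.
        1.0
    } else if is_heavy_user {
        // Heavy users are squeezed hardest as load rises: 1.0 -> 0.3
        1.0 - 1.4 * (group_load - 0.5)
    } else {
        // Light users keep most of their performance: 1.0 -> 0.75
        1.0 - 0.5 * (group_load - 0.5)
    }
}

fn main() {
    // At full load, a heavy user is capped at ~30% and a light user at ~75%
    // of their unloaded rate, so heavy (or spammy) users cannot swamp the rest.
    println!("heavy @ full load: {:.2}", allowed_fraction(1.0, true));  // 0.30
    println!("light @ full load: {:.2}", allowed_fraction(1.0, false)); // 0.75
}
```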