SAFE network architecture and programming study plan

hunterlester · August 26, 2017, 6:38am

I’m devising a study plan to establish a knowledge base for myself comprised of computer science topics, architectural design, and programming skills for the network.

Goals:

Improve effectiveness as a core front-end developer
Understand low-level core libraries while learning Rust and network programming
Gain ability to answer questions more accurately on the forums and at meetup events

Although still in the process of discovery, I do know that during the process of learning I need a certain amount of my own experience coding and then solving errors. Just as importantly, however, I absolutely need a foundation in fundamental computer science concepts, documentation, and architectural design in order to understand why I’m coding and what I’m coding.

The following may take me a great deal of time as several pieces of the network are PhD research subjects on their own, for which I lack the formal training. Learning some of the more complex code bases inevitably will require many hours (years) of hands-on experience for me to grasp as well.

Anyone else whom is also on this path, there is the potential here for study groups. For example, we can choose a paper to read over a week and then convene on Google hangouts to discuss learned terms and concepts. In the case of a code repository, we could do the same, discussing purpose, function, and flow.

Subsequent posts in this thread will be more detailed summaries and questions about the article, subject, or repository that I’m learning as I work through the list.

I’ll start by taking the following introductory articles separately:

(For each, I’ll research terms that I don’t know, write questions for clarification, draw or recreate visual aids to experience concepts, and continue to refine my questions.)

https://github.com/maidsafe/MaidSafe-Documents/blob/master/Sigmoid/core-whitepaper.pdf

Moving on to focus on repositories:

Each repository will be taken on separately as a chunk of study, although at some point I imagine I’ll need to understand how they work together as a whole. Perhaps I’ll reorder this list and begin with the root crate that the others depend on: Crust?:

How does accumulator work?
What does it mean to accumulate multiple values under one key? Does that mean that a key represents a vector of values?
Experiment with code to experience accumulation

Our base for cryptography
Read, study, write questions, recreate visual aids: https://www.lvh.io/posts/nonce-misuse-resistance-101.html
Learn: https://www.crypto101.io/
Make a decision about how deeply I need to explore this subject as is useful to data encryption, weighed against my lack of mathematical training in the area

Is the purpose of this to parse JSON configuration file for use by programs?
What else? What am I missing?

“Creates and supplies nonce in message.”
What exactly is a nonce? Why is it important in encryption?
What are transient nacl key pairs?
What is a box, in the context of encryption?
What else does this repo provide that I’m missing?

Basis for files that are saved to the network to be split into pieces, encrypted, and indexed to a map
Research: https://en.wikipedia.org/wiki/Data_deduplication
Research: https://en.wikipedia.org/wiki/Convergent_encryption
“parallelised approach can actually be faster than reading/writing data as a single stream”: What is the difference between single stream and parallel approach? Single thread vs multi threads?
How does the additional obfuscation step work? https://www.youtube.com/watch?v=Jnvwv4z17b4
Run and play with tests: https://github.com/maidsafe/self_encryption/blob/master/tests/lib.rs
What is optional compression? How does it work?

The basis for declaring basic network data types (MutableData and Immutable Data)
Basis for handling data look ups and placement on the network?
Research: http://en.wikipedia.org/wiki/Distributed_hash_table
Research: https://en.wikipedia.org/wiki/Kademlia
Watch series of talks: https://www.youtube.com/watch?v=Lr9FJRDcNzk&list=PLiYqQVdgdw_sSDkdIZzDRQR9xZlsukIxD
What am I missing? What else do I need to know about routing?

Basis for connecting peers?
What is the difference between crust and routing?
Research term: TCP
Research term: TCP hole punching
Research term: UDP
Research term: UDP hole punching
Research term: NAT
http://www.goto.info.waseda.ac.jp/~wei/file/wei-apan-v10.pdf
http://docs.maidsafe.net/Whitepapers/pdf/DHTbasedNATTraversal.pdf
- Write questions, research unknown terms, create visual aids to help conceptualize
Research term: UCP
Research term: UPnP
Study and run examples: https://github.com/maidsafe/crust/blob/master/examples/crust_peer.rs
Study and run tests: https://github.com/maidsafe/crust/blob/master/src/tests/mod.rs
What are the points I’ve missed about Crust?

Handles regisration of uniform resource locators based on the platform
For example, registers safe:// with Windows registry, just as http
What am I missing?

The basis for for creating a node on the network that can store data, access data for data that is assigned to it via routing?
Vaults will eventually, autonomously, sign a block along with others in a section to build data chains?
Research disjoint sections
Research groups versus sections
Disjoint: https://en.wikipedia.org/wiki/Disjoint-set_data_structure
A vault (node) can be a part of multiple groups but belong to only one section?
What am I missing?
Run and play with tests
Look into data visualization to watch node network in action

Basis for setting up a client that creates a user account, authenticates applications, handles requests to the network?
Provides foreign function interface in order to be able to write libraries in various programming languages that will then allow applications to interface with the network, using standard C data types?
The term self-authentication seems to be an important term to the network. What exactly does that mean?
Continue to understand client libs as necessary for font end work.
Create a project using libraries to build client for command line on Rasperry pi.

Secondary:

Become acquainted with and follow rfc statuses: https://github.com/maidsafe/rfcs/blob/master/RFCs-by-status.md#active-rfcs

oetyng · August 29, 2017, 11:52pm

@hunterlester, what a great idea, and superb job with that collection.
I’m definitely in.
I hope we wouldn’t get out of synch with the pace (I assume you will have quite some steem up already ).
But I’m in

will re-read the articles listed at top. (The pdf is a dead link btw). And I’ll send you a hangout msg.

UTC +2 here, btw.

bzee · August 30, 2017, 8:35pm

Interesting share. I’m on the same road to get a more thorough understanding of the technologies involved.

I personally like to do this top-bottom, as I’m initially and primarily interested in how to use the APIs to craft applications.

Study group sounds good!

hunterlester · August 30, 2017, 10:52pm

STRUCTURING NETWORKS WITH XOR
(Structuring Networks with XOR | by David Irvine | MetaQuestions | Medium)

Due to the necessarily high ID space and random placement, decentralized networks will not have nodes occupying every value in the ID range and therefore, the closest nodes are most likely not going to be one of the closest 3 nodes like the example above.

What’s the protocol for how the closest nodes are calculated, if all id’s in a range don’t exist? Let’s say you start with an input value of binary 10 (01010) and want to find 4 closest xor nodes. Starting with binary 1 (00001), the xor output is binary 11. What if binary 11 does not exist as an ID on the network, or does exist but is offline? What happens? Is this ID skipped, and the address to search for is incremented? When searching for closest nodes, will the algorithm always begin searching with binary 1?

It is extremely important in XOR networks that each node has a unique ID. In decentralized networks where there is no central party to issue and enforce unique IDs, this task requires a large enough ID range (ie. with 128-bit or 256-bit numbers) to reduce chance of overlap.

What is chance of overlap? How would I calculate that chance? What happens in the event of an overlap?

What will be the ID range on the network?
XOR ID’s are 32 bytes long. I see here, https://github.com/maidsafe/routing/blob/master/src/xor_name.rs#L36, that the bit length of an XorName is 256 bits (32 * 8)
So, is the range from 0 to (2**256 - 1)?

oetyng · August 30, 2017, 11:47pm

How funny, I was wondering the exact same thing. I think there must be some other algo than a linear search though since the range is so absurdly large.

Collisions would normally use some hash collision algorithm. I’m not sure though if the decentralized nature would hinder it.
Read more here about probabilities also: http://preshing.com/20110504/hash-collision-probabilities/

Given k randomly generated values, where each value is a non-negative integer less than N, what is the probability that at least two of them are equal?

e^(−k(k−1)/2N)

And I think you have the range there, 0-2^256.
So, you could take a guess at how many nodes or files there would be, and find probability of collision with that evaluation.

digipl · August 31, 2017, 8:31am

Fraser can ask that…

https://stackoverflow.com/questions/25751928/kademlia-xor-metric-properties-purposes

hunterlester · September 3, 2017, 3:17pm

It appears that the answer to my question begins here, https://github.com/maidsafe/routing/blob/master/src/routing_table/mod.rs#L523, where all sections of a routing table are iterated over, https://github.com/maidsafe/routing/blob/master/src/routing_table/mod.rs#L309, to find the closest nodes to a given XorName.

Sections, https://github.com/maidsafe/routing/blob/master/src/routing_table/mod.rs#L132, are made up of BTreeMap’s, where each section Prefix key corresponds to a tuple value composed of a version and a BTreeSet of section member XorName’s, perhaps?
I’m reading about BTreeMap performance:

In theory, a binary search tree (BST) is the optimal choice for a sorted map, as a perfectly balanced BST performs the theoretical minimum amount of comparisons necessary to find an element (log2n). However, in practice the way this is done is very inefficient for modern computer architectures.

Currently, our implementation simply performs naive linear search. This provides excellent performance on small nodes of elements which are cheap to compare. However in the future we would like to further explore choosing the optimal search strategy based on the choice of B, and possibly other factors. Using linear search, searching for a random element is expected to take O(B logBn) comparisons, which is generally worse than a BST. In practice, however, performance is excellent.

It appears that the answer is that we are using linear search and that it performs excellently. I was originally assuming that all possible ID’s in the 256-bit range were being traversed, to find the closest nodes to a given XorName. On the contrary, known names are being compared, section by section.

Further reading, not for deep research but just to become basically acquainted:

Big O notation: these appear to express performance costs of operations?
Binary Search Tree: BST’s seems to be compared to BTreeMap’s in the Rust std collections docs. Are they simply a different type of data structure?

Using this hash collision calculator, http://everydayinternetstuff.com/2015/04/hash-collision-probability-calculator/, based on the article that @oetyng posted above:
With a 256-bit hash id range, we hit the first probability of 4.440892098500626e-16 for a hash collision when 10 decillion ID’s, 10_000_000_000_000_000_000_000_000_000_000 have been created.

It’s currently estimated that 2.5 Exabytes of data are created on the internet every day. Incorrectly assuming that this will remain constant, because I don’t understand the factors for the rate of increase over time, it would take 4 quintillion years, 4_000_000_000_000_000_000, to reach 10 decillion ID’s. The assumption is that each 1 Mb chunk of data has it’s own ID. I’m sure there is something I’ve missed or some miscalculation, but at the very least I can see the vast amount of time it would take to even consider hash collisions.

intrz · September 3, 2017, 10:06pm

Big O notation is a notation used to describe how fast an algorithm will run depending on the size on the input. For example O(1) means it is independent on the size of the input, O(n) means it has a linear dependency on the length of the input, for example when you have to do something to every element of an array. You typically have a big O value for the average case and one for the worst case.

A binary search tree and a B-tree are two different data structures. A B-tree is a generalization of a binary search tree and is often used for database indexes and file systems.

BTreeMap is a map, i.e. something that maps keys to values, implemented with a B-tree. Maps are generally either hash maps or sorted maps. In a hash map the keys are not sorted, but lookup is O(1) while in a sorted map lookup is usually something like O(log n), but the keys are ordered. A BTreeMap is a type of tree map which is a type of sorted map, other search tree data structures like Red-Black tree can also be used to implement a tree map. You use a sorted map when you need to iterate over the keys and need them to be in order.

am2on · September 6, 2017, 12:54pm

I’d love to be a part of the study group as I’m also interested in digging deeper into the network architecture of SAFE, but I’m scared it’s going to be way over my head!

Guess I’ll just start reading the papers and see how well I can comprehend them

hunterlester · September 6, 2017, 2:20pm

This is all over my head. For me it started as a slow process of reading a sentence and having to look up several words in the dictionary as well as having to research some terms that are packed with contextual computer science meaning and computer science history. Then after reading I spend time thinking about it all or maybe even drawing or looking up how a concept is implemented in our code base in order to more firmly conceptualize. Getting easier as I persist and I’ve come to love it.

It’s also a personally psychological process of confronting my thoughts of inadequacy.

I’m habituating the study process so that I can stick with this long enough to prove to myself that I can understand the concepts and even be able to program these concepts.

Do join us. We’re still figuring out a good way to interact. Google hangouts may be more difficult than I realized because of timezone issues and very sparse free time on my end.

am2on · September 6, 2017, 2:50pm

Ok, sounds good!

Instead of Hangouts, can I recommend we use Riot.im (a web-client for the Matrix messaging protocol)?

I try to avoid touching Google where possible, and Riot can handle video conferences as well as text (and is open source!)

Perhaps we could create a room where people can post their findings, ask questions, and answer other questions in real time. We could then also organize video calls when most of us are available.

am2on · September 6, 2017, 9:25pm

@hunterlester

I created a room on Riot/Matrix with the address #safenetstudy:matrix.org.

Currently, anyone can join it, even on a guest account. So I invite anyone else interested in the study group to join as well

mattmc · September 19, 2017, 4:42pm

Thanks for sharing what you have discovered so far. I’m just now diving into the SAFE network and this helped me understand the concepts in use. I’m going to start diving into some code