Building a Web Crawler on SAFE Network

This is not something I have time for myself, but as it is going to be a very useful app I am starting this topic for people to post resources and ideas on how to crawl the SAFE Web, and for anyone who wants to have a go at this to hijack and make their own.

To kick off, see the article below on building a distributed crawler in Python, running on cloud servers, that can crawl 40 pages/s. I haven't even read it :blush:, but from the summary it seems a good starting point for anyone thinking of doing this. Essentially, I'm envisaging a client-side application running in the cloud, which would ideally store its results as public data on SAFE and make them browsable and searchable by other clients.

Obviously it doesn't need to be implemented in Python, but it certainly could be, in which case it might be quite easy to adapt this code as a starting point. One could also use a scraper, but from the sound of things the project below could be the basis of something capable of handling quite a large web, which SAFE is likely to become pretty rapidly. And if none of us want to do it, maybe we should see if Ben is interested! :slight_smile:


An interesting thing about SAFE is that we’ll see the need for a new type of crawler, a data crawler rather than a web crawler.

Sure, there will still be completely static HTML files that will need to be crawled by a traditional web crawler, but data-driven websites could be crawled by a data crawler instead of, or in addition to, traditional crawling.

What do I mean by a “data crawler”?

Data-driven websites that don't use only static HTML files can expose their public data directly. This basically means that not only can you have crawlers crawling the rendered HTML output, you can also have crawlers crawling the source data and doing things like creating links between entities in different databases, e.g. saying that this album in this database is the same as that album in that database.
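To make the idea concrete, here's a minimal sketch of that kind of entity linking. Everything here is invented for illustration (the datasets, the record fields, and the matching rule); a real SAFE data crawler would read public structured data from the network rather than local lists, and would need far more robust matching than a lowercased title/artist key.

```python
# Hypothetical sketch: a "data crawler" matching entities across two
# public datasets instead of crawling rendered HTML.

def normalize(title, artist):
    """Build a crude matching key from an album's title and artist."""
    return (title.strip().lower(), artist.strip().lower())

def link_albums(db_a, db_b):
    """Return (id_a, id_b) pairs for albums that appear in both datasets."""
    index = {normalize(r["title"], r["artist"]): r["id"] for r in db_a}
    links = []
    for record in db_b:
        key = normalize(record["title"], record["artist"])
        if key in index:
            links.append((index[key], record["id"]))
    return links

# Two imaginary public databases exposing album records:
db_a = [{"id": "a1", "title": "Kind of Blue", "artist": "Miles Davis"}]
db_b = [{"id": "b7", "title": "kind of blue ", "artist": "Miles Davis"}]

links = link_albums(db_a, db_b)  # [("a1", "b7")]
```

The output is a set of cross-database links ("album a1 here is album b7 there") that a search engine or app could publish back to the network as more public data.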


To some extent I think we might even be able to get around the need for crawlers.

An app owner can create a search index for the app they are building and make this index public. A search engine can then aggregate indexes like this. The aggregation would basically treat each entire index as a document: normally an index might have the term fish point to the documents doc1 and doc2, but an aggregate index would instead have the term fish point to something like index1 and index5. A search query for the term fish would first hit the search engine, which would direct it to continue on index1 and index5. Mostly this is at the "it would be cool if it works" stage, but if we could make it work, it would decentralize the whole search engine and crawling process to a much larger extent than today.
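The two-hop lookup described above can be sketched in a few lines. All the names and data here are invented for illustration (index1, index5, the term fish and its documents come straight from the example in the text); on SAFE the app indexes would be public data fetched from the network, not in-memory dicts.

```python
# Hypothetical sketch of the aggregate-index idea: each app publishes its
# own search index (term -> documents), and the search engine only keeps
# an aggregate index (term -> which app indexes contain that term).

app_indexes = {
    "index1": {"fish": ["doc1", "doc2"], "boat": ["doc3"]},
    "index5": {"fish": ["doc9"]},
}

def build_aggregate(indexes):
    """Map each term to the names of the app indexes that contain it."""
    aggregate = {}
    for name, index in indexes.items():
        for term in index:
            aggregate.setdefault(term, []).append(name)
    return aggregate

def search(term, aggregate, indexes):
    """Resolve a query in two hops: aggregate first, then each app index."""
    results = []
    for name in aggregate.get(term, []):
        results.extend(indexes[name].get(term, []))
    return results

aggregate = build_aggregate(app_indexes)
# aggregate["fish"] is ["index1", "index5"]; search() then follows those
# pointers to collect doc1, doc2 and doc9 from the individual app indexes.
```

The nice property is that the search engine never crawls anything itself; it only merges indexes that app owners have already published, which is what pushes the crawling work out to the edges.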