Cloud Is Us
Cloud Is Us distributes the effort necessary to process large graph datasets to a number of so-called contributors, each running in a Web browser. Every contributor processes a tiny fraction of the graph data; the partial results are then combined and delivered to the client. The allocation of a part of the graph and the combination of the results is performed by the allociner (= allocate + combine).
Architecture
The following steps are performed in a typical Cloud Is Us processing phase:
- The client initiates the processing by ingesting a graph dataset into the allociner, providing an HTTP URI that points to the location of a dataset - called the source - in N-Triples format.
- The allociner stream-reads the data from the client's source and allocates data chunks round-robin, on a per-subject basis, to the contributors (see the allocation sketch below).
- Once all contributors have loaded their data locally, the client can issue a query, which is distributed to all contributors.
- Each contributor executes the query locally and sends its result back to the allociner, where it is combined and made available to the client.
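The per-subject, round-robin allocation step could look roughly like the following sketch. This is a minimal illustration in plain JavaScript, not the actual allociner code; the function name, the triple representation and the contributors' `send` method are assumptions made for the example.

```js
// Hypothetical sketch: assign each subject's triples to contributors round-robin.
// `contributors` is assumed to be a list of objects that can receive triples
// (e.g. over WebSocket connections) - not the actual cloudisus.allociner API.
function allocateRoundRobin(triples, contributors) {
  var assignment = {};   // subject IRI -> contributor index
  var next = 0;          // next contributor in the round-robin cycle

  triples.forEach(function (triple) {
    // All triples sharing a subject go to the same contributor,
    // so subject-centric query patterns can be answered locally.
    if (!(triple.subject in assignment)) {
      assignment[triple.subject] = next;
      next = (next + 1) % contributors.length;
    }
    contributors[assignment[triple.subject]].send(triple);
  });

  return assignment;
}

// Example usage with an in-memory stand-in for contributors:
var contributors = [0, 1, 2].map(function (i) {
  return { id: i, triples: [], send: function (t) { this.triples.push(t); } };
});

allocateRoundRobin([
  { subject: 'http://dbpedia.org/resource/Berlin', predicate: 'rdfs:label', object: '"Berlin"' },
  { subject: 'http://dbpedia.org/resource/Berlin', predicate: 'dbo:country', object: 'dbr:Germany' },
  { subject: 'http://dbpedia.org/resource/Vienna', predicate: 'rdfs:label', object: '"Vienna"' }
], contributors);
```

Keeping all triples of a subject on one contributor means subject-centric query patterns can be answered locally, without cross-contributor joins.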
Performance and Scalability Considerations
The more contributors are available to Cloud Is Us, the faster a query can be executed. The bottleneck is likely to be the allociner, which is responsible both for initially distributing the data to the contributors and for eventually combining the results from them.
Let's now have a look at how, for a dataset with 1 billion (= 1,000,000,000 = 1B) triples, the processing capability increases with a growing number of contributors. One easily runs into the dimension of 1B triples these days - take, for example, an application that uses statistical data from Eurostat together with data from DBpedia, LinkedGeoData and data.gov.uk.
| #contributors | #triples per contributor |
|---------------|--------------------------|
| 10            | 100M                     |
| 100           | 10M                      |
| 1,000         | 1M                       |
| 10,000        | 100k                     |
| 100,000       | 10k                      |
| 1,000,000     | 1k                       |
Essentially, the table above tells us that with some 10k contributors - that is, people having an instance running in their Web browser - we are able to process a 1B-triple dataset in a fairly straightforward way, as it means a load of only some 100k triples per contributor.
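The numbers in the table follow from a simple division of the dataset size by the number of contributors; a minimal sketch of that arithmetic (the function name is illustrative only):

```js
// Approximate per-contributor load, ignoring that the per-subject
// round-robin split is not perfectly even.
function triplesPerContributor(totalTriples, numContributors) {
  return Math.ceil(totalTriples / numContributors);
}

console.log(triplesPerContributor(1e9, 10000)); // 100000, i.e. the 100k row above
```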
Components
- cloudisus.contributor and cloudisus.client: rdfstore.js (see the sketch below)
- cloudisus.allociner: Node.js/rdfstore.js + Dydra
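To give a rough idea of the contributor side, a local load-and-query cycle on top of rdfstore.js might look like the sketch below. This is an assumption-laden illustration: the chunk and the query are placeholders, and the exact rdfstore.js callback signatures may differ between versions.

```js
var rdfstore = require('rdfstore'); // in the browser, rdfstore.js would be loaded as a script instead

// Hypothetical sketch of what cloudisus.contributor might do locally:
// load the chunk received from the allociner, then run the distributed query.
rdfstore.create(function (err, store) {
  if (err) { throw err; }

  // Placeholder N-Triples chunk; in Cloud Is Us this would come from the allociner.
  var chunk = '<http://example.org/s> <http://example.org/p> "o" .';

  store.load('text/n3', chunk, function (err) {
    if (err) { throw err; }

    // Placeholder for the query distributed by the allociner on behalf of the client.
    store.execute('SELECT * WHERE { ?s ?p ?o }', function (err, results) {
      if (err) { throw err; }
      // In Cloud Is Us, `results` would be sent back to the allociner for combination.
      console.log(results);
    });
  });
});
```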
Todo
- implement round-robin stream load in allociner
- implement local SPARQL query in contributor
- implement combine in allociner (see the sketch after this list)
- implement client
- implement dashboard
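For the combine step flagged in the Todo above, the simplest case (a plain SELECT query without aggregates) could amount to concatenating and de-duplicating the bindings returned by the contributors. The sketch below is hypothetical and does not cover joins or aggregates that span contributors:

```js
// Hypothetical sketch: naive combination of SELECT query results.
// Each contributor is assumed to return an array of binding objects,
// e.g. [{ s: 'dbr:Berlin', label: 'Berlin' }, ...]; the allociner
// concatenates them and removes duplicates before answering the client.
function combineResults(perContributorResults) {
  var seen = {};
  var combined = [];
  perContributorResults.forEach(function (bindings) {
    bindings.forEach(function (binding) {
      var key = JSON.stringify(binding);
      if (!seen[key]) {
        seen[key] = true;
        combined.push(binding);
      }
    });
  });
  return combined;
}

// Example: results from two contributors with one overlapping binding.
console.log(combineResults([
  [{ s: 'dbr:Berlin', label: 'Berlin' }],
  [{ s: 'dbr:Vienna', label: 'Vienna' }, { s: 'dbr:Berlin', label: 'Berlin' }]
]));
```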
License
The software provided here is in the Public Domain.