Cloud Is Us
Cloud Is Us distributes the effort necessary to process large graph datasets to a number of so-called contributors, each running in a Web browser. Every contributor processes a tiny fraction of the graph data; the partial results are then combined and delivered to the client. The allocation of a part of the graph and the combination of the results is performed by the allociner (= allocate + combine).
Architecture
The following steps are performed in a typical Cloud Is Us processing phase:
- The client initiates the processing by ingesting a graph dataset into the allociner, providing an HTTP URI that points to the location of a dataset - called the source - in N-Triples format.
- The allociner stream-reads the data from the client's source and allocates data chunks round-robin, on a per-subject basis, to the contributors (see the allocation sketch below).
- Once all contributors have loaded their data locally, the client can issue a query, which is distributed to all contributors.
- Each contributor executes the query locally and sends its result back to the allociner, where it is combined and made available to the client.
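The per-subject, round-robin allocation step could look roughly like the following sketch. This is a minimal illustration in plain JavaScript, not the actual allociner code; the function name, the triple representation and the contributors' `send` method are assumptions made for the example.

```js
// Hypothetical sketch: assign each subject's triples to contributors round-robin.
// `contributors` is assumed to be a list of objects that can receive triples
// (e.g. over WebSocket connections) - not the actual cloudisus.allociner API.
function allocateRoundRobin(triples, contributors) {
  var assignment = {};   // subject IRI -> contributor index
  var next = 0;          // next contributor in the round-robin cycle

  triples.forEach(function (triple) {
    // All triples sharing a subject go to the same contributor,
    // so subject-centric query patterns can be answered locally.
    if (!(triple.subject in assignment)) {
      assignment[triple.subject] = next;
      next = (next + 1) % contributors.length;
    }
    contributors[assignment[triple.subject]].send(triple);
  });

  return assignment;
}

// Example usage with an in-memory stand-in for contributors:
var contributors = [0, 1, 2].map(function (i) {
  return { id: i, triples: [], send: function (t) { this.triples.push(t); } };
});

allocateRoundRobin([
  { subject: 'http://dbpedia.org/resource/Berlin', predicate: 'rdfs:label', object: '"Berlin"' },
  { subject: 'http://dbpedia.org/resource/Berlin', predicate: 'dbo:country', object: 'dbr:Germany' },
  { subject: 'http://dbpedia.org/resource/Vienna', predicate: 'rdfs:label', object: '"Vienna"' }
], contributors);
```

Keeping all triples of a subject on one contributor means subject-centric query patterns can be answered locally, without cross-contributor joins.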
Performance and Scalability Considerations
The more contributors are available to Cloud Is Us, the faster a query can be executed. The bottleneck is likely to be the allociner, which is responsible both for initially distributing the data to the contributors and for eventually combining the results from them.
Let's now have a look at how, for a dataset with 1 billion (= 1,000,000,000 = 1B) triples, the processing capability increases with a growing number of contributors. One easily runs into the dimension of 1B triples these days - take, for example, an application that uses statistical data from Eurostat together with data from DBpedia, LinkedGeoData and data.gov.uk.
| #contributors | #triples per contributor |
|---------------|--------------------------|
| 10            | 100M                     |
| 100           | 10M                      |
| 1,000         | 1M                       |
| 10,000        | 100k                     |
| 100,000       | 10k                      |
| 1,000,000     | 1k                       |
Essentially, the table above tells us that with some 10k contributors - that is, people having an instance running in their Web browser - we are able to process a 1B-triple dataset in a fairly straightforward way, as it means a load of only some 100k triples per contributor.
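The numbers in the table follow from a simple division of the dataset size by the number of contributors; a minimal sketch of that arithmetic (the function name is illustrative only):

```js
// Approximate per-contributor load, ignoring that the per-subject
// round-robin split is not perfectly even.
function triplesPerContributor(totalTriples, numContributors) {
  return Math.ceil(totalTriples / numContributors);
}

console.log(triplesPerContributor(1e9, 10000)); // 100000, i.e. the 100k row above
```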
Components
- cloudisus.contributor and cloudisus.client: rdfstore.js (see the sketch below)
- cloudisus.allociner: Node.js/rdfstore.js + Dydra
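To give a rough idea of the contributor side, a local load-and-query cycle on top of rdfstore.js might look like the sketch below. This is an assumption-laden illustration: the chunk and the query are placeholders, and the exact rdfstore.js callback signatures may differ between versions.

```js
var rdfstore = require('rdfstore'); // in the browser, rdfstore.js would be loaded as a script instead

// Hypothetical sketch of what cloudisus.contributor might do locally:
// load the chunk received from the allociner, then run the distributed query.
rdfstore.create(function (err, store) {
  if (err) { throw err; }

  // Placeholder N-Triples chunk; in Cloud Is Us this would come from the allociner.
  var chunk = '<http://example.org/s> <http://example.org/p> "o" .';

  store.load('text/n3', chunk, function (err) {
    if (err) { throw err; }

    // Placeholder for the query distributed by the allociner on behalf of the client.
    store.execute('SELECT * WHERE { ?s ?p ?o }', function (err, results) {
      if (err) { throw err; }
      // In Cloud Is Us, `results` would be sent back to the allociner for combination.
      console.log(results);
    });
  });
});
```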
Todo
- implement round-robin stream load in allociner
- implement local SPARQL query in contributor
- implement combine in allociner (see the sketch after this list)
- implement client
- implement dashboard
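For the combine step flagged in the Todo above, the simplest case (a plain SELECT query without aggregates) could amount to concatenating and de-duplicating the bindings returned by the contributors. The sketch below is hypothetical and does not cover joins or aggregates that span contributors:

```js
// Hypothetical sketch: naive combination of SELECT query results.
// Each contributor is assumed to return an array of binding objects,
// e.g. [{ s: 'dbr:Berlin', label: 'Berlin' }, ...]; the allociner
// concatenates them and removes duplicates before answering the client.
function combineResults(perContributorResults) {
  var seen = {};
  var combined = [];
  perContributorResults.forEach(function (bindings) {
    bindings.forEach(function (binding) {
      var key = JSON.stringify(binding);
      if (!seen[key]) {
        seen[key] = true;
        combined.push(binding);
      }
    });
  });
  return combined;
}

// Example: results from two contributors with one overlapping binding.
console.log(combineResults([
  [{ s: 'dbr:Berlin', label: 'Berlin' }],
  [{ s: 'dbr:Vienna', label: 'Vienna' }, { s: 'dbr:Berlin', label: 'Berlin' }]
]));
```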
License
The software provided here is in the Public Domain.