Open Catalog

Welcome to the DARPA Open Catalog, which contains a curated list of DARPA-sponsored software and peer-reviewed publications. DARPA funds fundamental and applied research in a variety of areas including data science, cyber, anomaly detection, etc., which may lead to experimental results and reusable technology designed to benefit multiple government domains.

The DARPA Open Catalog organizes publicly releasable material from DARPA programs, beginning with the XDATA program in the Information Innovation Office (I2O). XDATA is developing an open source software library for big data. DARPA has an open source strategy through XDATA and other I2O programs to help increase the impact of government investments.

DARPA is interested in building communities around government-funded software and research. If the R&D community shows sufficient interest, DARPA will continue to make available information generated by DARPA programs, including software, publications, data and experimental results. Future updates are scheduled to include components from other I2O programs such as Broad Operational Language Translation (BOLT) and Visual Media Reasoning (VMR).

The DARPA Open Catalog contains two tables:

The Software Table lists performers with one row per piece of software. Each piece of software has a link to an external project page, as well as a link to the code repository for the project. The software categories are listed; in the case of XDATA, they are Analytics, Visualization and Infrastructure. A description of the project is followed by the applicable software license. Finally, each entry links to the publications associated with that team's software.
The Publications Table contains author(s), title, and links to peer-reviewed articles related to specific DARPA programs.
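
As a rough illustration of the Software Table layout described above, the sketch below reads rows into dictionaries keyed by the table's column headings. It assumes the table has been exported as tab-separated text with eight fields per row; the function names `parse_software_rows` and `by_category` are hypothetical, and the two sample rows use abbreviated descriptions.

```python
import csv
import io

# Column names as they appear in the Software table header.
COLUMNS = [
    "XDATA Team", "Software", "Category", "Instructional Material",
    "Code", "Dev Stats", "Description", "License",
]


def parse_software_rows(text):
    """Parse tab-separated catalog rows into dicts keyed by column name.

    Assumption: one row per line, fields separated by tabs.
    """
    # QUOTE_NONE keeps quotation marks inside descriptions as literal text.
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = []
    for fields in reader:
        if len(fields) != len(COLUMNS):
            continue  # skip blank or malformed lines
        rows.append(dict(zip(COLUMNS, fields)))
    return rows


def by_category(rows):
    """Group software names by the value of their Category column."""
    groups = {}
    for row in rows:
        groups.setdefault(row["Category"], []).append(row["Software"])
    return groups


# Two sample rows with shortened descriptions, for illustration only.
SAMPLE = (
    "Aptima Inc.\tNetwork Query by Example\tAnalytics\t2014-07\t"
    "https://github.com/Aptima/pattern-matching.git\tstats\t"
    "Network query by example.\tALv2\n"
    "Continuum Analytics\tBokeh\tVisualization\t2014-07\t"
    "https://github.com/ContinuumIO/bokeh.git\tstats\t"
    "Python interactive visualization library.\tBSD\n"
)

if __name__ == "__main__":
    for category, names in by_category(parse_software_rows(SAMPLE)).items():
        print(category, names)
```

Rows with an empty Dev Stats cell still carry eight fields under this assumption, so they parse with an empty string in that column.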

Program Manager:
Dr. Christopher White
christopher.white@darpa.mil

Report a problem: opencatalog@darpa.mil

The content below has been generated by organizations that are partially funded by DARPA; the views and conclusions contained therein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of DARPA or the U.S. Government.

Software:

XDATA Team	Software	Category	Instructional Material	Code	Dev Stats	Description	License
Aptima Inc.	Network Query by Example	Analytics	2014-07	https://github.com/Aptima/pattern-matching.git	stats	Hadoop MapReduce-over-Hive based implementation of network query by example utilizing attributed network pattern matching.	ALv2
Boeing/Pitt Publications	SMILE-WIDE: A scalable Bayesian network library	Analytics	2014-07	https://github.com/SmileWide/main.git	stats	SMILE-WIDE is a scalable Bayesian network library. Initially, it is a version of the SMILE library, as in SMILE With Integrated Distributed Execution. The general approach has been to provide an API similar to the existing API SMILE developers use to build "local," single-threaded applications. However, we provide "vectorized" operations that hide a Hadoop-distributed implementation. Apart from invoking a few idioms like generic Hadoop command line argument parsing, these appear to the developer as if they were executed locally.	ALv2
Carnegie Mellon University Publications	Support Distribution Machines	Analytics	2014-07	https://github.com/dougalsutherland/py-sdm.git	stats	Python implementation of the nonparametric divergence estimators described by Barnabas Poczos, Liang Xiong, Jeff Schneider (2011). Nonparametric divergence estimation with applications to machine learning on distributions. Uncertainty in Artificial Intelligence. ( http://autonlab.org/autonweb/20287.html ) and also their use in support vector machines, as described by Dougal J. Sutherland, Liang Xiong, Barnabas Poczos, Jeff Schneider (2012). Kernels on Sample Sets via Nonparametric Divergence Estimates. ( http://arxiv.org/abs/1202.0302 ).	BSD
Continuum Analytics	Blaze	Infrastructure	2014-07	https://github.com/ContinuumIO/blaze.git	stats	Blaze is the next generation of NumPy. It is designed as a foundational set of abstractions on which to build out-of-core and distributed algorithms over a wide variety of data sources and to extend the structure of NumPy itself. Blaze allows easy composition of low level computation kernels (C, Fortran, Numba) to form complex data transformations on large datasets. In Blaze, computations are described in a high-level language (Python) but executed on a low-level runtime (outside of Python), enabling the easy mapping of high-level expertise to data without sacrificing low-level performance. Blaze aims to bring Python and NumPy into the massively-multicore arena, allowing it to leverage many CPU and GPU cores across computers, virtual machines and cloud services.	BSD
Continuum Analytics	Numba	Infrastructure	2014-07	https://github.com/numba/numba.git	stats	Numba is an Open Source NumPy-aware optimizing compiler for Python sponsored by Continuum Analytics, Inc. It uses the LLVM compiler infrastructure to compile Python syntax to machine code. It is aware of NumPy arrays as typed memory regions and so can speed up code using NumPy arrays. Other, less well-typed code is translated to Python C-API calls, effectively removing the "interpreter" but not removing the dynamic indirection. Numba is also not a tracing just-in-time (JIT) compiler. It compiles your code before it runs, using either run-time type information or type information you provide in the decorator. Numba is a mechanism for producing machine code from Python syntax and typed data structures such as those that exist in NumPy.	BSD
Continuum Analytics	Bokeh	Visualization	2014-07	https://github.com/ContinuumIO/bokeh.git	stats	Bokeh (pronounced bo-Kay or bo-Kuh) is a Python interactive visualization library for large datasets that natively uses the latest web technologies. Its goal is to provide elegant, concise construction of novel graphics in the style of Protovis/D3, while delivering high-performance interactivity over large data to thin clients.	BSD
Continuum Analytics and Indiana University Publications	Abstract Rendering	Visualization	2014-07	https://github.com/JosephCottam/AbstractRendering.git	stats	Information visualization rests on the idea that a meaningful relationship can be drawn between pixels and data. This is most often mediated by geometric entities (such as circles, squares and text) but always involves pixels eventually to display. In most systems, the pixels are tucked away under levels of abstraction in the rendering system. Abstract Rendering takes the opposite approach: expose the pixels and gain powerful pixel-level control. This pixel-level power is a complement to many existing visualization techniques. It is an elaboration on rendering, not an analytic or projection step, so it can be used as an epilogue to many existing techniques. In standard rendering, geometric objects are projected to an image and represented on that image's discrete pixels. The source space is an abstract canvas that contains logically continuous geometric primitives and the target space is an image that contains discrete colors. Abstract Rendering fits between these two states. It introduces a discretization of the data at the pixel-level, but not necessarily all the way to colors. This enables many pixel-level concerns to be efficiently and concisely captured.	BSD
Continuum Analytics	CDX	Visualization	2014-07	https://github.com/ContinuumIO/cdx.git	stats	Software to visualize the structure of large or complex datasets / produce guides that help users or algorithms gauge the quality of various kinds of graphs & plots.	BSD
Continuum Analytics and Indiana University Publications	Stencil	Visualization	2014-07	https://github.com/JosephCottam/Stencil.git	stats	Stencil is a grammar-based approach to visualization specification at a higher level.	BSD
Data Tactics Corporation	Vowpal Wabbit	Analytics	2014-07	https://github.com/JohnLangford/vowpal_wabbit.git	stats	The Vowpal Wabbit (VW) project is a fast out-of-core learning system sponsored by Microsoft Research and (previously) Yahoo! Research. Support is available through the mailing list. There are two ways to have a fast learning algorithm: (a) start with a slow algorithm and speed it up, or (b) build an intrinsically fast learning algorithm. This project is about approach (b), and it's reached a state where it may be useful to others as a platform for research and experimentation. There are several optimization algorithms available with the baseline being sparse gradient descent (GD) on a loss function (several are available). The code should be easily usable. Its only external dependence is on the boost library, which is often installed by default.	BSD
Data Tactics Corporation	Circuit	Infrastructure	2014-07	https://code.google.com/p/gocircuit/source/checkout		Go Circuit reduces the human development and sustenance costs of complex massively-scaled systems nearly to the level of their single-process counterparts. It is a combination of proven ideas from the Erlang ecosystem of distributed embedded devices and Go's ecosystem of Internet application development. Go Circuit extends the reach of Go's linguistic environment to multi-host/multi-process applications.	ALv2
Georgia Tech / GTRI Publications	libNMF: a high-performance library for nonnegative matrix factorization and hierarchical clustering	Analytics	2014-07	Pending		LibNMF is a high-performance, parallel library for nonnegative matrix factorization on both dense and sparse matrices written in C++. Implementations of several different NMF algorithms are provided, including multiplicative updating, hierarchical alternating least squares, nonnegative least squares with block principal pivoting, and a new rank2 algorithm. The library provides an implementation of hierarchical clustering based on the rank2 NMF algorithm.	ALv2
IBM Research Publications	SKYLARK: Randomized Numerical Linear Algebra and ML	Analytics	2014-07	2014-05-15		SKYLARK implements Numerical Linear Algebra (NLA) kernels based on sketching for distributed computing platforms. Sketching reduces dimensionality through randomization, and includes Johnson-Lindenstrauss random projection (JL); a faster version of JL based on fast transform techniques; sparse techniques that can be applied in time proportional to the number of nonzero matrix entries; and methods for approximating kernel functions and Gram matrices arising in nonlinear statistical modeling problems. We have a library of such sketching techniques, built using MPI in C++ and callable from Python, and are applying the library to regression, low-rank approximation, and kernel-based machine learning tasks, among other problems.	ALv2
Institute for Creative Technologies / USC	Immersive Body-Based Interactions	Visualization	2014-07	http://code.google.com/p/svnmimir/source/checkout	stats	Provides innovative interaction techniques to address human-computer interaction challenges posed by Big Data. Examples include: * Wiggle Interaction Technique: user-induced motion to speed visual search. * Immersive Tablet-Based Viewers: low-cost 3D virtual reality fly-throughs of data sets. * Multi-touch interfaces: browsing/querying multi-attribute and geospatial data, hosted by SOLR. * Tablet-based visualization controller: eye-free rapid interaction with visualizations.	ALv2
Johns Hopkins University Publications	igraph	Analytics	2014-07	https://github.com/igraph/xdata-igraph.git	stats	igraph provides fast generation of large graphs, fast approximate computation of local graph invariants, and fast parallelizable graph embedding. It also offers an API and Web service for batch processing graphs across formats.	GPLv2
Trifacta (Stanford, University of Washington, Kitware, Inc. Team)	Vega	Visualization	2014-07	https://github.com/trifacta/vega.git	stats	Vega is a visualization grammar, a declarative format for creating and saving visualization designs. With Vega you can describe data visualizations in a JSON format, and generate interactive views using either HTML5 Canvas or SVG.	BSD
Kitware, Inc.	Tangelo	Visualization	2014-07	https://github.com/Kitware/tangelo.git	stats	Tangelo provides a flexible HTML5 web server architecture that cleanly separates your web applications (pure Javascript, HTML, and CSS) and web services (pure Python). This software is bundled with some great tools to get you started.	ALv2
Harvard and Kitware, Inc. Publications	LineUp	Visualization	2014-07	https://github.com/Caleydo/org.caleydo.vis.lineup.demos.git	stats	LineUp is a novel and scalable visualization technique that uses bar charts. This interactive technique supports the ranking of items based on multiple heterogeneous attributes with different scales and semantics. It enables users to interactively combine attributes and flexibly refine parameters to explore the effect of changes in the attribute combination. This process can be employed to derive actionable insights as to which attributes of an item need to be modified in order for its rank to change. Additionally, through integration of slope graphs, LineUp can also be used to compare multiple alternative rankings on the same set of items, for example, over time or across different attribute combinations. We evaluate the effectiveness of the proposed multi-attribute visualization technique in a qualitative study. The study shows that users are able to successfully solve complex ranking tasks in a short period of time.	BSD
Harvard and Kitware, Inc. Publications	LineUp Web	Visualization	2014-07	2014-06		LineUpWeb is the web version of the novel and scalable visualization technique. This interactive technique supports the ranking of items based on multiple heterogeneous attributes with different scales and semantics. It enables users to interactively combine attributes and flexibly refine parameters to explore the effect of changes in the attribute combination.	BSD
Stanford, University of Washington, Kitware, Inc.	Lyra	Visualization	2014-07	2014-02		Lyra is an interactive environment that makes custom visualization design accessible to a broader audience. With Lyra, designers map data to the properties of graphical marks to author expressive visualization designs without writing code. Marks can be moved, rotated and resized using handles; relatively positioned using connectors; and parameterized by data fields using property drop zones. Lyra also provides a data pipeline interface for iterative, visual specification of data transformations and layout algorithms. Visualizations created with Lyra are represented as specifications in Vega, a declarative visualization grammar that enables sharing and reuse.	BSD
Phronesis	stat_agg	Analytics	2014-07	https://github.com/kaneplusplus/stat_agg.git	stats	stat_agg is a Python package that provides statistical aggregators that maximize ensemble prediction accuracy by weighting individual learners in an optimal way. When used with the laputa package, learners may be distributed across a cluster of machines. The package also provides fault-tolerance when one or more learners become unavailable.	ALv2
Phronesis	flexmem	Infrastructure	2014-07	https://github.com/kaneplusplus/flexmem.git	stats	Flexmem is a general, transparent tool for out-of-core (OOC) computing in the R programming environment. It is launched as a command line utility, taking an application as an argument. All memory allocations larger than a specified threshold are memory-mapped to a binary file. When data are not needed, they are stored on disk. It is both process- and thread-safe.	ALv2
Phronesis	laputa	Infrastructure	2014-07	https://github.com/kaneplusplus/laputa.git	stats	Laputa is a Python package that provides an elastic, parallel computing foundation for the stat_agg (statistical aggregates) package.	ALv2
Phronesis	bigmemory	Infrastructure	2014-07	http://cran.r-project.org/web/packages/bigmemory/index.html		Bigmemory is an R package to create, store, access, and manipulate massive matrices. Matrices are allocated to shared memory and may use memory-mapped files. Packages biganalytics, bigtabulate, synchronicity, and bigalgebra provide advanced functionality.	ALv2
Phronesis	bigalgebra	Infrastructure	2014-07	https://r-forge.r-project.org/scm/viewvc.php/?root=bigmemory		Bigalgebra is an R package that provides arithmetic functions for R matrix and big.matrix objects.	ALv2
MDA Information Systems, Inc., Jet Propulsion Laboratory, USC/Information Sciences Institute	OODT	Infrastructure	2014-07	https://svn.apache.org/repos/asf/oodt/	stats	APACHE OODT enables transparent access to distributed resources, data discovery and query optimization, and distributed processing and virtual archives. OODT provides software architecture that enables models for information representation, solutions to knowledge capture problems, unification of technology, data, and metadata.	ALv2
MDA Information Systems, Inc., Jet Propulsion Laboratory, USC/Information Sciences Institute	Wings	Infrastructure	2014-07	https://github.com/varunratnakar/wings.git	stats	WINGS provides a semantic workflow system that assists scientists with the design of computational experiments. A unique feature of WINGS is that its workflow representations incorporate semantic constraints about datasets and workflow components, and are used to create and validate workflows and to generate metadata for new data products. WINGS submits workflows to execution frameworks such as Pegasus and OODT to run workflows at large scale in distributed resources.	ALv2
MIT-LL Publications	Query By Example (Graph QuBE)	Analytics	2014-07	2014-02-15		Query-by-Example (Graph QuBE) on dynamic transaction graphs.	ALv2
MIT-LL Publications	Julia	Analytics	2014-07	https://github.com/JuliaLang/julia.git	stats	Julia is a high-level, high-performance dynamic programming language for technical computing, with syntax that is familiar to users of other technical computing environments. It provides a sophisticated compiler, distributed parallel execution, numerical accuracy, and an extensive mathematical function library.	MIT,GPL,LGPL,BSD
MIT-LL Publications	Topic	Analytics	2014-07	Pending		Probabilistic Latent Semantic Analysis (pLSA) Topic Modeling.	ALv2
MIT-LL Publications	SciDB	Infrastructure	2014-07	https://github.com/wujiang/SciDB-mirror.git	stats	Scientific Database for large-scale numerical data.	GPLv3
MIT-LL Publications	Information Extractor	Analytics	2014-07	Pending		Trainable named entity extractor (NER) and relation extractor.	ALv2
Next Century Corporation	Ozone Widget Framework	Visualization	2014-07	https://github.com/ozoneplatform/owf.git	stats	Ozone Widget Framework provides a customizable open-source web application that assembles the tools you need to accomplish any task and enables those tools to communicate with each other. It is a technology-agnostic composition framework for data and visualizations in a common browser-based display and interaction environment that lowers the barrier to entry for the development of big data visualizations and enables efficient exploration of large data sets.	ALv2
Next Century Corporation	Neon Visualization Environment	Visualization	2014-07	https://github.com/NextCenturyCorporation/neon.git	stats	Neon is a framework that gives visualizations a datastore-agnostic way to query data and perform simple operations on that data such as filtering, aggregation, and transforms. It is divided into two parts, neon-server and neon-client. Neon-server provides a set of RESTful web services to select a datastore and perform queries and other operations on the data. Neon-client is a JavaScript API that provides a way to easily integrate neon-server capabilities into a visualization, and also aids in 'widgetizing' a visualization, allowing it to be integrated into a common OWF-based ecosystem.	ALv2
Oculus Info Inc. Publications	ApertureJS	Visualization	2014-07	https://github.com/oculusinfo/aperturejs.git	stats	ApertureJS is an open, adaptable and extensible JavaScript visualization framework with supporting REST services, designed to produce visualizations for analysts and decision makers in any common web browser. Aperture utilizes a novel layer based approach to visualization assembly, and a data mapping API that simplifies the process of adaptable transformation of data and analytic results into visual forms and properties. Aperture vizlets can be easily embedded with full interoperability in frameworks such as the Ozone Widget Framework (OWF).	MIT
Oculus Info Inc. Publications	Influent	Visualization	2014-07	https://github.com/oculusinfo/influent.git	stats	Influent is an HTML5 tool for visually and interactively following transaction flow, rapidly revealing actors and behaviors of potential concern that might otherwise go unnoticed. Summary visualization of transactional patterns and actor characteristics, interactive link expansion and dynamic entity clustering enable Influent to operate effectively at scale with big data sources in any modern web browser. Influent has been used to explore data sets with millions of entities and hundreds of millions of transactions.	MIT
Oculus Info Inc. Publications	Aperture Tile-Based Visual Analytics	Visualization	2014-07	https://github.com/oculusinfo/aperture-tiles.git	stats	New tools for raw data characterization of 'big data' are required to suggest initial hypotheses for testing. The widespread use and adoption of web-based maps has provided a familiar set of interactions for exploring abstract large data spaces. Building on these techniques, we developed tile based visual analytics that provide browser-based interactive visualization of billions of data points.	MIT
Oculus Info Inc. Publications	Oculus Ensemble Clustering	Analytics	2014-07	https://github.com/oculusinfo/ensemble-clustering.git	stats	Oculus Ensemble Clustering is a flexible multi-threaded clustering library for rapidly constructing tailored clustering solutions that leverage the different semantic aspects of heterogeneous data. The library can be used on a single machine using multi-threading or distributed computing using Spark.	MIT
Raytheon BBN	Content and Context-based Graph Analysis: PINT, Patterns in Near-Real Time	Analytics	2014-07	https://github.com/plamenbbn/XDATA.git	stats	Patterns in Near-Real Time will take any corpus as input and quantify the strength of the query match to an SME-based process model, represent the process model as a Directed Acyclic Graph (DAG), and then search and score potential matches.	ALv2
Raytheon BBN	Content and Context-based Graph Analysis: NILS, Network Inference of Link Strength	Analytics	2014-07	https://github.com/plamenbbn/XDATA.git	stats	Network Inference of Link Strength will take any text corpus as input and quantify the strength of connections between any pair of entities. Link strength probabilities are computed via shortest path.	ALv2
Royal Caliber Publications	GPU based Graphlab style Gather-Apply-Scatter (GAS) platform for quickly implementing and running graph algorithms	Analytics	2014-07	https://github.com/RoyalCaliber/vertexAPI2.git	stats	Allows users to express graph algorithms as a series of Gather-Apply-Scatter (GAS) steps similar to GraphLab. Runs these vertex programs using a single or multiple GPUs - demonstrates a large speedup over GraphLab.	ALv2
Scientific Systems Company, Inc., MIT, and University of Louisville	BayesDB	Analytics	2014-07	https://github.com/mit-probabilistic-computing-project/BayesDB.git	stats	BayesDB is an open-source implementation of a predictive database table. It provides predictive extensions to SQL that enable users to query the implications of their data --- predict missing entries, identify predictive relationships between columns, and examine synthetic populations --- based on a Bayesian machine learning system in the backend.	ALv2
Scientific Systems Company, Inc., MIT, and University of Louisville	Crosscat	Analytics	2014-07	https://github.com/mit-probabilistic-computing-project/crosscat.git	stats	CrossCat is a domain-general, Bayesian method for analyzing high-dimensional data tables. CrossCat estimates the full joint distribution over the variables in the table from the data via approximate inference in a hierarchical, nonparametric Bayesian model, and provides efficient samplers for every conditional distribution. CrossCat combines strengths of nonparametric mixture modeling and Bayesian network structure learning: it can model any joint distribution given enough data by positing latent variables, but also discovers independencies between the observable variables.	ALv2
Sotera Defense Solutions, Inc. Publications	Zephyr	Infrastructure	2014-07	http://github.com/Sotera/zephyr	stats	Zephyr is a big data, platform agnostic ETL API, with Hadoop MapReduce, Storm, and other big data bindings.	ALv2
Sotera Defense Solutions, Inc. Publications	Page Rank	Analytics	2014-07	https://github.com/Sotera/page-rank.git	stats	Sotera Page Rank is a Giraph/Hadoop implementation of a distributed version of the Page Rank algorithm.	ALv2
Sotera Defense Solutions, Inc. Publications	Louvain Modularity	Analytics	2014-07	https://github.com/Sotera/distributed-louvain-modularity.git	stats	Giraph/Hadoop implementation of a distributed version of the Louvain community detection algorithm.	ALv2
Sotera Defense Solutions, Inc. Publications	Spark MicroPath	Analytics	2014-07	https://github.com/Sotera/aggregate-micro-paths.git		The Spark implementation of the micropath analytic.	ALv2
Sotera Defense Solutions, Inc. Publications	ARIMA	Analytics	2014-07	https://github.com/Sotera/rhipe-arima	stats	Hive and RHIPE implementation of an ARIMA analytic.	ALv2
Sotera Defense Solutions, Inc. Publications	Leaf Compression	Analytics	2014-07	https://github.com/Sotera/leaf-compression.git	stats	Recursive algorithm to remove nodes from a network where degree centrality is 1.	ALv2
Sotera Defense Solutions, Inc. Publications	Correlation Approximation	Analytics	2014-07	https://github.com/Sotera/correlation-approximation	stats	Spark implementation of an algorithm to find highly correlated vectors using an approximation algorithm.	ALv2
Stanford University - Boyd Publications	QCML (Quadratic Cone Modeling Language)	Analytics	2014-07	https://github.com/cvxgrp/qcml.git	stats	Seamless transition from prototyping to code generation. Enable ease and expressiveness of convex optimization across scales with little change in code.	ALv2
Stanford University - Boyd Publications	PDOS (Primal-dual operator splitting)	Analytics	2014-07	https://github.com/cvxgrp/pdos.git	stats	Concise algorithm for solving convex problems; solves problems passed from QCML.	ALv2
Stanford University - Boyd Publications	SCS (Self-dual Cone Solver)	Analytics	2014-07	https://github.com/cvxgrp/scs.git	stats	Implementation of a solver for general cone programs, including linear, second-order, semidefinite and exponential cones, based on an operator splitting method applied to a self-dual homogeneous embedding. The method and software support both direct factorization, with factorization caching, and an indirect method that requires only the operator associated with the problem data and its adjoint. The implementation includes interfaces to CVX, CVXPY, and MATLAB, as well as test routines. This code is described in detail in an associated paper, at http://www.stanford.edu/~boyd/papers/pdos.html (which also links to the code).	ALv2
Stanford University - Boyd Publications	ECOS: An SOCP Solver for Embedded Systems	Analytics	2014-07	https://github.com/ifa-ethz/ecos.git	stats	ECOS is a lightweight primal-dual homogeneous interior-point solver for SOCPs, for use in embedded systems as well as a base solver for use in large scale distributed solvers. It is described in the paper at http://www.stanford.edu/~boyd/papers/ecos.html.	ALv2
Stanford University - Boyd Publications	Proximal Operators	Analytics	2014-07	https://github.com/cvxgrp/proximal.git	stats	This library contains sample implementations of various proximal operators in Matlab. These implementations are intended to be pedagogical, not the most performant. This code is associated with the paper Proximal Algorithms by Neal Parikh and Stephen Boyd.	ALv2
Stanford University - Hanrahan Publications	imMens	Visualization	2014-07	https://github.com/StanfordHCI/imMens.git	stats	imMens is a web-based system for interactive visualization of large databases. imMens uses binned aggregation to produce summary visualizations that avoid the shortcomings of standard sampling-based approaches. Through data decomposition methods (to limit data transfer) and GPU computation via WebGL (for parallel query processing), imMens enables real-time (50fps) visual querying of billion+ element databases.	BSD
Stanford University - Hanrahan Publications	trelliscope	Visualization	2014-07	https://github.com/hafen/trelliscope.git	stats	Trellis Display, developed in the 90s, also divides the data. A visualization method is applied to each subset and shown on one panel of a multi-panel trellis display. This framework is a very powerful mechanism for all data, large and small. Trelliscope, a layer that uses datadr, extends Trellis to large complex data. An interactive viewer is available for viewing subsets of very large displays, and the software provides the capability to sample subsets of panels from rigorous sampling plans. Sampling is often necessary because in most applications, there are too many subsets to look at them all.	BSD
Stanford University - Hanrahan Publications	RHIPE: R and Hadoop Integrated Programming Environment	Infrastructure	2014-07	https://github.com/saptarshiguha/RHIPE.git	stats	In Divide and Recombine (D&R), big data are divided into subsets in one or more ways, forming divisions. Analytic methods, numeric-categorical methods of machine learning and statistics plus visualization methods, are applied to each of the subsets of a division. Then the subset outputs for each method are recombined. D&R methods of division and recombination seek to make the statistical accuracy of recombinations as large as possible, ideally close to that of the hypothetical direct, all-data application of the methods. The D&R computational environment starts with RHIPE, a merger of R and Hadoop. RHIPE allows an analyst to carry out D&R analysis of big data wholly from within R, and use any of the thousands of methods available in R. RHIPE communicates with Hadoop to carry out the big, parallel computations.	ALv2
Stanford University - Hanrahan Publications	Riposte	Analytics	2014-07	https://github.com/jtalbot/riposte.git	stats	Riposte is a fast interpreter and JIT for R. The Riposte VM has 2 cooperative subVMs for R scripting (like Java) and for R vector computation (like APL). Our scripting code has been 2-4x faster in Riposte than in R's recent bytecode interpreter. Vector-heavy code is 5-10x faster. Speeding up R can greatly increase the analyst's efficiency.	BSD
Stanford University - Olukotun Publications	Delite	Infrastructure	2014-07	https://github.com/stanford-ppl/Delite.git	stats	Delite is a compiler framework and runtime for parallel embedded domain-specific languages (DSLs).	BSD
Stanford University - Olukotun Publications	SNAP	Infrastructure	2014-07	https://github.com/snap-stanford/snap	stats	Stanford Network Analysis Platform (SNAP) is a general purpose network analysis and graph mining library. It is written in C++ and easily scales to massive networks with hundreds of millions of nodes, and billions of edges. It efficiently manipulates large graphs, calculates structural properties, generates regular and random graphs, and supports attributes on nodes and edges.	BSD
SYSTAP, LLC	bigdata	Infrastructure	2014-07	https://bigdata.svn.sourceforge.net/svnroot/bigdata/	stats	Bigdata enables massively parallel graph processing on GPUs and many core CPUs. The approach is based on the decomposition of a graph algorithm as a vertex program. The initial implementation supports an API based on the GraphLab 2.1 Gather Apply Scatter (GAS) API. Execution is available on GPUs, Intel Xeon Phi (aka MIC), and multi-core CPUs.	GPLv2
SYSTAP, LLC	mpgraph	Analytics	2014-07	http://svn.code.sf.net/p/mpgraph/code/	stats	Mpgraph enables massively parallel graph processing on GPUs and many core CPUs. The approach is based on the decomposition of a graph algorithm as a vertex program. The initial implementation supports an API based on the GraphLab 2.1 Gather Apply Scatter (GAS) API. Execution is available on GPUs, Intel Xenon Phi (aka MIC), and multi-core GPUs.	ALv2
UC Davis	Gunrock	Analytics	2014-07	https://github.com/gunrock/gunrock.git	stats	Gunrock is a CUDA library for graph primitives that refactors, integrates, and generalizes best-of-class GPU implementations of breadth-first search, connected components, and betweenness centrality into a unified code base useful for future development of high-performance GPU graph primitives.	ALv2
Draper Laboratory Publications	Analytic Activity Logger	Infrastructure	2014-07	https://github.com/draperlab/xdatalogger.git	stats	Analytic Activity Logger is an API that creates a common message passing interface to allow heterogeneous software components to communicate with an activity logging engine. Recording a user's analytic activities enables estimation of operational context and workflow. Combined with psychophysiology sensing, analytic activity logging further enables estimation of the user's arousal, cognitive load, and engagement with the tool.	ALv2
University of California, Berkeley Publications	BDAS	Infrastructure	2014-07	N/A		BDAS, the Berkeley Data Analytics Stack, is an open source software stack that integrates software components being built by the AMPLab to make sense of Big Data.	ALv2, BSD
University of California, Berkeley Publications	Spark	Infrastructure	2014-07	https://github.com/mesos/spark.git	stats	Apache Spark is an open source cluster computing system that aims to make data analytics both fast to run and fast to write. To run programs faster, Spark offers a general execution model that can optimize arbitrary operator graphs, and supports in-memory computing, which lets it query data faster than disk-based engines like Hadoop. To make programming faster, Spark provides clean, concise APIs in Python, Scala and Java. You can also use Spark interactively from the Scala and Python shells to rapidly query big datasets.	ALv2
University of California, Berkeley Publications	Shark	Infrastructure	2014-07	https://github.com/amplab/shark.git	stats	Shark is a large-scale data warehouse system for Spark that is designed to be compatible with Apache Hive. It can execute Hive QL queries up to 100 times faster than Hive without any modification to the existing data or queries. Shark supports Hive's query language, metastore, serialization formats, and user-defined functions, providing seamless integration with existing Hive deployments and a familiar, more powerful option for new ones.	ALv2
University of California, Berkeley Publications	BlinkDB	Infrastructure	2014-07	https://github.com/sameeragarwal/blinkdb.git	stats	BlinkDB is a massively parallel, approximate query engine for running interactive SQL queries on large volumes of data. It allows users to trade off query accuracy for response time, enabling interactive queries over massive data by running queries on data samples and presenting results annotated with meaningful error bars. To achieve this, BlinkDB uses two key ideas: (1) an adaptive optimization framework that builds and maintains a set of multi-dimensional samples from original data over time, and (2) a dynamic sample selection strategy that selects an appropriately sized sample based on a query's accuracy and/or response time requirements. We have evaluated BlinkDB on the well-known TPC-H benchmarks and a real-world analytic workload derived from Conviva Inc., and are in the process of deploying it at Facebook Inc.	ALv2
University of California, Berkeley Publications	Mesos	Infrastructure	2014-07	https://git-wip-us.apache.org/repos/asf/mesos.git	stats	Apache Mesos is a cluster manager that provides efficient resource isolation and sharing across distributed applications, or frameworks. It can run Hadoop, MPI, Hypertable, Spark, and other applications on a dynamically shared pool of nodes.	ALv2
University of California, Berkeley Publications	Tachyon	Infrastructure	2014-07	https://github.com/amplab/tachyon.git	stats	Tachyon is a fault tolerant distributed file system enabling reliable file sharing at memory-speed across cluster frameworks, such as Spark and MapReduce. It achieves high performance by leveraging lineage information and using memory aggressively. Tachyon caches working set files in memory, and enables different jobs/queries and frameworks to access cached files at memory speed. Thus, Tachyon avoids going to disk to load datasets that are frequently read.	BSD
University of Southern California Publications	goffish	Infrastructure	2014-07	https://github.com/usc-cloud/goffish.git	stats	The GoFFish project offers a distributed framework for storing timeseries graphs and composing graph analytics. It takes a clean-slate approach that leverages best practices and patterns from scalable data analytics such as Hadoop, HDFS, Hive, and Giraph, but with an emphasis on performing native analytics on graph (rather than tuple) data structures. This offers a more intuitive storage, access and programming model for graph datasets while also ensuring performance optimized for efficient analysis over large graphs (millions-billions of vertices) and many instances of them (thousands-millions of graph instances).	ALv2
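
The Divide and Recombine (D&R) workflow described in the RHIPE entry above can be illustrated with a small sketch. This is not RHIPE code and does not use R or Hadoop; it is a plain Python illustration in which the function names (divide, apply_method, recombine), the round-robin division, and the choice of the mean as the analytic method are all assumptions made for the example.

```python
from statistics import fmean


def divide(records, n_subsets):
    """Divide the data into n_subsets subsets (round-robin; an assumed scheme)."""
    return [records[i::n_subsets] for i in range(n_subsets)]


def apply_method(subset):
    """Apply the analytic method to one subset: here, its mean and its size."""
    return fmean(subset), len(subset)


def recombine(outputs):
    """Recombine subset outputs by averaging the means, weighted by subset size."""
    total = sum(n for _, n in outputs)
    return sum(mean * n for mean, n in outputs) / total


# Hypothetical data set: the numbers 1..100.
data = [float(x) for x in range(1, 101)]
subsets = divide(data, 4)
outputs = [apply_method(s) for s in subsets]  # this step could run in parallel
estimate = recombine(outputs)
print(estimate)
```

For the mean, the size-weighted recombination reproduces the all-data result. For most other methods the recombined output is only an approximation of the all-data result, which is why the entry describes D&R research as seeking divisions and recombinations with high statistical accuracy.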

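The bigdata and mpgraph entries above describe decomposing a graph algorithm into a vertex program in the Gather Apply Scatter (GAS) style. The following is a minimal single-threaded sketch of that idea, assuming connected-components labeling as the algorithm. It is not the GraphLab, mpgraph, or bigdata API; the function name and data layout are hypothetical.

```python
def gas_connected_components(edges, num_vertices):
    """Label each vertex with the smallest vertex id in its component."""
    # Adjacency lists for an undirected graph.
    neighbors = [[] for _ in range(num_vertices)]
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)

    label = list(range(num_vertices))  # each vertex starts with its own id
    active = set(range(num_vertices))  # all vertices run in the first round

    while active:
        proposed = {}
        for v in active:
            # Gather: collect the minimum label among the neighbors.
            gathered = min((label[u] for u in neighbors[v]), default=label[v])
            # Apply: compute the new vertex value from the gathered result.
            proposed[v] = min(label[v], gathered)

        next_active = set()
        for v, new_label in proposed.items():
            if new_label < label[v]:
                label[v] = new_label
                # Scatter: a changed vertex activates its neighbors.
                next_active.update(neighbors[v])
        active = next_active

    return label


# Hypothetical graph: components {0, 1, 2}, {3, 4}, and the isolated vertex 5.
labels = gas_connected_components([(0, 1), (1, 2), (3, 4)], 6)
print(labels)
```

Each round touches only the active vertices, and the gather and apply steps for different vertices are independent of one another within a round, which is the property that makes this style amenable to parallel execution on GPUs and many-core CPUs.
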
Publications:

XData Team	Title	Link
Boeing/Pitt	Impact of precision of Bayesian network parameters on accuracy of medical diagnostic systems	http://www.ncbi.nlm.nih.gov/pubmed/23466438
Boeing/Pitt	An Empirical Comparison of Bayesian Network Parameter	http://d-scholarship.pitt.edu/19109/
Carnegie Mellon University	Efficient Learning on Point Sets	http://www.autonlab.org/autonweb/21880.html
Carnegie Mellon University	Learning from Point Sets with Observational Bias	http://www.cs.cmu.edu/~schneide/cond-div.pdf
Carnegie Mellon University	On Learning from Collective Data	http://www.cs.cmu.edu/~schneide/xiong_PhD_draft.pdf
Carnegie Mellon University	More Effective Distributed ML via a Stale Synchronous Parallel Parameter Server	http://reports-archive.adm.cs.cmu.edu/anon/ml2013/CMU-ML-13-103.pdf
Carnegie Mellon University	A Scalable Approach to Probabilistic Latent Space Inference of Large-Scale Networks	http://www.cs.cmu.edu/~junmingy/papers/Yin-Ho-Xing-NIPS13.pdf
Carnegie Mellon University	Parallel Markov Chain Monte Carlo for Nonparametric Mixture Models	http://www.cs.cmu.edu/~epxing/papers/2013/Dubey_Williamson_Xing_ICML13.pdf
Continuum Analytics and Indiana University	Overplotting: Unified solutions under Abstract Rendering	http://www.crest.iu.edu/publications/prints/2013/Cottam2013AR.pdf
Continuum Analytics and Indiana University	Abstract Rendering: Out-of-core Rendering for Information Visualization	To appear in SPIE: Visualization and Data Analysis (VDA) 2014
Georgia Tech / GTRI	To Gather Together for a Better World: Understanding and Leveraging Communities in Micro-lending Recommendation	https://smartech.gatech.edu/bitstream/handle/1853/49249/GT-CSE-2013-05.pdf?sequence=1
Georgia Tech / GTRI	A Better World for All: Understanding and Promoting Micro-finance Activities in Kiva.org	https://smartech.gatech.edu/bitstream/handle/1853/49182/GT-CSE-2013-03.pdf
Georgia Tech / GTRI	UTOPIAN: User-Driven Topic Modeling Based on Interactive Nonnegative Matrix Factorization	http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6634167
Georgia Tech / GTRI	Dyadic Event Attribution in Social Networks with Mixtures of Hawkes Processes	http://www.cc.gatech.edu/~zha/papers/km0600s-li-1.pdf
Georgia Tech / GTRI	Scalable Influence Estimation in Continuous-Time Diffusion Networks	http://papers.nips.cc/paper/4857-scalable-influence-estimation-in-continuous-time-diffusion-networks
Georgia Tech / GTRI	Uncover Topic-Sensitive Information Diffusion Networks	http://jmlr.org/proceedings/papers/v31/du13a.pdf
Georgia Tech / GTRI	Hierarchical Clustering of Hyperspectral Images using Rank-Two Nonnegative Matrix Factorization	http://arxiv.org/abs/1310.7441
Georgia Tech / GTRI	Fast rank-2 nonnegative matrix factorization for hierarchical document clustering	http://dl.acm.org/citation.cfm?id=2487575.2487606; http://www.cc.gatech.edu/grads/d/dkuang3/pub/fp0269-kuang.pdf
Georgia Tech / GTRI	Augmenting MATLAB with Semantic Objects for an Interactive Visual Environment	http://poloclub.gatech.edu/idea2013/papers/p64-lee.pdf
Georgia Tech / GTRI	Mixture of Mutually Exciting Processes for Viral Diffusion	http://machinelearning.wustl.edu/mlpapers/paper_files/yang13a.pdf
Georgia Tech / GTRI	Learning Social Infectivity in Sparse Low-rank Networks Using Multi-dimensional Hawkes Processes	http://jmlr.org/proceedings/papers/v31/zhou13a.pdf
Georgia Tech / GTRI	Learning Triggering Kernels for Multi-dimensional Hawkes Processes	http://jmlr.org/proceedings/papers/v28/zhou13.pdf
IBM Research	Random Projections for Support Vector Machines	http://arxiv.org/pdf/1211.6085
IBM Research	Efficient Dimensionality Reduction for Canonical Correlation Analysis	http://arxiv.org/pdf/1209.2185
IBM Research	Improved matrix algorithms via the Subsampled Randomized Hadamard Transform	http://arxiv.org/pdf/1204.0062
IBM Research	Near-optimal Coresets For Least-Squares Regression	http://arxiv.org/pdf/1202.3505
IBM Research	Deterministic Feature Selection for K-means Clustering	http://arxiv.org/pdf/1109.5664
IBM Research	Low-rank Approximation and Regression in Input Sparsity Time	http://arxiv.org/pdf/1207.6365
IBM Research	Subspace Embeddings and L_p-Regression Using Exponential Random Variables	http://arxiv.org/pdf/1304.6475v2.pdf
IBM Research	Revisiting Asynchronous Linear Solvers: Provable Convergence Rate Through Randomization	http://arxiv.org/pdf/1304.6475v2.pdf
IBM Research	Highly Scalable Linear Time Estimation of Spectrograms - A Tool for Very Large Scale Data Analysis	To appear
IBM Research	Near-Optimal Column-Based Matrix Reconstruction	http://arxiv.org/pdf/1103.0995v3
IBM Research	Faster Subset Selection for Matrices and Applications	http://arxiv.org/pdf/1201.0127v4.pdf
IBM Research	Sketching Structured Matrices for Faster Nonlinear Regression Haim Avron	To appear
IBM Research	Quantile Regression for Large-scale Applications	http://arxiv.org/pdf/1305.0087
Johns Hopkins University	Locality statistics for anomaly detection in time series of graphs	http://arxiv.org/abs/1306.0267
Johns Hopkins University	Universally consistent vertex classification for latent positions graphs	http://arxiv.org/abs/1212.1182
Johns Hopkins University	Seeded graph matching for large stochastic block model graphs	http://arxiv.org/pdf/1310.1297.pdf
Johns Hopkins University	Perfect Clustering for Stochastic Blockmodel Graphs via Adjacency Spectral Embedding	http://arxiv.org/pdf/1310.0532.pdf
Johns Hopkins University	Out-of-sample Extension for Latent Position Graphs	http://arxiv.org/abs/1305.4893
Johns Hopkins University	Generalized Canonical Correlation Analysis for Classification in High Dimensions	http://arxiv.org/abs/1304.7981
Johns Hopkins University	Seeded graph matching for correlated Erdos-Renyi graphs	http://arxiv.org/abs/1304.7844
Johns Hopkins University	On the Incommensurability Phenomenon	http://arxiv.org/abs/1301.1954
Johns Hopkins University	Vertex Nomination Schemes for Membership Prediction	http://arxiv.org/abs/1312.2638
Johns Hopkins University	Robust Vertex Classification	http://arxiv.org/abs/1311.5954
Johns Hopkins University	Consistent Latent Position Estimation and Vertex Classification for Random Dot Product Graphs	http://arxiv.org/abs/1207.6745
Johns Hopkins University	A limit theorem for scaled eigenvectors of random dot product graphs	http://arxiv.org/abs/1305.7388
Johns Hopkins University	Statistical inference on errorfully observed graphs	http://arxiv.org/abs/1211.3601
Johns Hopkins University	Seeded Graph Matching	http://arxiv.org/abs/1209.0367
Harvard	Graphlet decomposition of a weighted network	http://www.people.fas.harvard.edu/~airoldi/pub/journals/j024.AzariAiroldi2012JMLRWCP.pdf
Harvard and Kitware, Inc.	Entourage: Visualizing Relationships between Biological Pathways using Contextual Subsets	http://people.seas.harvard.edu/~alex/papers/2013_infovis_entourage.pdf
Harvard and Kitware, Inc.	LineUp: Visual Analysis of Multi-Attribute Rankings	http://people.seas.harvard.edu/~alex/papers/2013_infovis_lineup.pdf
MDA Information Systems, Inc., University of Southern California	Unlocking Big Data	http://issuu.com/kmi_media_group/docs/gif_11-5_final/27
MDA Information Systems, Inc., University of Southern California	Mapping Semantic Workflows to Alternative Workflow Execution Engines. Gil, Y.	http://www.isi.edu/~gil/papers/gil-icsc13.pdf
MDA Information Systems, Inc., University of Southern California	Capturing Data Analytics and Visualization Expertise with Workflows	http://www.isi.edu/~gil/papers/kale-etal-aaaifss13.pdf
MDA Information Systems, Inc., University of Southern California	Time-Bound Analytic Tasks on Large Datasets through Dynamic Configuration of Workflows	http://www.isi.edu/~gil/papers/gil-etal-works13.pdf
MDA Information Systems, Inc., University of Southern California	Large-Scale Multimedia Content Analysis Using Scientific Workflows. Jo, H., Sethi, R., Philpot, A., and Gil, Y.	http://www.isi.edu/~gil/papers/sethi-etal-mm13.pdf
MIT-LL	Content + Context Networks for User Classification in Twitter	http://snap.stanford.edu/networks2013/papers/netnips2013_submission_3.pdf
MIT-LL	Combining Content, Network and Profile Features for User Classification in Twitter	http://www.ll.mit.edu/mission/cybersec/publications/HLTpublications.html
Oculus Info Inc.	Visual Thinking Design Patterns	http://www.oculusinfo.com/assets/pdfs/papers/Ware_Et_Al_VTDP_2013.pdf
Oculus Info Inc.	Aperture: An Open Web 2.0 Visualization Framework	http://www.oculusinfo.com/assets/pdfs/papers/HICSS_Aperture_Framework.pdf
Oculus Info Inc.	Tile Based Visual Analytics for Twitter Big Data Exploratory Analysis	http://www.oculusinfo.com/assets/pdfs/papers/Submitted_Oculus_Big_Data_Twitter_Plots_23Aug2013.pdf
Oculus Info Inc.	Interactive Data Exploration with 'Big Data Tukey Plots'	http://www.oculusinfo.com/assets/pdfs/papers/Submitted_Oculus_Big_Data_Scatter_Plot_EDA_9Aug2013_Final_better.pdf
Oculus Info Inc.	Louvain Clustering for Big Data Graph Visual Analytics	http://www.oculusinfo.com/assets/pdfs/papers/Submitted_Oculus_Big_Data_Louvain_Clustering_9Aug2013_Final.pdf
Scientific Systems Company, Inc., MIT, and University of Louisville	Advanced Machine Learning and Statistical Inference Approaches for Big Data Analytics and Information Fusion	To appear
Sotera Defense Solutions, Inc.	Correlation Using Pair-wise Combinations of Multiple Data Sources and Dimensions at Ultra-Large Scales	To appear
Sotera Defense Solutions, Inc.	Data in the Aggregate: Discovering Honest Signals and Predictable Patterns within Ultra Large Data Sets	https://github.com/Sotera/aggregate-micro-paths/blob/master/AggregateMicropathing_draft.pdf?raw=true
Stanford University - Hanrahan, Purdue, PNNL	Large-Scale Exploratory Analysis, Cleaning, and Modeling for Event Detection in Real-World Power Systems Data	http://ml.stat.purdue.edu/gaby/BigData.ExploreCleanModel.2013.pdf
Stanford University - Hanrahan, Purdue, PNNL	EDA and ML - A Perfect Pair for Large-Scale Data Analysis	http://ml.stat.purdue.edu/gaby/MLandEDAforBigData.pdf
Stanford University - Hanrahan, Purdue, PNNL	Power Grid Data Analysis with R and Hadoop	http://ml.stat.purdue.edu/gaby/RHadoop.PowerGridDataAnalysis.2013.pdf
Stanford University - Hanrahan, Purdue, PNNL	imMens: Real-time Visual Querying of Big Data	http://ml.stat.purdue.edu/gaby/imMensEuroVis.2013.pdf
Stanford University - Boyd	Proximal Algorithms	http://www.stanford.edu/~boyd/papers/prox_algs.html
Stanford University - Boyd	A Primal-Dual Operator Splitting Method for Conic Optimization	http://www.stanford.edu/~boyd/papers/pdos.html
Stanford University - Boyd	Operator Splitting for Conic Optimization via Homogeneous Self-Dual Embedding	http://www.stanford.edu/~boyd/papers/scs.html
Stanford University - Boyd	ECOS: An SOCP Solver for Embedded Systems	http://www.stanford.edu/~boyd/papers/ecos.html
Stanford University - Boyd	Code Generation for Embedded Second-Order Cone Programming	http://www.stanford.edu/~boyd/papers/ecos_codegen_ecc.html
Stanford University - Hanrahan, Purdue, PNNL	Trelliscope: A System for Detailed Visualization in the Deep Analysis of Large Complex Data	http://ml.stat.purdue.edu/gaby/trelliscope.ldav.2013.pdf
Stanford University - Olukotun	NIFTY: A System for Large Scale Information Flow Tracking and Clustering	http://www.stanford.edu/~shhuang/papers/nifty_www2013.pdf
Stanford University - Olukotun	Composition and Reuse with Compiled Domain-Specific Languages	http://ppl.stanford.edu/papers/ecoop13_sujeeth.pdf
Stanford University - Olukotun	Dimension Independent Similarity Computation	http://jmlr.org/papers/v14/bosagh-zadeh13a.html
Stanford University - Olukotun	On the precision of social and information networks	http://doi.acm.org/10.1145/2512938.2512955
Stanford University - Olukotun	Forge: Generating a High Performance DSL Implementation from a Declarative Specification	http://dl.acm.org/citation.cfm?id=2517220
The New School	Data Visualization for Big Data (Goranson	https://www.dropbox.com/sh/ea6ya5cpnxreuak/Ae6tpC9L30/Data_Visualization_for_Big_Data_Parsons.pdf
The New School	IAM - Incremental Agent-Based Mapping	https://www.dropbox.com/sh/ea6ya5cpnxreuak/NM7qycXc7l/IAM_Cognitive_Mapping_Thesis_Parsons.pdf
The New School	Expediting Cooperation in Government funded Open Source Programs: Incremental Agent-based Mapping, a Pattern Language for Collaborative Cognition	https://www.dropbox.com/sh/ea6ya5cpnxreuak/F6INhNz-PE/IAM_DARPA_FIN_Compiled_Parsons.pdf
The New School	Design Methodology of the XDATA Program	https://www.dropbox.com/sh/ea6ya5cpnxreuak/RyhknZGYPS/XDATA_Design_Methodology_Parsons.pdf
The New School	Data Visualization Design Guidelines	https://www.dropbox.com/sh/ea6ya5cpnxreuak/T71QA5z_iI/Data_Visualization_Design_Guidelines_Parsons_FIN.pdf
The New School	Big Data and Knowledge Discovery Through Metapictorial Visualization	https://www.dropbox.com/sh/ea6ya5cpnxreuak/q6h5q76tfY/Big_Data_and_Knowledge_Discovery_Metapictorial_Visualization_Parsons.pdf
The New School	Design and Visualization Best Practices for Big Data: Enhancing Data Discovery through Improved Usability	https://www.dropbox.com/sh/ea6ya5cpnxreuak/V5fag28-jZ/XDATA_GUI_Design_Volume_I_Parsons.pdf
University of California, Berkeley	Carat: Collaborative Energy Diagnosis for Mobile Devices	https://amplab.cs.berkeley.edu/publication/carat-sensys/
University of California, Berkeley	Discretized Streams: Fault-Tolerant Streaming Computation at Scale	http://dl.acm.org/citation.cfm?doid=2517349.2522737
University of California, Berkeley	Sparrow: Distributed, Low Latency Scheduling	https://amplab.cs.berkeley.edu/publication/sparrow-distributed-low-latency-scheduling/
University of California, Berkeley	A General Bootstrap Performance Diagnostic	https://amplab.cs.berkeley.edu/publication/a-general-bootstrap-performance-diagnostic/
University of California, Berkeley	MLI: An API for Distributed Machine Learning	https://amplab.cs.berkeley.edu/publication/mli-an-api-for-distributed-machine-learning/
University of California, Berkeley	Leveraging Endpoint Flexibility in Data-Intensive Clusters	https://amplab.cs.berkeley.edu/publication/leveraging-endpoint-flexibility-in-data-intensive-clusters/
University of California, Berkeley	Shark: SQL and Rich Analytics at Scale	https://amplab.cs.berkeley.edu/publication/shark-sql-and-rich-analytics-at-scale/
University of California, Berkeley	GraphX: A Resilient Distributed Graph System on Spark	https://amplab.cs.berkeley.edu/publication/graphx-grades/
University of California, Berkeley	RTP: Robust Tenant Placement for Elastic In-Memory Database Clusters	https://amplab.cs.berkeley.edu/publication/rtp-robust-tenant-placement-for-elastic-in-memory-database-clusters/
University of California, Berkeley	Bolt-on Causal Consistency	https://amplab.cs.berkeley.edu/publication/bolt-on-causal-consistency/
University of California, Berkeley	BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data	https://amplab.cs.berkeley.edu/publication/blinkdb-queries-with-bounded-errors-and-bounded-response-times-on-very-large-data/
University of California, Berkeley	MDCC: Multi-Data Center Consistency	https://amplab.cs.berkeley.edu/publication/mdcc-multi-data-center-consistency/
University of California, Berkeley	The Case for Tiny Tasks in Compute Clusters	https://amplab.cs.berkeley.edu/publication/the-case-for-tiny-tasks-in-compute-clusters/
University of California, Berkeley	Presto: Distributed Machine Learning and Graph Processing with Sparse Matrices	https://amplab.cs.berkeley.edu/publication/presto-distributed-machine-learning-and-graph-processing-with-sparse-matrices/
University of California, Berkeley	MLbase: A Distributed Machine-learning System	https://amplab.cs.berkeley.edu/publication/mlbase-a-distributed-machine-learning-system/
University of California, Berkeley	Coflow: A Networking Abstraction for Cluster Applications	https://amplab.cs.berkeley.edu/publication/the-potential-dangers-of-causal-consistency-and-an-explicit-solution/
Royal Caliber	VertexAPI2 - A Vertex-Program API for Large Graph Computations on the GPU	http://www.royal-caliber.com/vertexapi2.pdf
Draper Laboratory	Measuring the value of big data exploitation systems: quantitative, non-subjective metrics with the user as a key component	http://pjim.newschool.edu/issues/2014/01/
