The Mind Wanders

On my walk to the Berkeley campus this morning I brushed my fingers through the leaves of an aromatic shrub, and then inhaled the familiar fragrance. I do this every day, and every day the first word that stands up and waves its arms around demanding my attention is sage. But I know the plant is not sage, it’s rosemary, so I tell sage to sit down. Too late. With rosemary and sage already front and center, I can’t stop parsley and thyme from sauntering onstage to join them, and then along come the first bars of the melody and the faces on the album cover, and I’m back in the middle of the 1960s wearing a paisley shirt. Meanwhile, rosemary evokes Rosemary Woods and the \(13\)-minute gap (though I now know, after consulting the global collective memory, that it’s Rose Mary Woods and the \(18\frac{1}{2}\)-minute gap). From Watergate I leap forward to stories on this morning’s front page. Then I notice another plant in a well-tended garden, with furry gray-green leaves. This one isn’t sage either; it’s lamb’s ear. Nevertheless, sage finally has its moment in the spotlight. From the herb I move on to the mathematical software called Sage, and then to SAGE, the Semi-Automatic Ground Environment, the 1950s air-defense system controlled by the largest computer ever built.

In psychology and literature, this kind of mental rambling is called stream of consciousness, a metaphor we owe to William James. It’s not the metaphor I would have chosen. My own consciousness, as I experience it, does not flow smoothly from one topic to the next but seems to flit across a landscape of ideas, more like a butterfly than a river, sometimes alighting daintily on one flower and then the next, sometimes carried away by gusts of wind, sometimes revisiting favorite spots over and over.

As a way of probing the architecture of my own memory, I have tried a more deliberate experiment in free association. I began with the same herbal recipe—parsley, sage, rosemary, and thyme—but for this exercise I wasn’t strolling through the garden spots of the Berkeley hills; I was sitting at a desk taking notes. The diagram below is my best effort at reconstructing the complete train of thought.

parsley, sage,rosemary, thyme Four herbs; also a line in the refrain of a Simon and Garfunkel song. Simon and Garfunkel Paul Simon and Art Garfunkel, folk-rock singing duo of the 1960s and 70s. Song by Simon and Garfunkel, and a character in the Mike Nichols movie "The Graduate." “Mrs. Robinson”Song by Simon and Garfunkel, and a character in the Mike Nichols movie "The Graduate." “Where have you gone,Joe DiMaggio?” Question asked in 'Mrs. Robinson'. Simon and Schuster Publishing house founded in 1924 by Richard Simon and Max Schuster (initially to publish crossword puzzles). Jackie Robinson Legendary Brooklyn Dodger. Robinson Crusoe Shipwreck novel by Daniel Defoe (1719). Swiss Family Robinson Shipwreck novel by Johann David Wyss (1812). herbs Flavorful plants. Mr. Wizard 1950s Saturday-morning science show for children; host: Don Herbert. Alpert Trumpeter Herb Alpert. “Plastics” Career advice offered in "The Graduate." coo-coo-ca-choo Line in "Mrs. Robinson." Frank Robinson Outfielder for the Baltimore Orioles circa 1970. Graig Nettles Third baseman for the New York Yankees in the 1970s. Dustin Hoffman Actor who played the title role in "The Graduate." Abby Hoffman "Yipee!" Leominster Massachusetts city that was the cradle of the American plastics industry. Brooks Robinson Third baseman for the Baltimore Orioles in the 1970s. Papillon 1973 film in which Dustin Hoffman had a supporting role; "papillon" is French for "butterfly." Nabokov Vladimir Nabokov, Russian-born author and lepidopterist. butterfly, Schmetterling,mariposa, farfalla Words for "butterfly" in English, German, Spanish, and Italian; all of them (and the French) seem to be of independent origin. What’s the Russian for butterfly? I don't know. Or I didn't when the question arose. “I am the Walrus” Beatles song from 1967 that also includes the phrase "coo-coo-ca-choo." Carly Simon Singer. No relation to Paul Simon, but daughter of Richard Simon. “You’re so vain” Song by Carly Simon. X

Scrolling through the chart from top to bottom reveals the items in the order my brain presented them to me, but the linkages between nodes do not form a single linear sequence. Instead the structure is treelike, with short chains of sequential associations ending with an abrupt return to an earlier node, as if I were being snapped back by a rubber band. These interruptions are marked in the diagram by green upward arrows; the red “X” at the bottom is where I decided to end the experiment.

My apologies to the half of humanity born since 1990, who will doubtless find many of the items mentioned in the diagram antiquated or obscure. You can hover over the labels for pop-up explanations, although I doubt they will make the associations any more meaningful. Memories, after all, are personal; they live inside your head. If you want a collection of ideas that resonate with your own experience, you’ll just have to create your own free-association diagram. I highly recommend it: You may discover something you didn’t know you knew.

The destination of my daily walk down the hill in Berkeley is the Simons Institute for the Theory of Computing, where I am immersed in a semester-long program on the Brain and Computation. It’s an environment that inspires thoughts about thoughts. I begin to wonder: What would it take to build a computational model of the free-association process? Among the various challenges proposed for artificial intelligence, this one looks easy. There’s no need for deep ratiocination; what we are asked to simulate is just woolgathering or daydreaming—what the mind does when it’s out of gear and the engine is idling. It ought to be effortless, no?

For the design of such a computational model, the first idea that comes to mind (at least to my mind) is a random walk on a mathematical graph, or network. The nodes of the network are things stored in memory—ideas, facts, events—and the links are various kinds of associations between them. For example, a node labeled butterfly might have links to moth, caterpillar, monarch, and frittillary, as well as the translations mentioned in the diagram above, and perhaps some less obvious connections, such as Australian crawl, shrimp, Muhammad Ali, pellagra, throttle valve, and stage fright. The data structure for a node of the network would include a list of pointers to all of these associated nodes. The pointers could be numbered from \(1\) to \(n\); the program would generate a pseudorandom number in this range, and jump to the corresponding node, where the whole procedure would start afresh.

This algorithm captures a few basic features of free association, but it also misses quite a lot. The model assumes that all destination nodes are equally likely, which is implausible. To accommodate differences in probability, we could give each link \(i\) a weight \(w_i\), then make the probabilities proportional to the weights.

A further complication is that the weights depend on context—on one’s recent history of mental activity. If it weren’t for the combination of Mrs. Robinson and Jackie Robinson, would I have thought of Joe DiMaggio? And now, as I write this, Joltin’ Joe brings to mind Marilyn Monroe, and then Arthur Miller, and I am helpless to stop another whole train of associations. Reproducing this effect in a computer model would require some mechanism for dynamically adjusting the probabilities of entire categories of nodes, depending on which other nodes have been visited lately.

Recency effects of another kind should also be taken into account. The rubber band that repeatedly yanks me back to Simon and Garfunkel and Mrs. Robinson needs to have a place in the model. Perhaps each recently visited node should be added to the list of candidate destinations even if it is not otherwise linked to the current node. On the other hand, habituation is also a possibility: Ideas revisited too often become tiresome, and so they need to be suppressed in the model.

One final challenge: Some memories are not isolated facts or ideas but parts of a story. They have a narrative structure, with events unfolding in chronological order. Nodes for such episodic memories require a next link, and maybe a previous link, too. That chain of links holds your whole life together, to the extent you remember it.

Could a computational model like this one reproduce my mental meanderings? Gathering data for the model would be quite a chore, but that’s no surprise, since it has taken me a lifetime to fill my cranium with that jumble of herbs, Herbs, Simons, Robinsons, and Hoffmans. More worrisome than the volume of data is the fiddly nature of the graph-walking algorithm. It’s easy to say, “Pick a node according to a set of weighted probabilities,” but when I look at the gory details of how it’s done, I have a hard time imagining anything like that happening in the brain.

Here’s the simplest algorithm I know for random weighted selection. It’s not the most efficient such algorithm, but the better methods are even messier. Keith Schwarz has an excellent tutorial and review on the subject. Assume that the data structure representing a node of the network includes a list of links to other nodes and a corresponding list of weights. As shown in the diagram below, the program generates the sequence of cumulative sums of the weights: \(0, w_1, w_1 + w_2, w_1 + w_2 + w_3, \dots\). The next step is to normalize this sequence by dividing each number by the overall sum of the weights. Now we have a sequence of numbers \(p_i\) that increase monotonically from zero to one. Finally, the program selects a random real number \(x\) from the interval \([0, 1)\); ] \(x\) must lie in one of the normalized intervals \(p_i\), and this value of \(i\) identifies the next node to be selected.

Stages in the weighted random selection of a next node

In code—specifically in the Julia programming language—the node selection procedure looks like this:

function select_next(links, weights)
    total = sum(weights)
    cum_weights = cumsum(weights)
    probabilities = cum_weights / total
    x = rand()
    for i in 1:length(probabilities)
        if probabilities[i] >= x
            return i

I have slogged through these tedious details of cumulative sums and pseudorandom numbers as a way of emphasizing that the graph-walking algorithm is not as simple as it seems on first glance. And we still haven’t dealt with the matter of adjusting the probabilities on the fly, as attention drifts from topic to topic.

Even harder to fathom is the process of learning—adding new nodes and links to the network. I ended my session of free associating when I came to a question I couldn’t answer: “What’s the Russian for butterfly?” But I can answer it now. The next time I play this game, I’ll add babochka to my list of butterfly terms. In the computational model, inserting a node for babochka is easy enough, but the new node also needs to be linked to all the other butterfly nodes already present. Furthermore, babochka would introduce additional links of its own. It’s phonetically close to babushka (grandmother), one of the few Russian words in my vocabulary. The -ochka suffix is a diminutive, so it needs to be associated with French -ette and Italian -ini. The literal meaning of babochka is “little soul,” which suggests still more associations. Ultimately, learning a single new word might require a full reindexing of an entire tree of knowledge.

Let’s try a different model. Forget about the random walk on a network, with its spaghetti tangle of pointers to nodes. Instead, let’s just try to keep all similar things in the same neighborhood. In the memory banks of a digital computer, that means similar things have to be stored at nearby addresses. Here’s a hypothetical segment of memory centered on the concept dog. The nearby slots are occupied by other words, things, and categories that are likely to be evoked by the thought of dog: the obvious cat and puppy, various breeds of dogs and a few individual dogs (Skippy was the family pet when I was a kid), and some quirkier possibilities. Each item has a numeric address. The address has no intrinsic meaning, but it’s important that all the memory cells are numbered sequentially.

address content
19216805 god
19216806 the dog that didn’t bark in the night
19216807 Skippy
19216808 Lassie
19216809 canine
19216810 cat
19216811 dog
19216812 puppy
19216813 wolf
19216814 cave canem
19216815 Basset Hound
19216816 Weimaraner
19216817 dogmatic

A program for idly exploring this memory array could be quite simple. It would execute a random walk over the memory addresses, but with a bias in favor of small steps. For example, the next address to be visited might be determined by sampling from a normal distribution centered on the present location. Here’s the Julia code. (The function randn() returns a random real number drawn from the normal distribution with mean \(\mu = 0\) and standard deviation \(\sigma = 1\).)

function gaussian_ramble(addr, 𝜎)
    r = randn() * 𝜎
    return addr + round(Int, r)

The scheme has some attractive features. There’s no need to tabulate all the possible destinations as a preliminary to choosing one of them. Probabilities are not stored as numbers but are encoded by position within the array, and further modulated by the parameter 𝜎, which determines how far afield the procedure is willing to reach in the array. Although the program is still doing some arithmetic in order to sample from a normal distribution, that function could probably be in a simpler way.

But the procedure also has a dreadful defect. In surrounding dog with all of its immediate associates, we leave no room for their associates. The doggy terms are fine in their own context, but what about the cat in the list? Where do we put kitten and tiger and nine lives and Felix? In a one-dimensional array there’s no hope of embedding every memory within its own proper neighborhood.

So let’s shift to two dimensions! By splitting the addresses into two components, we can set up two orthogonal axes. The first half of each address becomes a \(y\) coordinate and the second half an \(x\) coordinate. Now dog and cat are still close neighbors, but they also have private spaces where they can play with their own friends.

2d cats and dogs

However, two dimensions aren’t enough, either. If we try to fill in all the correlatives of The Cat in the Hat, they will inevitably collide and conflict with those of the dog that didn’t bark in the night. Evidently we need more dimensions—a lot more.

Now would be a good moment for me to acknowledge that I am not the first person ever to think about how memories could be organized in the brain. A list of my predecessors might start with Plato, who compared memory to an aviary; we recognize our memories by their plumage, but sometimes we have trouble retrieving them as they flutter about in the cranial cage. The 16th-century Jesuit Matteo Ricci wrote of a “memory palace,” where we stroll through various rooms and corridors in search of treasures from the past. Modern theories of memory tend to be less colorful than these but more detailed, aiming to move beyond metaphor to mechanism. My personal favorite is a mathematical model devised in the 1980s by Pentti Kanerva, who is now at the Redwood Center for Theoretical Neuroscience here in Berkeley. He calls the idea sparse distributed memory, which I’m going to abbreviate as SDM. It makes clever use of the peculiar geometry of high-dimensional spaces.

Think of a cube in three dimensions. If the side length is taken as one unit, then the eight vertices can be labeled by vectors of three binary digits, starting with \(000\) and continuing through \(111\). At any vertex, changing a single bit of the vector takes you to a nearest-neighbor vertex. Changing two bits moves you to a next-nearest-neighbor, and flipping all three bits leads to the opposite corner of the cube—the most distant vertex.

Three-dimensional cube with vertices labeled by eight three-bit binary vectors.

The four-dimensional cube works the same way, with \(16\) vertices labeled by vectors that include all patterns of binary digits from \(0000\) through \(1111\). And indeed the description generalizes to \(N\) dimensions, where each vertex has an \(N\)-bit vector of coordinates. If we measure distance by the Manhattan metric—always moving along the edges of the cube and never taking shortcuts across a diagonal—the distance between any two vertices is simply the number of positions where the two coordinate vectors differ (also known as the Hamming distance). The usual symbol for exclusive OR is ⊕, sometimes called a bun. It reflects the interpretation of the XOR operation as binary addition modulo 2. Kanerva prefers ∗ or ⊗, on the grounds that the role of XOR in high-dimensional computing is more like multiplication than addition. I have decided to duck this controversy by adopting the symbol ⊻, an alternative notation for XOR common among logicians. It’s a modification of ∨, the symbol for inclusive OR. Conven­iently, it’s also the XOR symbol in Julia programs.Thus the unit of measure for distance is the bit, and calculating distance is a job for the binary exclusive OR operator (XOR, ⊻), which yields a value of \(1\) for pairs of bits that differ and \(0\) for pairs that are the same:

                         0 ⊻ 0 = 0
                         0 ⊻ 1 = 1
                         1 ⊻ 0 = 1
                         1 ⊻ 1 = 0

A Julia function for measuring the distance between vertices applies the XOR function to the two coordinate vectors and counts the \(1\)s in the result.

function distance(u, v)
    w = u ⊻ v
    return count_ones(w)

As \(N\) grows large, some curious properties of the \(N\)-cube come into view. Consider the \(1{,}000\)-dimensional cube, which has \(2^{1000}\) vertices. If you choose two of those vertices at random, what is the expected distance between them? Even though this is a question about distance, we can answer it without delving into any geometric details; it’s simply a matter of tallying the positions where the two binary vectors differ. For random vectors, each bit is \(0\) or \(1\) with equal probability, and so the vectors can be expected to differ at half of the bit positions. In the case of a \(1{,}000\)-bit vector, the typical distance is \(500\) bits. This outcome is not a great surprise. What does seem noteworthy is the way all the vertex-to-vertex distances cluster tightly around the mean value of 500.

Graph of frequency of each distance from 0 to 1,000 in 100 million random pairs of 1,000-bit vectors, showing strong clustering near the mean value of 500.

For \(1{,}000\)-bit vectors, almost all randomly chosen pairs lie at a distance between \(450\) and \(550\) bits. In a sample of \(100\) million random pairs (see graph above) none were closer than \(400\) bits or farther apart than \(600\) bits. Nothing about our life in low-dimensional space prepares us for this condensation of probability in the middle distance. Here on Earth, you might be able to find a place to stand where you’re all alone, and almost everyone else is several thousand miles away; however, there’s no way to arrange the planet’s population so that everyone has this experience simultaneously. But that’s the situation in \(1{,}000\)-dimensional space.

Needless to say, it’s hard to visualize a \(1{,}000\)-dimensional cube, but it’s possible to get a little intuition about the geometry from as few as five dimensions. Tabulated below are all the vertex coordinates of a five-dimensional unit cube, arranged according to their Hamming distance from the origin \(00000\). A majority of the vertices (20 out of 32) are at the middle distances of either two or three bits. The table would have the same shape if any other vertex were taken as the origin.

vertex coordinates of a five-dimensional cube, sorted by distance from the vertex 00000

A serious objection to all this talk of \(1{,}000\)-dimensional cubes is that we’ll never build one; there aren’t enough atoms in the universe for a structure with \(2^{1000}\) parts. But Kanerva points out that we need storage locations only for the items that we actually want to store. We could construct hardware for a random sample of, say, \(10^8\) vertices (each with a \(1{,}000\)-bit address) and leave the rest of the cube as a ghostly, unbuilt infrastructure. Kanerva calls the subset of vertices that exist in hardware hard locations. A set of \(10^8\) random hard locations would still exhibit the same squeezed distribution of distances as the full cube; indeed, this is precisely what the graph above shows.

The relative isolation of each vertex in the high-dimensional cube hints at one possible advantage of sparse distributed memory: A stored item has plenty of elbow room, and can spread out over a wide area without disturbing the neighbors. This is indeed one distinguishing feature of SDM, but there’s more to it.

Conventional computer memory enforces a one-to-one mapping between addresses and stored data items. The addresses are consecutive integers in a fixed range, such as \([0, 2^{64})\). Every integer in this range refers to a single, distinct location in the memory, and every location is associated with exactly one address. Also, each location holds just one value at a time; writing a new value wipes out the old one.

SDM breaks all of these rules. It has a huge address space—at least \(2^{1000}\)—but only a tiny, random fraction of those locations exist as physical entities; this is why the memory is said to be sparse. A given item of information is not stored in just one memory location; multiple copies are spread throughout a region—hence distributed. Furthermore, each individual address can hold multiple data items simultaneously. Thus information is both smeared out over a broad area and smushed together at the same site. The architecture also blurs the distinction between memory addresses and memory content; in many cases, the pattern of bits to be stored acts as its own address. Finally, the memory can respond to a partial or approximate address and find the correct item with high probability. Where the conventional memory is an “exact match machine,” SDM is a “best match machine,” retrieving the item most similar to the requested one.

In his 1988 book Kanerva gives a detailed quantitative analysis of a sparse distributed memory with \(1{,}000\) dimensions and \(1{,}000{,}000\) hard locations. The hard locations are chosen randomly from the full space of \(2^{1000}\) possible address vectors. Each hard location has room to store multiple \(1{,}000\)-bit vectors. The memory as a whole is designed to hold at least \(10{,}000\) distinct patterns. In what follows I’m going to consider this the canonical SDM model, although it is small by mammalian standards, and in his more recent work Kanerva has emphasized vectors with at least \(10{,}000\) dimensions.

Here’s how the memory works, in a simple computer implementation. The command store(X) writes the vector \(X\) into the memory, treating it as both address and content. The value \(X\) is stored in all the hard locations that lie within a certain distance of the address \(X\). For the canonical model this distance is 451 bits. It defines an “access circle” designed to encompass about \(1{,}000\) hard locations; in other words, each vector is stored in about \(1/1{,}000\)th of the million hard locations.

It’s important to note that the stored item \(X\) does not have to be chosen from among the \(1{,}000{,}000\) binary vectors that are addresses of hard locations. On the contrary, \(X\) can be any of the \(2^{1000}\) possible binary patterns.

Suppose a thousand copies of \(X\) have already been written into the SDM when a new item \(Y\) comes along, to be stored in its own set of a thousand hard locations. There might be some overlap between the two sets of locations—sites where both \(X\) and \(Y\) are stored. The later-arriving value does not overwrite or replace the earlier one; both values are retained. When the memory has been filled to its capacity of \(10{,}000\) vectors, each of them stored \(1{,}000\) times, a typical hard location will hold copies of \(10\) distinct patterns.

Now the question is: How can we make sense of this memory mélange? In particular, how can we retrieve the correct value of \(X\) without interference from \(Y\) and all the other items jumbled together in the same storage locations?

The readout algorithm makes essential use of the curious distance distribution in a high-dimensional space. Even if \(X\) and \(Y\) are nearest neighbors among the \(10{,}000\) stored patterns, they are likely to differ by 420 or 430 bits; as a result, the number of hard locations where both values are stored is quite small—typically four, five, or six. The same is true of all the other patterns overlapping \(X\). There are thousands of them, but no one interfering pattern is present in more than a handful of copies inside the access circle of \(X\).

The command fetch(X) should return the value that was earlier written by store(X). The first step in reconstructing the value is to gather up all information stored within the 451-bit access circle centered on \(X\). Because \(X\) was previously written into all of these locations, we can be sure of getting back \(1{,}000\) copies of it. We’ll also receive about \(10{,}000\) copies of other vectors, stored in locations whose access circles overlap that of \(X\). But because the overlaps are small, each of these vectors is present in only a few copies. In the aggregate, then, each of their \(1{,}000\) bits is equally likely to be a \(0\) or a \(1\). If we apply a majority-rule function to all the data gathered at each bit position, the result will be dominated by the \(1{,}000\) copies of \(X\). The probability of getting any result other than \(X\) is about \(10^{-19}\).

The bitwise majority-rule procedure is shown in more detail below, for a toy example of five data vectors of 20 bits each. The output is another vector where each bit reflects the majority of the corresponding bits in the data vectors. (If the number of data vectors is even, ties are broken by choosing \(0\) or \(1\) at random.) An alternative writing-and-reading scheme, also illustrated below, forgoes storing all the patterns individually and instead keeps a tally of the number of \(0\) and \(1\) bits at each position. A hard location has a \(1{,}000\)-bit counter, initialized to all \(0\)s. When a pattern is written into the location, each bit counter is incremented for a \(1\) or decremented for a \(0\). The readout algorithm simply examines the sign of each bit counter, returning \(1\) for positive, \(0\) for negative, and a random value when the counter bit is \(0\).

Majority rule

The two storage schemes give identical results.

From a computer-engineering point of view, this version of sparse distributed memory looks like an elaborately contrived joke. To remember \(10{,}000\) items we need a million hard locations, in which we store a thousand redundant copies of every pattern. Then, in order to retrieve just one item from memory, we harvest data on \(11{,}000\) stored patterns and apply a subtle majority-rule mechanism to unscramble them. And all we accomplish through these acrobatic maneuvers is to retrieve a vector we already had. Conventional memory works with much less fuss: Both writing and reading access a single location.

But an SDM can do things the conventional memory can’t. In particular, it can retrieve information based on a partial or approximate cue. Suppose a vector \(Z\) is a corrupted version of \(X\), where \(100\) of the \(1{,}000\) bits have been altered. Because the two vectors are similar, the command fetch(Z) will probe many of the same sites where \(X\) is stored. At a Hamming distance of 100, \(X\) and \(Z\) can be expected to share about 300 hard locations. Because of this extensive overlap, the vector returned by fetch(Z)—call it \(Z^{\prime}\)—will be closer to \(X\) than \(Z\) is. Now we can repeat the process with the command fetch(Z′), which will return a result \(Z^{\prime\prime}\) even closer \(X\). After only a few iterations the procedure reaches \(X\) itself.

Kanerva shows that this converging sequence of recursive read operations will succeed with near certainty as long as the starting pattern is not too far from the target. In other words, there is a critical radius: Any probe of the memory starting at a location inside the critical circle will almost surely converge to the center, and do so rather quickly. An attempt to recover the stored item from outside the critical circle fails, as the recursive recall process wanders away into the middle distance. Kanerva’s analysis yields a critical radius of 209 bits for the canonical SDM. In other words, if you know roughly 80 percent of the bits, you can reconstruct the whole pattern.

The illustration below traces the evolution of recursive-recall sequences using initial cues that differ from a target \(X\) by \(0, 5, 10, 15 \dots 1{,}000\) bits. In this experiment all sequences starting at a distance of \(205\) or less converged to \(X\) in fewer than \(10\) iterations (blue trails). All sequences starting at a greater initial distance wandered aimlessly through the huge open spaces of the \(1{,}000\)-dimensional cube, staying roughly 500 bits from anywhere.

Convergence or divergence of the iterative reading process

The transition from convergent to divergent trajectories is not perfectly sharp, as shown in the bad-hair-day graphic below. Here we have zoomed in to look at the fate of trajectories beginning at displacements of \(175, 176, 177, \dots 225\) bits. All trails whose starting point is within 209 bits of the target are colored blue; those starting at a greater distance are red. Most of the blue trajectories converge, quickly going to zero distance, and most of the red ones don’t. Near the critical distance, however, there are lots of exceptions.

Detail of trajectories starting near the critical distance for convergence

The graph below offers yet another view of how initial distance from the target affects the likelihood of eventually converging on the correct memory address. At a distance of \(170\) bits almost all trials succeed; at \(240\) bits almost none do. The crossover point (where success and failure are equally likely) seems to lie at about \(203\) bits, a little lower than Kanerva’s result of \(209\). This discrepancy is not a mystery. Kanerva’s calculation assumes an access circle that encompasses exactly \(1{,}000\) hard locations. My experiments include all hard locations at a distance \(r \le 451\); on average there are \(1{,}070\) such locations.

Number of iterated-recall trajectories converging to the target location, as a function of initial distance from the target

The ability to reconstruct memories from partial information is a familiar element of human experience. You notice an actor in a television show, and you realize you’ve seen him before, but you don’t remember where. After a few minutes it comes to you: He’s Mr. Bates from Downton Abbey, but without his butler suit. Then there’s the high school reunion challenge: Looking at the stout, balding gentleman across the room, can you recognize the friend you last knew as a lanky teenager in track shorts? Sometimes, filling in the blanks requires a prolonged struggle. I have written before about my own inexplicable memory blind spot for the flowering vine wisteria, which I can name only after patiently working my way through a catalogue of false scents: hydrangea, verbena, forsythia.

Could our knack for recovering memories from incomplete or noisy inputs work something like the recursive recall process with high-dimensional vectors? It’s an attractive hypothesis, but there are also reasons for caution. For one thing, the brain seems to be able to tease meaning out of much skimpier clues. I don’t need to hear four-fifths of the Fifth Symphony before I recognize it; the first four notes will do. A flash of color moving through the trees instantly brings to mind the appropriate species—cardinal, bluejay, goldfinch. A mere whiff of chalkdust transports me back to the drowsy, overheated classroom where I doodled on the desktop all afternoon. These memories are evoked by a tiny fraction of the information they represent, far less than 80 percent.

Kanerva cites another quirk of human memory that might be modeled by an SDM: the tip-of-the-tongue phenomenon, whose essence is that you know you know something, even though you can’t immediately name it. This feeling is a bit mysterious: If you can’t find what you’re looking for, how do you know it’s there? The recursive recall process of the SDM offers a possible answer. When the successive patterns retrieved from memory are getting steadily closer together, you can be reasonably sure they will converge on a target, even before they get there.

In the struggle to retrieve a stubborn fact from memory, many people find that banging on the same door repeatedly is not a wise strategy. Rather than demanding immediate answers—getting bossy with your brain—it’s often better to set the problem aside, go for a walk, maybe even take a nap; the answer may then come to you, seemingly unbidden. Can this observation be explained by the SDM model? Perhaps, at least in part. If a sequence of recalled patterns is not converging, pursuing it further is probably fruitless. Starting over from a nearby point in the memory space might lead to a better outcome. But there’s a conundrum here: How do you find a new point of departure with better prospects? You might think you could just randomly flip a few bits in the input pattern in the hope that you’ll wind up closer to the target, but this is unlikely to work. If a vector is \(250\) bits from the target, then \(750\) bits are already correct (but you don’t know which \(750\) bits); any random change has a \(3/4\) chance of moving farther away rather than closer. To make progress you need to know which way to turn, and that’s a tricky question in \(1{,}000\)-dimensional space.

One aspect of the SDM architecture that seems to match human experience is the effect of repetition or rehearsal on memory. If you repeatedly recite a poem or practice playing a piece of music, you expect to remember it more easily in the future. A computational model of memory ought to exhibit the same training effect. Conventional computer memory certainly does not: There’s no benefit to writing the same value multiple times at the same address. With an SDM, in contrast, each repetition of a pattern adds another copy to all the hard locations within the pattern’s access circle. As a result, there’s less interference from overlapping patterns, and the critical radius for recall is enlarged. The effect is dramatic: When a single extra copy of a pattern is written into the memory, the critical radius grows from about \(200\) bits to more than \(300\).

By the same token, increasing the representation of one pattern can make others harder to recover. This is a form of forgetting, as the heavily imprinted pattern crowds out its neighbors and takes over part of their territory. This effect is also dramatic in the SDM—unrealistically so. A vector stored eight or ten times seems to monopolize most of the memory; it becomes an obsession, the answer to all questions.

A notable advantage of sparse distributed memory is its resilience in the face of hardware failures or errors. I would be unhappy with my own brain if the loss of a single neuron could leave a hole in my memory, so that I could no longer recognize the letter g or remember how to tie my shoelaces. SDM does not suffer from such fragility. With a thousand copies of every stored pattern, no one site is essential. Indeed, it’s possible to wipe out all information stored in \(60\) percent of the hard locations and still get perfect recall of \(10{,}000\) stored items, assuming you supply the exact address as the cue. With partial cues, the critical radius contracts as more sites are lost. After destroying \(60\) percent of the sites, the critical radius shrinks from \(200+\) bits to about \(150\) bits. With \(80\) percent of the sites gone, memory is seriously degraded but not extinguished.

And what about woolgathering? Can we traipse idly through the meadows of sparse distributed memory, serendipitously leaping from one stored pattern to the next? I’ll return to this question.

Most of the narrative above was written several weeks ago. At the time, I was reading about various competing theories of memory, and discussing their merits with my colleagues at the Simons Institute. I wrote up my thoughts on the subject, but I held off publishing because of nagging doubts about whether I truly understood the mathematics of sparse distributed memory. I’m glad I waited.

The Brain and Computation program ended in May. The participants have scattered; I am back in New England, where sage and rosemary are small potted plants rather than burgeoning shrubs spilling over the sidewalk. My morning strolls to the Berkeley campus, a daily occasion for musing about the nature of memory and learning, have themselves become “engrams” stored somewhere in my head (though I still don’t know where to look for them).

I have not given up the quest. Since I left Berkeley I’ve continued reading on theories of memory. I’ve also been writing programs to explore Pentti Kanerva’s sparse distributed memory and his broader ideas on “hyperdimensional computing.” Even if this project fails to reveal the secrets of human memory, it is certainly teaching me something about the mathematical and computational art of navigating high-dimensional spaces.

The diagram below represents the “right” way to implement SDM, as I understand it. The central element is a crossbar matrix in which the rows correspond to the memory’s hard locations and the columns carry signals representing the individual bits of an input vector. The canonical memory has a million rows, each with a randomly assigned \(1{,}000\)-bit address, and \(1{,}000\) columns; this toy version has 20 rows and 8 columns.

A hardware implementation of SDM as a crossbar switc

The process illustrated in the diagram is the storage of a single input vector in an otherwise empty memory. The eight input bits are compared simultaneously with all \(20\) hard-location addresses. Wherever an input bit and an address bit match—\(0\) with \(0\) or \(1\) with \(1\)—we place a dot at the intersection of the column and the row. Then we count the number of dots in each row, and if the count meets or exceeds a threshold, we write the input vector into the register associated with that row (blue boxes). In the example shown, the threshold is \(5\), and \(8\) of the \(20\) addresses have at least \(5\) matches. In the \(1{,}000\)-bit memory, the threshold would be \(451\), and only about a thousandth of the registers would be selected.

The magic in this design is that all of the bit comparisons—a billion of them in the canonical model—happen concurrently. As a result, the access time for both reading and writing is independent of the number of hard locations, and can be very fast. Circuitry of this general type, known as an associative memory or content-addressable memory, has a role in certain specialized computing applications, such as triggering the particle detectors at the Large Hadron Collider and steering packets through the routers of the internet backbone. And the circuit diagram might also be plausibly mapped onto certain structures in the brain. Kanerva points out that the cerebellum looks a lot like such a matrix. The rows are flat, fanlike Purkinje cells, arranged like the pages of a book; the columns are parallel fibers threaded through the whole population of Purkinje cells. (However, the cerebellum is not the region of the mammalian brain where cognitive memory is thought to reside.)

It would be wonderful to build an SDM simulation based on this crossbar design; unfortunately, I don’t know how to do that with any computer hardware I can lay my hands on. A conventional processor offers no way to compare all the input bits with all the hard-location bits simultaneously. Instead I have to scan through a million hard locations one by one, and at each location compare a thousand pairs of bits. That’s a billion bit comparisons for every item stored into or retrieved from the memory. Add to that the time needed to write or read a million bits (a thousand copies of a \(1{,}000\)-bit vector), and we’re talking about quite a lumbering process. Here’s the code for storing a vector:

function store(v::BitVector)
    for loc in SDM
        if hamming_distance(v, loc.address) <= r
            write_to_register!(loc.register, v)

This implementation needs almost an hour to stock the memory with \(10{,}000\) remembered patterns. (The complete program, in the form of a Jupyter notebook, is available on GitHub.)

Is there a better algorithm for simulating the SDM on conventional hardware? One possible strategy avoids repeatedly searching for the set of hard locations within the access circle of a given vector; instead, when the vector is first written into the memory, the program keeps a pointer to each of the thousand-or-so locations where it is stored. On any future reference to the same vector, the program can just follow the \(1{,}000\) saved pointers rather than scanning the entire array of a million hard locations. The cost of this caching scheme is the need to store all those pointers—\(10\) million of them for the canonical SDM. Doing so is feasible, and it might be worthwhile if you only wanted to store and retrieve exact, known values. But think about what happens in response to an approximate memory probe, with the recursive recall of \(Z^{\prime}\) and \(Z^{\prime\prime}\) and \(Z^{\prime\prime\prime}\), and so on. None of those intermediate values will be found in the cache, and so the full scan of all hard locations is still needed.

Perhaps there’s a cleverer shortcut. A recent review article on “Approximate Nearest Neighbor Search in High Dimensions,” by Alexandr Andoni, Piotr Indyk, and Ilya Razenshteyn, mentions an intriguing technique called locality sensitive hashing, but I can’t quite see how to adapt it to the SDM problem.

The ability to reconstruct memories from partial cues is a tantalizingly lifelike trait in a computational model. Perhaps it might be extended to yield a plausible mechanism for wandering idly through the chambers of memory, letting one idea lead to the next.

At first I thought I knew how this might work. A pattern \(X\) stored in the SDM creates a basin of attraction around itself, where any recursive probe of the memory starting within a critical radius will converge to \(X\). Given \(10{,}000\) such attractors, I can imagine them partitioning the memory space into a matrix of separate compartments, like a high-dimensional foam of soap bubbles. The basin for each stored item occupies a distinct volume, surrounded on all sides by other basins and bumping up against them, with sharp boundaries between adjacent domains. In support of this notion, I would note that the average radius of a basin of attraction shrinks when more content is poured into the memory, as if the bubbles were being compressed by overcrowding.

This vision of what’s going on inside the SDM suggests a simple way to drift from one domain to the next: Randomly flip enough bits in a vector to take it outside the present basin of attraction and into an adjacent one, then apply the recursive recall algorithm. Repeating this procedure will generate a random walk through the set of topics stored in the memory.

The only trouble is, it doesn’t work. If you try it, you will indeed wander aimlessly in the \(1{,}000\)-dimensional lattice, but you will never find anything stored there. The entire plan is based on a faulty intuition about the geometry of the SDM. The stored vectors with their basins of attraction are not tightly packed like soap bubbles; on the contrary, they are isolated galaxies floating in a vast and vacant universe, with huge tracts of empty space between them. A few calculations show the true nature of the situation. In the canonical model the critical radius defining the basin of attraction is about \(200\). The volume of a single basin—measured as the number of vectors inside it—is

$$\sum_{k = 1}^{200} \binom{1000}{k},$$

which works out to roughly \(10^{216}\). Thus all \(10{,}000\) basins occupy a volume of \(10^{220}\). That’s a big number, but it’s still a tiny fraction of the \(1{,}000\)-dimensional cube. Among all the vertices of the cube, only \(1\) out of \(10^{80}\) lies within 200 bits of a stored pattern. You could wander forever without stumbling into one of those basins.

(Forever? Oh, all right, maybe not forever. Because the hypercube is a finite structure, any path through it must eventually become recurrent, either hitting a fixed point from which it never escapes or falling into a repeating cycle. The stored vectors are fixed points, and there are also many other fixed points that don’t correspond to any meaningful pattern. For what it’s worth, in all my experiments with SDM programs, I have yet to run into a stored pattern “by accident.”)

Hoping to salvage this failed idea, I tried a few more experiments. In one case I deliberately stored a bunch of related concepts at nearby addresses (“nearby” meaning within 200 or 300 bits). Within this cluster, perhaps I could skip blithely from point to point. But in fact the entire cluster congealed into one big basin of attraction for the central pattern, which thus became a black hole swallowing up all its companions. I also tried fiddling with the value of \(r\), the radius of the access circle for all reading and writing operations. In the canonical model \(r = 451\). I thought that writing to a slightly smaller circle or reading from a slightly larger one might allow some wiggle room for randomness in the results, but this hope was also disappointed.

All of these efforts were based on a misunderstanding of high-dimensional vector spaces. Trying to find clusters of nearby values in the hypercube is hopeless; the stored patterns are sprinkled much too sparsely throughout the volume. And deliberately creating dense clusters is pointless, because it destroys the very property that makes the system interesting—the ability to converge on a stored item from any point in the surrounding basin of attraction. If we’re going to create a daydreaming algorithm for the SDM, it will have to work some other way.

In casting about for an alternative daydreaming mechanism, we might consider smuggling some graph theory into the world of sparse distributed memory. Then we could take a step back toward the original idea of mental rambling as a random walk on a graph or network. The key to building such graphs in the SDM turns out to be a familiar tool: the exclusive OR operator.

As discussed above, the Hamming distance between two vectors is calculated by taking their bitwise XOR and then counting the \(1\)s in the result. But the XOR operation provides more information than just the distance between two vectors; it also reveals the orientation or direction of the line that joins them. Specifically, the operation \(u \veebar v\) yields a vector that lists the bits that need to be changed to transform \(u\) into \(v\) or vice versa. You might also think of the \(1\)s and \(0\)s in the XOR vector as a sequence of directions to be followed to trace a path from \(u\) to \(v\).

XOR has always been my personal favorite among the Boolean functions. It is a difference operator, but unlike subtraction, XOR is symmetric: \(u \veebar v = v \veebar u\). Furthermore, XOR is its own inverse. This concept is easy to understand with functions of a single argument: \(f(x)\) is its own inverse if \(f(f(x)) = x\), so that applying the function twice you can get back to where you started. For a two-argument function such as XOR the situation is more complicated, but it’s still true that doing the same thing twice restores the original state. Specifically, if \(u \veebar v = w\), then \(u \veebar w = v\) and \(v \veebar w = u\). The three vectors \(u\), \(v\), and \(w\) form a tiny, closed universe. You can apply the XOR operator to any pair of them and you’ll get back the third element of the set. Below is my attempt to illustrate this idea. Each square represents a \(10{,}000\)-bit vector arranged as a \(100\)-by-\(100\) tableau of light and dark pixels. The three patterns appear to be random and independent, but hovering with the mouse pointer will show that each panel is in fact the XOR of the other two. For example, in the leftmost square, each red pixel matches either a green pixel or a blue pixel, but not both.

The self-inverse property suggests a new way of organizing information in the SDM. Suppose the word butterfly and its French equivalent papillon are stored as arbitrary, random vectors. They will not be close together; the distance between them is likely to be about 500 bits. Now we compute the XOR of these vectors, butterflypapillon; the result is another vector that can also be stored in the SDM. This new vector encodes the relation English-French. Now we are equipped to translate. Given the vector for butterfly, we XOR it with the English-French vector and get papillon. The same trick works in the other direction.

This pair of words and the relation between them forms the nucleus of a semantic network. Let’s grow it a little. We can store the word caterpillar at an arbitrary address, then compute butterflycaterpillar and call this new relation adult-juvenile. What’s the French for caterpillar? It’s chenille. We add this fact to the network by storing chenille at the address caterpillarEnglish-French. Now some magic happens: If we take papillonchenille, we’ll learn that these words are connected by the relation adult-juvenile, even though we did not explicitly state that fact. It is a constraint imposed by the geometry of the construction.

Network diagram with butterfly, papillon, caterpillar, chenille as vertices of a parallelogram, and pairs of edges labeled English-French and adult-juvenile.

The graph could be extended further by adding more English-French cognates (dog-chien, horse-cheval) or more adult-juvenile pairs: (dog-puppy, tree-sapling). And there are plenty of other relations to be explored: synonyms, antonyms, siblings, cause-effect, predator-prey, and so on. There’s also a sweet way of linking a set of events into a chronological sequence, just by XORing the addresses of a node’s predecessor and successor.

The XOR method of linking concepts is a hybrid of geometry and graph theory. In ordinary mathematical graph theory, distances and directions are irrelevant; all that matters is the presence or absence of connecting edges between nodes. In the SDM, on the other hand, the edge representing a relation between nodes is a vector of definite length and orientation within the \(1{,}000\)-dimensional space. Given a node and a relation, the XOR operation “binds” that node to a specific position elsewhere in the hypercube. The resulting structure is completely rigid; you can’t move a node without changing all the relations it participates in. In the case of the butterflies and caterpillars, the configuration of four nodes is necessarily a parallelogram, with pairs of opposite sides that have the same length and orientation.

Another distinctive feature of the XOR-linked graph is that the nodes and the edges have exactly the same representation. In most computer implementations of graph-theoretical ideas, these two entities are quite different; a node might be a list of attributes, and an edge would be a pair of pointers to the nodes it connects. In the SDM, both nodes and edges are simply high-dimensional vectors. Both can be stored in the same format.

As a model of human memory, XOR binding offers the prospect of connecting any two concepts through any relation we can invent. But the scheme also has some deficiencies. Many real-world relations are asymmetric; they don’t share the self-inverse property of XOR. An XOR vector can declare that Edward and Victoria are parent and child, but it can’t tell you which is which. Worse, the XOR vector connects exactly two nodes, never more, so a parent of multiple children presents faces an awkward predicament. Another challenge is keeping all the branches of a large graph consistent with one another. You can’t just add nodes and edges willy-nilly; they must be joined to the graph in the right order. Inserting a pupal stage between the butterfly and the caterpillar would require rewiring most of the diagram, moving several nodes to new locations within the hypercube and recalculating the relation vectors that connect them, all the while taking care that each change on the English side is mirrored correctly on the French side.

Some of these issues are addressed in another XOR-based technique that Kanerva calls bundling. The idea is to create a kind of database by storing attribute-value pairs. An entry for a book might have attributes such as author, title, and publisher, each of which is paired with a corresponding value. The first step in bundling the data is to separately XOR each attribute-value pair. Then the vectors resulting from these operations are combined to form a single sum vector, using the same algorithm described above for storing multiple vectors in a hard location of the SDM. Taking the XOR of an attribute name with this combined vector will extract an approximation to the corresponding value, close enough to identify it by the recursive recall method. In experiments with the canonical model I found that a single \(1{,}000\)-bit vector could hold six or seven attribute-value pairs without much risk of confusion.

Binding and bundling are not mentioned in Kanerva’s 1988 book, but he discusses them in detail in several more recent papers. (See Further Reading, below.) He points out that with these two operations the set of high-dimensional vectors acquires the structure of an algebraic field—or at least an approximation to a field. The canonical example of a field is the set of real numbers together with the operations of addition and multiplication and their inverses. The reals form a closed set under these operations: Adding, subtracting, multiplying or dividing any two real numbers yields another real number (except for division by zero, which is always the joker in the pack). Likewise a set of binary vectors is closed under binding and bundling, except that sometimes the result extracted from a bundled vector has to be “cleaned up” by the recursive recall process in order to recover a member of the set.

Can binding and bundling offer any help when we try to devise a woolgathering algorithm? They provide some basic tools for navigating through a semantic graph, including the possibility of performing a random walk. Starting from any node of an XOR-linked graph, a random-walk algorithm chooses from among all the relations available at that node. Selecting a relation vector at random and XORing it with the address of the node leads to a different node, where the procedure can be repeated. Similarly, in bundled attribute-value pairs, a randomly selected attribute calls forth the corresponding value, which becomes the next node to explore.

But how does the algorithm know which relations or which attributes are available for choosing? The relations and attributes are represented as vectors and stored in the memory just like any other objects, but there is no obvious means of retrieving those vectors unless you already know what they are. You can’t say to the memory, “Show me all the relations.” You can only present a pattern and ask, “Is this vector present? Have you seen it or something like it?”

With a conventional computer memory, you can do a core dump: Step through all the addresses and print out the value found at each location. There’s no such procedure for a distributed memory. I learned this troubling fact the hard way. While building a computational model of the SDM, I got the pieces working well enough that I could store a few thousand randomly generated patterns in the memory. But I could not retrieve them, because I didn’t know what to ask for. The solution was to maintain a separate list, outside the SDM itself, keeping a record of everything I stored. But it seems farfetched to suppose that the brain would maintain both a memory and an index to that memory. Why not just use the index, which is so much simpler?

In view of this limitation, it seems that sparse distributed memory is equipped to serve the senses but not the imagination. It can recognize familiar patterns and store novel ones, which will then be recognized when next encountered, even from partial or corrupted cues. With binding or bundling, the memory can also keep track of relations between pairs of stored items. But whatever is put into the memory can be gotten out only by supplying a suitable cue.

The Graduate, publicity poster

When I look at the publicity poster for The Graduate, I see Dustin Hoffman, more leery than leering, regarding the stockinged leg of Anne Bancroft, who plays Mrs. Robinson. This visual stimulus excites several subsets of neurons in my cerebral cortex, corresponding to my memories of the actors, the characters, the story, the soundtrack, the year 1967. All of this brain activity might be explained by the SDM memory architecture, if we grant that subsets of neurons can be represented in some abstract way by long, random binary vectors. What’s not so readily explained is how I can summon to mind all the same sensations without having the image in front of me. How do I draw those particular long, random sequences out of the great tangle of vectors without already knowing where they are?

So ends my long ramble, on a note of doubt and disappointment. It’s hardly surprising that I have failed to get to the bottom of it all. These are deep waters.

On the very first day of the Simons brain-and-computation program, Jeff Lichtman, who is laboring to trace the wiring diagram of the mouse brain, asked whether neuroscience has yet had its Watson-Crick moment. In molecular genetics we have reached the point where we can extract a strand of DNA from a living cell and read many its messages. We can even write our own messages and put them back into an organism. The equivalent capability in neuroscience would be to examine a hunk of brain tissue and read out the information stored there—the knowledge, the memories, the world view. Maybe we could also write information directly into the brain.

Science is not even close to achieving this feat—to the great relief of many. That includes me: I don’t look forward to having my thoughts sucked out of my head through electrodes or pipettes, to be replaced with #fakenews. However, I really do want to know how the brain works.

The Simons program left me dazzled by recent progress in neuroscience, but it also revealed that some of the biggest questions remain wide open. The connectomics projects of Lichtmann and others are producing a detailed map of millions of neurons and their interconnections. New recording techniques allow us to listen in on the signals emitted by individual nerve cells and to follow waves of excitation across broad regions of the brain. We have a pretty comprehensive catalogue of neuron types, and we know a lot about their physiology and biochemistry. All this is impressive, but so are the mysteries. We can record neural signals, but for the most part we don’t know what they mean. We don’t know how information is encoded or stored in the brain. It’s rather like trying to understand the circuitry of a digital computer without knowing anything of binary arithmetic or Boolean logic.

Pentti Kanerva’s sparse distributed memory is an attempt to fill in some of these gaps. It is not the only such attempt. A better-known alternative is John Hopfield’s conception of a neural network as a dynamical system settling into an energy-minimizing attractor. The two ideas have some basic principles in common: Information is scattered across large numbers of neurons, and it is encoded in a way that would not be readily understood by an outside observer, even one with access to all the neurons and the signals passing between them. Schemes of this kind, essentially mathematical and computational, occupy a conceptual middle ground between high-level psychology and low-level neural engineering. It’s the layer where the meaning is.

Further Reading

Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences 79(8):2554–2558.

Kanerva, Pentti. 1988. Sparse Distributed Memory. Cambridge, Mass.: MIT Press.

Kanerva, Pentti. 1996. Binary spatter-coding of ordered K-tuples. In C. von der Malsburg, W. von Seelen, J. C. Vorbruggen and B. Sendhoff, eds. Artificial Neural Networks—ICANN 96 Proceedings, pp. 869–873. Berlin: Springer.

Kanerva, Pentti. 2000. Large patterns make great symbols: An example of learning from example. In S. Wermter and R. Sun, eds. Hybrid Neural Systems, pp. 194–203. Heidelberg: Springer. PDF

Kanerva, Pentti. 2009. Hyperdimensional computing: An introduction to computing in distributed representation with high-dimensional random vectors. Cognitive Computation 1(2):139–159. PDF

Kanerva, Pentti. 2010. What we mean when we say “What’s the Dollar of Mexico?”: Prototypes and mapping in concept space. Report FS-10-08-006, AAAI Fall Symposium on Quantum Informatics for Cognitive, Social, and Semantic Processes. PDF

Kanerva, Pentti. 2014. Computing with 10,000-bit words. Fifty-second Annual Allerton Conference, University of Illinois at Urbana-Champagne, October 2014. PDF

Plate, Tony. 1995. Holographic reduced representations. IEEE Transactions on Neural Networks 6(3):623–641. PDF

Plate, Tony A. 2003. Holographic Reduced Representation: Distributed Representation of Cognitive Structure. Stanford, CA: CSLI Publications.

Rahimi, Abbas, Sohum Datta, Denis Kleyko, E. Paxon Frady, Bruno Olshausen, Pentti Kanerva, and Jan M. Rabaey. 2017. High-dimensional computing as a nanoscalable paradigm. IEEE Transactions on Circuits and Systems 64(9):2508–2521. Preprint PDF

Posted in biology, computing, neuroscience | 4 Comments

Midcentury Monuments

Three serious books lie open before me. I had a variety of reasons for checking them out of the library, although they’re all related in one way or another to current goings on here at the Simons Institute:

Three tomes

  • Donald O. Hebb’s The Organization of Behavior: A Neuropsychological Theory. This is the book that introduced a fundamental hypothesis about learning and memory, captured in the slogan “Neurons that fire together get wired together.”
  • Norbert Wiener’s Cybernetics: or, Control and Communication in the Animal and the Machine, an eccentric and wide-ranging masterpiece with a crucial chapter on “Computing Machines and the Nervous System.”
  • Claude Shannon’s The Mathematical Theory of Communication, the foundational document of information theory. (Shannon’s part of this work had appeared a year earlier in the Bell System Technical Journal; the book version includes an interpretive essay by Warren Weaver.)

When I got the three volumes home, I made a surprising discovery: They were all published at roughly the same time, in 1948 and 1949. What are the odds of that? Perhaps it means nothing—just the long arm of coincidence reaching out to tap me on the shoulder. On the other hand, maybe there was something in the air circa 1950, something that made the period unusually fertile for studies of information, communication, and computation in brains and machines.

I have done a little digging in library catalogues and Wikipedia, as well as in my own files, looking for other titles that might belong on this list of distinguished midcentury milestones.

It turns out that George Kingsley Zipf’s Human Behavior and the Principle of Least Effort was also published in 1949. (This is the one about the curious power-law distribution seen in rankings of word frequencies, city sizes, and so on.)

Gilbert Ryle’s The Concept of Mind is another 1949 title, though I’ve never read it. Also from 1949: Nicholas Metropolis and Stanislaw Ulam published the first open account of the Monte Carlo method.

Drifting forward into 1950, we find another cluster of notables. There is John Nash’s one-page paper introducing what we now call the Nash equilibrium. Elsewhere in game theory, 1950 was the debut year for prisoner’s dilemma, although Merrill Flood’s paper describing it did not appear until two years later. Richard Hamming published “Error Detecting and Error Correcting Codes” in 1950. (It’s another paper from the Bell System Technical Journal.) Finally, there’s Alan M. Turing’s famous essay on “Computing Machinery and Intelligence.”

Does the density of high-octane publications really make 1948–50 an exceptional season of intellectual history? I can’t offer any solid statistical support for that notion. In the first place, my criteria for inclusion on the list are way too vague. (“Subjects I find interesting” may be closest to the truth.) In the second place, I can’t offer any evidence that other intervals were not equally productive. As a matter of fact, in my bibliographic rummaging I came across a nexus of brilliance five years earlier:

  • Warren S. McCollough and Walter H. Pitts, “A logical calculus of the ideas immanent in nervous activity,” 1943.
  • John von Neumann and Oskar Morgenstern, Theory of Games and Economic Behavior, 1944.
  • Erwin Schrödinger, What Is Life? The Physical Aspect of the Living Cell, 1944.
  • Vannevar Bush, “As We May Think,” 1945.
  • John von Neumann, “First Draft of a Report on the EDVAC,” 1945

I acknowledge a further reason for caution when I cite 1949 as a year of special distinction. It’s my year, the year of my birth.

Posted in uncategorized | 6 Comments

Notes from JMM 2018

The annual Joint Mathematics Meeting always charges my batteries. Here are few items from this year’s gathering in San Diego.

San Diego Convention Ctr during JMM 2018

A Formal Affair

In 1994 a document called the QED Manifesto made the rounds of certain mathematical mailing lists and Usenet groups.

QED is the very tentative title of a project to build a computer system that effectively represents all important mathematical knowledge and techniques. The QED system will conform to the highest standards of mathematical rigor, including the use of strict formality in the internal representation of knowledge and the use of mechanical methods to check proofs of the correctness of all entries in the system.

The ambitions of the QED project—and its eventual failure—were front and center in a talk by Thomas Hales (University of Pittsburgh) on Formal Abstracts in Mathematics. Hales is proposing another such undertaking: A comprehensive database of theorems and other mathematical propositions, along with the axioms, assumptions, and definitions on which the theorems depend, all represented in a formal notation readable by both humans and machines. Unlike QED, however, these “formal abstracts” would not include proofs of the theorems. Excluding proofs is a huge retreat from the aims of the QED group, but Hales argues that it’s necessary to make the project feasible with current technology.

Hales has plenty of experience in this field. In 1998 he announced a proof of the Kepler conjecture—the assertion that the grocer’s stack of oranges embodies the densest possible arrangement of equal-size spheres in three-dimensional space. Hales’s proof was long and complex, so much so that it stymied the efforts of journal referees to fully check it. Hales and 21 collaborators then spent a dozen years constructing a formal, computer-mediated verification of the proof.

What’s the use of a database of mathematical assertions if it doesn’t include proofs? Hales held out several potential benefits, two of which I found particularly appealing. First, the database could answer global questions about the mathematical literature; one could ask, “How many theorems depend on the Riemann hypothesis?” Second, the formal abstracts would capture the meaning of mathematical statements, not just their surface form. A search for all mentions of the equation \(x^m - y^n = 1\) would find instances that use symbols other than \(x, y, m, n,\) or that take slightly different forms, such as \(x^m - 1 = y^n\).

Hales’s formal abstracts sound intriguing, but I have to confess to a certain level of disappointment and bafflement. All around us, triumphant machines are conquering one domain after another—chess, go, poker, Jeopardy, the driver’s seat. But not proofs, apparently.

Sperner’s Lemma

Am I the last person in the whole republic of numbers to learn that Sperner’s lemma is a discrete version of the Brouwer fixed-point theorem? Francis Su and John Stillwell clued me in.

The lemma—first stated in 1928 by the German mathematician Emanuel Sperner—seems rather narrow and specialized, but it turns up everywhere. It concerns a triangle whose vertices are assigned three distinct colors:

Sperner-colored triangle

Divide the triangle into smaller triangles, constrained by two rules. First, no edge or segment of an edge can be part of more than two triangles. Second, if a vertex of a new small triangle lies on an edge of the original main triangle, the new vertex must be given one of the two colors found at the end points of that main edge. For example, a vertex along the red-green edge on the left side of the main triangle must be either red or green. Vertices strictly inside the main triangle can be given any of the three colors, without restriction.

Triangulated triangle, with each vertex colored red, blue, or green.

The lemma states that at least one interior triangle must have a full complement of red, green, and blue vertices. Actually, the lemma’s claim is slightly stronger: The number of trichromatic inner triangles must be odd. In the augmented diagram below, adding a single new red vertex has created two more RGB triangles, for a total of three.

sperner-colored triangle with three trichormatic interior triangles

Su gave a quick proof of the lemma. Consider the set of all edge segments that have one red and one green endpoint. On the exterior boundary of the large triangle, such segments can appear only along the red-green edge, and there must be an odd number of them. Now draw a path that enters the large triangle from the outside, that crosses only red-green segments, and that crosses each such segment at most once.

Sperner triangle with paths

One possible fate of this RG path is to enter through one red-green segment and exit through another. But since the number of red-green segments on the boundary is odd, there must be at least one path that enters the large triangle and never exits. The only way it can become trapped is to enter a red-green-blue triangle. (There’s nothing special about red-green segments, so this argument also holds for paths crossing red-blue and blue-green segments.)

So much for Sperner’s lemma. What do these nested triangles have to do with the Brouwer fixed-point theorem? That theorem operates in a continuous domain, which seems remote from the discrete network of Sperner’s triangulated triangle.

As the story goes (I can’t vouch for its provenance), L. E. J. Brouwer formulated his theorem at the breakfast table. Stirring his coffee, he noticed that there always seemed to be at least one stationary point on the surface of the moving liquid. He was able to prove this fact not just for the interior of a coffee cup but for any bounded, closed, and convex region, and not just for circular motion but for any continuous function that maps points within such a region to points in the same region. For each such function \(f\), there is a point \(p\) such that \(f(p) = p\).

Brouwer’s fixed-point theorem was a landmark in the development of topology, and yet Brouwer himself later renounced the theorem—or at least his proof of it, because the proof was nonconstructive: It gave no procedure for finding or identifying the fixed point. John Stillwell argues that a proof based on Sperner’s lemma comes as close as possible to a constructive proof, though it would still have left Brouwer unsatisfied.

The proof relies on the same kind of paths represented by yellow arrows in the diagram above. At least one such path comes to an end inside a tri-colored triangle, which Sperner’s lemma shows must exist in any properly colored triangulated network. If we continue subdividing the triangles under the Sperner rules, and proceed to the limit where the edge lengths go to zero, then the path ends at a single, stationary point. (It’s the “proceed to the limit” step that Brouwer would not have liked.)

The Muffin Man

You have five muffins to share among three students; lets call the students April, May, and June. One solution is to give each student one whole muffin, then divide the remaining two muffins into pieces of size one-third and two-thirds. Then the portions are divvied up as follows:

Five muffins allocated to three students with slices of size 1/3 and 2/3

This allotment is quantitatively fair, in that each student receives five-thirds of a muffin, but June complains that her two small pieces are less appetizing than the others’ larger ones. She feels she’s been given leftover crumbs. Hence the division is not envy-free.

There are surely many ways of addressing this complaint. You might cut all the muffins into pieces of size one-third, and give each student five equal pieces. Or you might give each student a muffin and a half, then eat the leftover half yourself. These are practical and sensible strategies, but they are not what Bill Gasarch was seeking when he gave a talk on the problem Saturday afternoon. Gasarch asked a specific question: What is the maximum size of the minimum piece? Can we do better than one-third?

The answer is yes. Here is a division that cuts one muffin in half and divides each of the other four muffins into portions of size seven-twelfths and five-twelfths. April and May each get \(\frac{1}{2} + \frac{7}{12} + \frac{7}{12}\); June gets \(4 \times \frac{5}{12}\).

a fair division of five muffins into three portions with smallest piece 5/12.

Five-twelfths is larger than one-third, and thus should seem less crumby. Indeed, Gasarch and his colleagues have proved five-twelfths is the best result possible: It is the maximum of the minimum. (Nevertheless, I worry that June may still be unhappy. Her portion is cut up into four pieces, whereas the others get three pieces each; furthermore, all of June’s pieces are smaller than April’s and May’s. Again, however, these concerns lie outside the scope of the mathematical problem.)

A key observation is that the smallest piece can never be larger than one-half. This is thunderously obvious once you know it, but I failed to see it when I first started thinking about the problem.

Fair-division problems have a long history (going back at least as far as the Talmud), and cake-cutting versions have been proliferating for decades. A 1961 article by L. E. Dubins and E. H. Spanier (American Mathematical Monthly 68:1–17) inspired much further work. William GasarchThere are even connections with Sperner’s lemma. Nevertheless, the genre is not exhausted yet; the muffin problem seems to be a new wrinkle. Gasarch and six co-authors (three of them high school students) have prepared a 166-page manuscript describing a year’s worth of labor on the problem, with optimal results for all instances with up to six students (and any number of muffins), as well as upper and lower bounds on solutions to larger instances, and various conjectures on open problems.

Long-time readers of bit-player may remember that Gasarch has been mentioned here before. Back in 2009 he offered (and eventually paid) \($17^2\) for a four-coloring of a 17-by-17 lattice such that no four lattice points forming a rectangle all have the same color. That problem attracted considerable attention both here and on Gasarch’s own Computational Complexity blog (conducted jointly with Lance Fortnow).

Note: In the comments Jim Propp points out that the muffin problem was invented by Alan Frank. The omission of this fact is my fault; Gasarch mentions it in his paper. The problem’s first appearance in print seems to be in a New York Times Numberplay column by Gary Antonick. Frank’s priority is acknowledged only in a footnote, which seems unfair. I apologize for again giving him credit only as an afterthought.

Posted in mathematics, problems and puzzles | 3 Comments

Flipping Wyoming

state border signs for the dozen states from MA to CA on I-80 (there is no "wlecome to Nebraska" sign, so I made do with "welcome to Omaha"

Last week I spent five days in the driver’s seat, crossing the country from east to west, mostly on Interstate 80. I’ve made the trip before, though never on this route. In particular, the 900-mile stretch from Lincoln, Nebraska, across the southern tier of Wyoming, and down to Salt Lake City was new to me.

Driving is a task that engages only a part of one’s neural network, so the rest of the mind is free to wander. On this occasion my thoughts took a political turn. After all, I was boring through the bright red heart of America. Especially in Wyoming.

Based on the party affiliations of registered voters, Wyoming is far and away the most Republican state in the union, with the party claiming the allegiance of two-thirds of the electorate. The Democrats have 18 percent. A 2013 Gallup poll identified Wyoming as the most “conservative” state, with just over half those surveyed preferring that label to “moderate” or “liberal.”

The other singular distinction of Wyoming is that it has the smallest population of all the states, estimated at 579,000. The entire state has fewer people than many U.S. cities, including Albuquerque, Milwaukee, and Baltimore. The population density is a little under six people per square mile.

I looked up these numbers while staying the night in Laramie, the state’s college town, and I was mulling them over as I continued west the next morning, climbing through miles of rolling grassland and sagebrush with scarcely any sign of human habitation. A mischievous thought came upon me. What would it take to flip Wyoming? If we could somehow induce 125,000 liberal voters to take up legal residence here, the state would change sides. We’d have two more Democrats in the Senate, and one more in the House. Berkeley, California, my destination on this road trip, has a population of about 120,000. Maybe we could persuade everyone in Berkeley to give up Chez Panisse and Moe’s Books, and build a new People’s Republic somewhere on Wyoming’s Medicine Bow River.

Let me quickly interject: This is a daydream, or maybe a nightmare, and not a serious proposal. Colonizing Wyoming for political purposes would not be a happy experience for either the immigrants or the natives. The scheme belongs in the same category as a plan announced by a former Mormon bishop to build a new city of a million people in Vermont. (Vermont has a population of about 624,000, the second smallest among U.S. states.)

Rather than trying to flip Wyoming, maybe one should try to fix it. Why is it the least populated state, and the most Republican? Why is so much of the landscape vacant? Why aren’t entrepreneurs with dreams of cryptocurrency fortunes flocking to Cheyenne or Casper with their plans for startup companies?

The experience of driving through the state on I-80 suggests some answers to these questions. I found myself wondering how even the existing population of a few hundred thousand manages to sustain itself. Wikipedia says there’s some agriculture in the state (beef, hay, sugar beets), but I saw little evidence of it. There’s tourism, but that’s mostly in the northwest corner, focused on Yellowstone and Grand Teton national parks and the cowboy-chic enclave of Jackson Hole. The only conspicuous economic activity along the I-80 corridor is connected with the mining and energy industries. My very first experience of Wyoming was olfactory: Coming downhill from Pine Bluffs, Nebraska, I caught of whiff of the Frontier oil refinery in Cheyenne; as I got closer to town, I watched the sun set behind a low-hanging purple haze that might also be refinery-related. The next day, halfway across the state, the Sinclair refinery announced itself in a similar way.

Sinclair refinery in Sinclair, Wyoming

Still farther west, coal takes over where oil leaves off. The Jim Bridger power plant, whose stacks and cooling-tower plumes are visible from the highway, burns locally mined coal and exports the electricity.

Jim Bridger power plant 5582

As the author of a book celebrating industrial artifacts, I’m hardly the one to gripe about the presence of such infrastructure. On the other hand, oil and coal are not much of a foundation for a modern economy. Even with all the wells, the pipelines, the refineries, the mines, and the power plants, Wyoming employment in the “extractive” sector is only about 24,000 (or 7 percent of the state’s workforce), down sharply from a peak of 39,000 in 2008. If this is the industry that will build the state’s future, then the future looks bleak.

Economists going all the way back to Adam Smith have puzzled over the question: Why do some places prosper while others languish? Why, for example, are Denver and Boulder so much livelier than Cheyenne and Laramie? The Colorado cities and the Wyoming ones are only about 100 miles apart, and they share similar histories and physical environments. But Denver is booming, with a diverse and growing economy and a population approaching 700,000—greater than the entire state of Wyoming. Cheyenne remains a tenth the size of Denver, and in Cheyenne you don’t have to fight off hordes of hipsters to book a table for dinner. What makes the difference? I suspect the answer lies in a Yogi Berra phenomenon. Everybody wants to go to Denver because everyone is there already. Nobody wants to be in Cheyenne because it’s so lonely. If this guess is correct, maybe we’d be doing Wyoming a favor by bringing in that invasion of 125,000 sandal-and-hoodie–clad bicoastals.

sign at the continental divide, elevation 7000One more Wyoming story. At the midpoint of my journey across the state, near milepost 205 on I-80, I passed the sign shown at left. I am an aficionado of continental divide crossings, and so I took particular note. Then, 50 miles farther along, I passed another sign, shown at right. continental divide sign at elevation 6930 On seeing this second crossing, I put myself on high alert for a third such sign. This is a matter of simple topology, or so I thought. If a line—perhaps a very wiggly one—divides an area into two regions, then if you start in one region and end up in the other, you must have crossed the line an odd number of times. Shown below are some possible configurations. three possible ways of crossing a wiggly continential divideIn each case the red line is the path of the continental divide, and the dashed blue line is the road’s trajectory across it. At far left the situation is simple: The road intersects the divide in a single point. The middle diagram shows three crossings; it’s easy to see how further elaboration of the meandering path could yield five or seven or any odd number of crossings. An arrangement that might seem to generate just two crossings is show at right. One of the “crossings” is not a crossing at all but a point of tangency. Depending on your taste in such matters, the tangent intersection could be counted as crossing the divide twice or not at all; in either case, the total number of crossings remains odd.

In the remainder of my trip I never saw a sign marking a third crossing of the divide. The explanation has nothing to do with points of tangency. I should have known that, because I’ve actually written about this peculiarity of Wyoming topography before. Can you guess what’s happening? Wikipedia tells all.

Posted in mathematics, modern life, social science, technology | 1 Comment

Sir Roger Penrose’s Toilet Paper

Penrose Tiling Rhombi Wikimedia
Twenty years ago, Kimberly-Clark, the Kleenex company, introduced a line of toilet paper embossed with the kite-and-dart aperiodic tiling discovered by Roger Penrose. When I first heard about this, I thought: How clever. Because the pattern never repeats, the creases in successive layers of a roll would never line up over any extended region, and so the sheets would be less likely to stick together.

Sir Roger Penrose had a different response. Apparently be believes the pattern is subject to copyright protection, and he also managed to get a patent issued in 1979, although that would have expired about the time of the toilet paper scandal. Penrose assigned his rights to a British company called Pentaplex Ltd. An article in the Times of London quoted a representative of Pentaplex:

So often we read of very large companies riding roughshod over small businesses or individuals, but when it comes to the population of Great Britain being invited by a multinational [company] to wipe their bottoms on what appears to be the work of a knight of the realm without his permission, then a last stand must be made.

Sir Roger sued. I haven’t been able to find a documented account of how the legal action was resolved, but it seems Kimberly-Clark quickly withdrew the product.

Some years ago I was given a small sample of the infamous Penrose toilet paper. It came to me from Phil and Phylis Morrison; a note from Phylis indicates that they acquired it from Marion Walter. Now I would like to pass this treasure on to a new custodian. The specimen is unused though not pristine, roughly a foot long, and accompanied by a photocopy of the abovementioned Times news item. In the photograph below I have boosted the contrast to make the raised ridges more visible; in real life the pattern is subtle.

Penrose tiled toilet paper  enhanced

Are you interested in artifacts with unusual symmetries? Would you like to add this object to your collection? Send a note with a U.S. mailing address to If I get multiple requests, I’ll figure out some Solomonic procedure for choosing the recipient(s). If there are no takers, I guess I’ll use it for its intended purpose.

I must also note that my hypothesis about the special non-nesting property of the embossed paper is totally bogus. In the first place, a roll of toilet paper is an Archimedian spiral, so that the circumference increases from one layer to the next; even a perfectly regular pattern will come into coincidence with itself only when the circumference equals an integer multiple of the pattern period. Second, the texture imprinted on the toilet paper is surely not a real aperiodic tiling. The manufacturing process would have involved passing the sheet between a pair of steel crimping cylinders bearing the incised network of kites and darts. Those cylinders are necessarily of finite diameter, and so the pattern must in fact repeat. If Kimberly-Clark had contested the law suit, they might have used that point in their defense.

Posted in mathematics, off-topic, uncategorized | 1 Comment