# A neural data science: how and why

## The rough guide to doing data science on neurons

Quietly, stealthily, a new type of neuroscientist is taking shape. From within the myriad ranks of theorists have risen teams of neuroscientists that do science with data on neural activity, on the sparse splutterings of hundreds of neurons. Not the creation of methods for analysing data, though all do that too. Not the gathering of that data, for that requires another, formidable, skill set. But neuroscientists using the full gamut of modern computational techniques on that data to answer scientific questions about the brain. A neural data science has emerged.

Turns out I’m one of them, this clan of neural data scientists. Accidentally. As far as I can tell, that’s how all scientific fields are born: accidentally. Researchers follow their noses, start doing new things, and suddenly find there’s a small crowd of them in the kitchen at parties (because it’s where the drinks are, in the fridge — scientists are smart). So here’s a little manifesto for neural data science: why it’s emerging, and how we might set about doing it.

The why is the same as all areas of science that have spat out a data science: the amount of data is getting out of hand. For the science of recording lots of neurons, this data deluge has a scientific rationale, of sorts. Brains work by passing messages between neurons. Most of those messages take the form of tiny pulses of electricity: spikes, we call them. So to many it seems logical that if we want to understand how brains work (and when they don’t work) we need to capture all the messages being passed between all the neurons. And that means recording as many spikes from as many neurons as possible.

A baby zebrafish brain has about 130,000 neurons, and at least 1 million connections between them; a bumblebee brain has about one million neurons. You can see how this would get out of hand very quickly. Right now we record somewhere between tens and a few hundred neurons at the same time with standard kit. At the limits are people recording a few thousand, and even a few getting tens of thousands (albeit these recordings capture neuron activity at rates much slower than the neurons could be sending their spikes).

We call this madness systems neuroscience: neuroscience, for the study of neurons; systems, for daring to record from more than one neuron at a time. And the data are mind-bendingly complex. What we have are tens to thousands of simultaneously recorded time-series, each the stream of spiking events (actual spikes, or some indirect measure thereof) from one neuron. They are non-stationary: their statistics change over time. Their rates of activity spread over many orders of magnitude, from monk-like quiet contemplation to “drum kit in a wind tunnel”. And their patterns of activity range from clock-like regularity, to stuttering and spluttering, to alternating between bouts of mania and bouts of exhaustion.

Now marry that to the behaviour of the animal you’ve recorded the neurons from. This behaviour is hundreds of trials of choices; or arm movements; or routes taken through an environment. Or the movement of a sense organ, or the entire posture of the musculature. Repeat for multiple animals. Possibly multiple brain regions. And sometimes whole brains.

We have no ground-truth. There isn’t a right answer; there are no training labels for the data, except the behaviour. We don’t know how brains encode behaviour. So we can do things with behavioural labels, but we almost always know these are not the answer. They are just clues to the “answer”.

Systems neuroscience is then a rich playground for those who can marry their knowledge of neuroscience to their know-how for analysing data. A neural data science is being born.

How is it — or could it be — done? Here’s a rough guide. The raison d’être of the neural data scientist is to ask scientific questions of data from systems neuroscience; to ask: how do all these neurons work together to do their thing?

There are roughly three ways we can answer that question. We can see these three ways by looking at the correspondence between established classes of problems in machine learning and computational challenges in systems neuroscience. Let’s start by looking at what we have to work with.

We have some data from *n* neurons that we’ve collected over time. We’ll lump these into a matrix we’ll call *X* — as many columns as neurons, and as many rows as time-points we’ve recorded (where it’s up to us how long a “time-point” lasts: we might make it short, and just have each entry record a 1 for a spike, and 0 otherwise. Or we might make it long, and each entry records the number of spikes during that elapsed time). Over that time, stuff has been happening in the world — including what the body has been doing. So let’s lump all that into a matrix we’ll call *S* — as many columns as there are features in the world we care about, and as many rows as time-points we’ve recorded for those features.
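To make the matrix *X* concrete, here is a minimal sketch in Python that bins spike times into time-points. The spike times and bin width are entirely made up for illustration:

```python
import numpy as np

# Hypothetical example: spike times (in seconds) from 3 neurons,
# recorded for 1 second. These numbers are invented for illustration.
spike_times = [
    np.array([0.05, 0.12, 0.43, 0.88]),   # neuron 1
    np.array([0.30, 0.31, 0.90]),         # neuron 2
    np.array([0.10, 0.50]),               # neuron 3
]

bin_width = 0.1  # a "long" time-point: entries count spikes per bin
edges = np.arange(0.0, 1.0 + bin_width, bin_width)

# X: as many rows as time-points, as many columns as neurons
X = np.column_stack([np.histogram(st, bins=edges)[0] for st in spike_times])
print(X.shape)  # (10, 3): 10 time bins by 3 neurons
```

Make the bins 1 ms wide instead and *X* becomes the 0-or-1 version; everything below works on either.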

Traditionally, machine learning involves building three classes of models about the state of the world and the available data: generative, discriminative, and density. As a rough guide, this table shows how each class corresponds to a fundamental question in systems neuroscience:

| Model class | Fits | Question in systems neuroscience |
| --- | --- | --- |
| Density | *P(X)* | Is there structure in the spikes? |
| Generative | *P(X\|S)* | What causes a spike? |
| Discriminative | *P(S\|X)* | What information do spikes carry? |

**1/ Density models P(X): is there structure in the spikes?** Sounds dull. But actually this is key to great swathes of neuroscience research, in which we want to know the effect of something (a drug, a behaviour, a sleep) on the brain; in which we are asking: how has the structure of neural activity changed?

With a recording of a bunch of neurons, we can answer this in three ways.

First, we can quantify the spike-train of each neuron, by measuring the statistics of each column of *X*, like the rate of spiking. And then ask: what is the model *P(X)* for these statistics? We can cluster these statistics to find “types” of neuron; or simply fit models to their entire joint distribution. Either way, we have some model of the data structure at the granularity of single neurons.
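A minimal sketch of this first approach, with simulated data: measure two statistics per neuron (rate and Fano factor), then cluster them to find putative “types”. The two invented populations and all parameters are assumptions for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Hypothetical X: 1000 time-points by 40 neurons, drawn from two
# made-up "types": slow spikers and fast spikers.
slow = rng.poisson(0.05, size=(1000, 20))
fast = rng.poisson(0.5, size=(1000, 20))
X = np.hstack([slow, fast])

# Per-neuron statistics from the columns of X:
rate = X.mean(axis=0)            # mean spikes per time-point
fano = X.var(axis=0) / rate      # variability relative to a Poisson process

# Cluster the statistics to find "types" of neuron
stats = np.column_stack([rate, fano])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(stats)
```

Real use would need more statistics (regularity, burstiness, autocorrelation) and a way to choose the number of clusters; this only shows the shape of the pipeline.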

Second, we can create generative models of the entire population’s activity, using the rows of *X* — the vectors of the moment-to-moment activity of the whole population. Such models typically aim to understand how much of the structure of *X* can be recreated from just a few constraints, whether they be the distribution of how many vectors have how many spikes; or the pairwise correlations between neurons; or combinations thereof. These are particularly useful for working out if there is any special sauce in the population’s activity, if it is anything more than the collective activity of a set of independent or boringly simple neurons.
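The simplest version of that logic is a surrogate-data test, sketched below on invented data: build a null population that keeps each neuron’s rate but destroys correlations (by shuffling each column in time), and ask whether the real population is more synchronous than the null. The common-drive simulation and all numbers are assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical binary X: 5000 time-points by 30 neurons with a shared
# drive, so the population is more synchronous than independent neurons.
drive = rng.random(5000) < 0.1              # occasional common "up" moments
p = np.where(drive[:, None], 0.5, 0.02)     # spike probability per bin
X = (rng.random((5000, 30)) < p).astype(int)

# Null model: same single-neuron rates, but neurons made independent
# by shuffling each neuron's spikes in time.
X_null = np.column_stack([rng.permutation(X[:, i]) for i in range(30)])

# Compare synchrony: variance of the population count per time-point.
sync_data = X.sum(axis=1).var()
sync_null = X_null.sum(axis=1).var()
# If sync_data >> sync_null, there is structure beyond single-neuron rates.
```

Full generative models (e.g. fitting the pairwise correlations too) follow the same recipe: add a constraint, regenerate, and see how much of *X*’s structure survives.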

Third, we can take the position that the neural activity in *X* is some high dimensional realisation of a low dimensional space, where the number of dimensions *D* << *n*. Typically we mean by this: some neurons in *X* are correlated, so we don’t need to use the whole of *X* to understand the population — instead we can replace them with a much simpler representation. We might cluster the time-series directly, so decomposing *X* into a set of *N* smaller matrices *X_1* to *X_N*, each of which has (relatively) strong correlations within it, and so can be treated independently. Or we might use some kind of dimension reduction approach like Principal Components Analysis, to get a small set of time-series that each describes one dominant form of variation in the population’s activity over time.
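A sketch of the dimension-reduction route, on simulated data built so that *D* << *n* holds by construction (2 latent signals driving 50 neurons — all invented):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)

# Hypothetical X: 2000 time-points by 50 neurons whose activity is
# driven by just 2 latent signals, plus noise.
latents = rng.standard_normal((2000, 2))
weights = rng.standard_normal((2, 50))
X = latents @ weights + 0.1 * rng.standard_normal((2000, 50))

pca = PCA()
Z = pca.fit_transform(X)                    # the component time-series
explained = pca.explained_variance_ratio_

# The first two components should capture nearly all the variance,
# recovering the low dimensional space the data came from.
print(explained[:3])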

We can do more than this. The above assumes we want to use dimension reduction to collapse neurons — that we apply reduction to the columns of *X*. But we could just as easily collapse time, by applying dimension reduction to the rows of *X*. Rather than asking if neural activity is redundant, this is asking if different moments in time have similar patterns of neural activity. If there are only a handful of these, clearly the dynamics of the recorded neurons are very simple.
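The time-collapsing version can be sketched the same way, here by clustering the rows of *X* — the population vectors — to ask whether different moments share the same pattern. The three recurring patterns and all parameters are made up for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)

# Hypothetical X: each time-point is one of 3 recurring population
# patterns across 20 neurons, plus noise.
patterns = rng.random((3, 20))
which = rng.integers(0, 3, size=1500)       # which pattern at each moment
X = patterns[which] + 0.05 * rng.standard_normal((1500, 20))

# Cluster the rows of X: do different moments in time look alike?
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
labels = km.labels_
```

If a handful of clusters really do capture most time-points, the recorded dynamics are, as the text says, very simple.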

We can throw in dynamical systems approaches here too. Here we try to fit simple models to the changes in *X* over time (i.e. mapping from one row to the next), and use those models to quantify the types of dynamics *X* contains — using terms like “attractor”, “separatrix”, “saddle node”, “pitchfork bifurcation”, and “the Arsenal collapse” (only one of those is not a real thing). One might plausibly argue the dynamical models so fitted are all density models *P(X)*, as they describe the structure of the data.
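The simplest instance of “mapping from one row to the next” is a linear dynamical model, fit by least squares; the eigenvalues of the fitted matrix then summarise the dynamics (decay, rotation, instability). A sketch on simulated linear dynamics — the ground-truth system and noise level are invented:

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical ground truth: x_{t+1} = A x_t + noise, for 5 "neurons".
# 0.9 times an orthogonal matrix gives a stable, rotating system.
A_true = 0.9 * np.linalg.qr(rng.standard_normal((5, 5)))[0]
X = np.zeros((2000, 5))
X[0] = rng.standard_normal(5)
for t in range(1999):
    X[t + 1] = X[t] @ A_true.T + 0.05 * rng.standard_normal(5)

# Fit A by least squares: map each row of X to the next row
C, *_ = np.linalg.lstsq(X[:-1], X[1:], rcond=None)
A_fit = C.T

# Eigenvalues of A_fit characterise the dynamics: magnitudes < 1 mean
# decay towards an attractor; complex pairs mean rotation.
eigvals = np.linalg.eigvals(A_fit)
```

Real neural dynamics are rarely this linear, but the same fit-then-inspect-the-eigenvalues logic underlies far fancier latent dynamical models.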

Hell, we could even try and fit an entire dynamical model of a neural circuit, a bunch of differential equations describing each neuron, to *X*, so that our model *P(X)* is then sampled every time we run the model from different initial conditions.

With these density models, we can fit them separately to the neural activity we recorded in a bunch of different states (*S1*, *S2*,…, *Sm*), and answer questions like: how does the structure of a population of neurons change between sleeping and waking? Or during development of the animal? Or in the course of learning a task (where *S1* might be trial 1, and *S2* trial 2; or *S1* is session 1 and *S2* session 2; or many combinations thereof). We can also ask: how many dimensions does neuron activity span? Are the dimensions different between different regions of cortex? And has anyone seen my keys?

**2/ Generative models P(X|S): what causes a spike?** Now we’re talking. Things like linear-nonlinear models, or generalised linear models. Typically these models are applied to single neurons, to each column of *X*. With them we are fitting a model that uses the state of the world *S* as input, and spits out a neural activity series that matches the neuron’s activity as closely as possible. By then inspecting the weighting given to each feature of *S* in reproducing the neuron’s activity, we can work out what that neuron appears to give a damn about.
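A minimal generalised linear model sketch, with a simulated neuron whose “true” tuning we invent so the fit has something to recover (the feature count, weights, and baseline are all assumptions):

```python
import numpy as np
from sklearn.linear_model import PoissonRegressor

rng = np.random.default_rng(5)

# Hypothetical S: 5000 time-points by 4 world features. The simulated
# neuron truly cares about feature 0 (weight 1.2) and feature 2 (-0.8).
S = rng.standard_normal((5000, 4))
true_w = np.array([1.2, 0.0, -0.8, 0.0])

# One column of X: spike counts with a log-linear dependence on S
spike_counts = rng.poisson(np.exp(S @ true_w - 1.0))

# Fit a Poisson GLM: state of the world in, spike counts out
glm = PoissonRegressor(alpha=1e-4, max_iter=300).fit(S, spike_counts)

# Inspect the weights: which features does this neuron give a damn about?
print(np.round(glm.coef_, 2))
```

The fitted weights on features 1 and 3 should sit near zero — the GLM’s way of saying the neuron doesn’t care about them.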

We might want to choose a model that has some flexibility in what counts as “the state of the world”. We can include the neuron’s own past activity as a feature, and see if it cares about what it did in the past. For some types of neuron, the answer is yes. Bursting can take a lot out of a neuron, and it needs to lie down for a wee rest before it can go again. We can also think more broadly, and include the rest of the population — the rest of *X* — as part of the state of the world *S* while the neuron is firing. After all, neurons occasionally influence each other’s firing, or so I’m led to believe. So there is a tiny chance that the response of a neuron in visual cortex is not **just** driven by the orientation of an edge in the outside world, but may also depend on what the 10,000 cortical neurons connecting to it are also doing. What we then learn is the approximately most influential neurons in the population.

We don’t have to apply these generative models to single neurons. We can equally apply them to our density models; we can ask what each cluster, or dimension, is encoding about the world. Or, as some people have done, we can use the density model itself as the state of the world, and ask what features of that model downstream neurons give a damn about.

The types of questions we can answer with these generative models are pretty obvious: what combination of features best predict a neuron’s response? Are there neurons selective for just one thing? How do neurons influence each other?

**3/ Discriminative models P(S|X): what information do spikes carry?** This is a core question in systems neuroscience, as it is the challenge faced by all neurons that are downstream from our recorded population — all neurons that receive inputs from the neurons we recorded from and stuffed in our matrix *X*. For those downstream neurons must infer what they need to know about the external world based solely on spikes.

Here we can use standard classifiers, that map inputs onto labelled outputs. We can use the rows of *X* as input, each a snapshot of the population’s activity, and attempt to predict one, some, or all of the features in the corresponding rows of *S*. Possibly with some time delay, so we use row *X_t* to predict the state *S_t-n* that was *n* steps in the past if we’re interested in how populations code states that are input to the brain; or we can use row *X_t* to predict the state *S_t+n* that is *n* steps in the future if we’re interested in how populations code for some effect of the brain on the world. Like the activity in motor cortex that is happening before I type each letter right now.

Either way, we take some (but not all, for we do not overfit) rows of *X*, and train the classifier to find the best possible mapping of *X *to the corresponding chunk of *S*. Then we test the classifier on how well it can predict the rest of *S* from the corresponding rest of *X*. If you’re extraordinarily lucky, your *X* and *S* could be so long that you’re able to divide them into train, test, and validate sets. Keep the last in a locked box.
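Both paragraphs can be sketched in a few lines: lag the rows of *X* against a binary world state, hold out some rows (for we do not overfit), and score the decoder on the rest. The simulated population, the lag of 5, and every number here are assumptions for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(6)

# Hypothetical: a binary world state (one column of S, e.g. stimulus
# on/off) drives the population n = 5 time-points later.
n_lag, T = 5, 4000
state = rng.integers(0, 2, size=T)
X = rng.poisson(1.0, size=(T, 30)).astype(float)
X[n_lag:, :10] += 2.0 * state[:-n_lag, None]   # 10 neurons carry the signal

# Align: use row X_t to predict the state S_{t-n}, n steps in the past
X_lagged, S_past = X[n_lag:], state[:-n_lag]

# Train on some rows, test on held-out rows
X_tr, X_te, s_tr, s_te = train_test_split(
    X_lagged, S_past, test_size=0.25, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, s_tr)
accuracy = clf.score(X_te, s_te)
```

Flip the sign of the lag and the same code asks the motor-cortex question: does activity now predict the world *n* steps in the future?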

We could of course use as powerful a classifier as we like. From logistic regression, through Bayesian approaches, to using a 23-layer neural network. It rather depends on what you want out of the answer, and the trade-off between interpretability and power you’re comfortable with. My writings elsewhere have made it plain which side of this trade-off I tend to favour. But I’m happy to be proved wrong.

Encoding models of neurons are insightful, but touch on some old and deep philosophical quandaries. Testing encoding using a discriminative model assumes that something downstream is trying to decode *S *from neural activity. There are two problems with this. Neurons do not decode; neurons take in spikes as input and output their own spikes. Rather, they **re-code**, from one set of spikes into another set of spikes: perhaps fewer, or slower; perhaps more, or faster; perhaps from a steady stream into an oscillation. So discriminative models are more accurately asking what information our neurons are re-coding. But even if we take this view, there’s a deeper problem.

With very few exceptions, there is no such thing as a “downstream” neuron. The neurons we recorded in *X* are part of the intricately wired brain, full of endless loops; their output affects their own input. Worse, some of the neurons in *X* are downstream from the others: some of them input directly to the others. Because, as noted above, neurons influence each other.

A rough, perhaps useful, manifesto for a neural data science. It is incomplete; no doubt something above is wrong (answers on a postcard to the usual address). The above is an attempt to synthesise the work of a group of labs with very disparate interests, but a common drive to use these kinds of model on big sets of neural data to answer deep questions about how brains work. Many of these are data labs, teams that analyse experimental data to answer their own questions; to name a few — Jonathan Pillow; Christian Machens; Konrad Kording; Kanaka Rajan; John Cunningham; Adrienne Fairhall; Philipp Berens; Cian O’Donnell; Il Memming Park; Jakob Macke; Gasper Tkacik; Olivier Marre. Um, me. Others are experimental labs with strong data science inclinations: Anne Churchland; Mark Churchland; Nicole Rust; Krishna Shenoy; Carlos Brody; many others I apologise for not naming.

There are conferences where this kind of work is welcomed, nay even encouraged. A journal for neural data science is on its way. Something is building. Come on in, the data’s lovely*.

* yeah I had to refer to data as a singular to get that crap joke to work. The fact I’m writing this footnote to explain this will give you some idea of the fastidious attention to detail neural data scientists expect.

*Want more? Follow us at* **The Spike**

*Twitter: @markdhumphries*