Combining biophysical modeling and deep learning for multielectrode array neuron localization and classification X

X Alessio P. Buccino,* Michael Kordovan,* Torbjørn V. Ness, Benjamin Merkt, Philipp D. Häfliger, Marianne Fyhn, Gert Cauwenberghs, Stefan Rotter,* and X Gaute T. Einevoll* Center for Integrative Neuroplasticity (CINPLA), Faculty of Mathematics and Natural Sciences, University of Oslo, Oslo, Norway; Department of Bioengineering, University of California, San Diego, California; Bernstein Center Freiburg, Freiburg, Germany; Faculty of Biology, University of Freiburg, Freiburg, Germany; and Faculty of Science and Technology, Norwegian University of Life Sciences, Ås, Norway


INTRODUCTION
The neural circuits of the brain involve the interplay of many different types of neurons.On the most superficial level, neurons are classified by their effect on the neurons they project to as either excitatory or inhibitory.Extracellular recordings enable us to measure the activity of neurons as electrical potential deflections mainly due to ionic transmembrane currents.In recent years, many groups have been developing high-density multielectrode arrays (MEAs) for in vitro and in vivo applications, which allow measurements of neuronal activity with high spatiotemporal resolution (Berdondini et al. 2014;Müller et al. 2015;Neto et al. 2016;Schröder et al. 2015;Welkenhuysen et al. 2016).These probes provide something close to a functional electrical imaging (Vassanelli 2014) of the neural activity, and this high information density can potentially be exploited to obtain a better understanding of the neural circuits under study.Specifically, it might be possible to extract information that could be used to localize the individual neurons and to classify them into their respective cell types.The latest developments in manufacturing of high-density neural probes call for novel analysis methods to exploit the richness and detail in the recordings.
Neural localization from extracellular action potentials (EAPs) recorded on a MEA is an ill-posed problem, since solutions might not be unique for complex neural morphologies.Current approaches for localization assume simple neuronal models to facilitate the inverse problem and make the solution unique.Examples of models used in previous studies are monopole current-source models (Blanche et al. 2005;Chelaru and Jog 2005;Kubo et al. 2008), dipole current-source models (Blanche et al. 2005;Mechler et al. 2011;Mechler and Victor 2012), as well as line-source models (Somogyvári et al. 2012).More recently, Delgado Ruz and Schultz (2014) used a modified ball-and-stick model (Pettersen and Einevoll 2008) to predict somatic locations.In these approaches, the soma posi-tion is estimated by minimizing the error between the electrical potential on the MEA sites predicted by the chosen model and the recorded potential.However, it is experimentally challenging to measure the correct position of the soma (Neto et al. 2016); therefore, detailed computational neuronal models are usually used and treated as simulated ground truth to evaluate the accuracy of the localization methods (Delgado Ruz and Schultz 2014; Somogyvári et al. 2005Somogyvári et al. , 2012)).The main limitations regarding neuron localization are that the models chosen to solve the inverse problem are often too simple to grasp complex spike waveforms (e.g., monopolar or bipolar currentsource models) or are tuned to specific cell types (ball-andstick model for pyramidal morphology).
Regarding classification of neurons, unsupervised clustering techniques are commonly applied to differentiate EAP shapes (Barthó et al. 2004;McCormick et al. 1985;Peyrache et al. 2012): narrow waveforms are considered to be fast-spiking inhibitory neurons and broad waveforms excitatory neurons.Also in this case, it is experimentally challenging, especially in vivo, to validate the classification methods.One approach is to measure a multitude of neurons and find putative monosynaptic connections based on the shape of spike train cross-correlograms.However, the rate of recorded monosynaptic connections is usually very low (ϳ0.2%;Barthó et al. 2004;Peyrache et al. 2012), resulting in a small number of observations useful for validation.In neural classification, the complexity of spike shapes across the multiple recording sites is usually compressed by extracting spike widths (such as peak-to-peak and full-width half-maximum widths; Barthó et al. 2004;Peyrache et al. 2012) only from the electrode with the highest recorded amplitude.
In this study, we apply a powerful machine learning technique, namely, convolutional neural networks (CNNs), to clas-sify excitatory and inhibitory neurons and to localize their somata from simulated EAPs.This approach-being a supervised learning algorithm-demands for a large amount of labeled data, in this case EAPs in combination with soma position and cell type of the neuron evoking the EAPs.The proposed method is schematically depicted in Fig. 1.First, compartmental cell simulations are performed (Fig. 1A) that yield EAP data sets with known simulated ground truth (forward modeling) (Fig. 1B).Relying on the simulations, CNNs are trained (Fig. 1C) on these data sets to predict position and cell type (Fig. 1D) of the neuron generating the simulated EAP.If the method is applied to experimental data (Fig. 1E), a spike-triggered average EAP (average waveform) serves as input to a CNN previously trained on simulated data to predict soma position and cell type.In addition to binary classification, we attempt to distinguish 11 different morphological types (m-type classification).The performance of the CNNs depending on different characteristics extracted from the EAP, different MEA designs, and different CNN configurations is explored.Finally, we evaluate the effect of varying neuron alignments with respect to the recording MEA.To put our approach into context, we compare its performance with established methods of cell localization and classification.
CNNs perform supervised machine learning and require a large data set to be trained.It would be experimentally challenging, if not unfeasible, to gather the required number of recordings of exact known position (used for localization) and morphology (used for classification) information.Therefore, we rely on detailed compartmental cell models to provide detailed simulated recordings and simulated ground-truth information.Forward biophysical modeling of extracellular potentials has been developed and refined over the last 30 years (Gold et al. 2006;Holt and Koch 1999;Lindén et al. 2014;Fig. 1. Graphical representation of the presented method.Red arrows show our approach for training (dashed lines indicate error backpropagation) and validating the convolutional neural network (CNN) on simulated data.Green arrow shows how the method is applied within an experimental pipeline.Starting with the red path, biophysical simulations (A) are used to generate extracellular action potential (EAP) templates (B), from which features (e.g., amplitude and width; see Classification and Localization Features) are extracted and fed to a CNN (C) to localize and classify excitatory (blue) and inhibitory (red) neurons (D).When applied to experimental data (green arrow), recordings are first spike sorted (E), then features are extracted from spike-triggered average waveforms (B), and finally the CNN trained on simulated data (C) is used to localize and classify spike-sorted neurons (D).Pettersen and Einevoll 2008;Rall and Shepherd 1968), and it is a still growing field.Therefore, we expect that computer simulations will become closer to real recordings as the detail and fidelity of neural models increase.
The aim of this study was to investigate the capability and feasibility of the proposed deep learning approach and to explore the large parameter space to guide future developments and experiments.At this stage, the method is a proof of concept, as we used some simplifications along the way.However, valuable information can be obtained and used in the future.Simulation software scripts are available at https:// github.com/CINPLA/NeuroCNN.

Cell Models
We first generated simulated spike recordings from various neuronal cell types (Fig. 1A).The neuronal models have been adopted from the Neocortical Microcircuit Collaboration (NMC) Portal (Markram et al. 2015;Ramaswamy et al. 2015, https://bbp.epfl.ch/nmc-portal), a database containing cell models from juvenile rat somatosensory cortex.We focused on neurons in layer (L)5, but the methods described can be applied to a larger variety of neuronal models.The cell models were used unmodified and are based on experimentally reconstructed morphologies, with up to 13 different types of active ion channels, depending on the cell type.The dynamics of these active ion channels have been fitted to reproduce specific features of somatic current-injection experiments, such as the amplitude and width of the evoked action potentials, the depth and timing of the following afterhyperpolarization, the mean interspike interval, spiking adaptations, and the amplitude of the backpropagating action potential at different positions along the apical dendrite of pyramidal cells (PC) (Markram et al. 2015).Importantly, this means that the source of the cell type-specific variability that we observe in, e.g., the spike widths is grounded in experimental data.For a more detailed description of the cell models the reader is referred to Markram et al. (2015).Note also that each of the cell models used can be explored interactively online through the NMC Portal (Ramaswamy et al. 2015, https:// bbp.epfl.ch/nmc-portal/microcircuit#/layer/L5).
In principle, the data set contains 13 types of morphologies (mtypes) in L5, in accordance with the NMC Portal.However, because of the limited data variety in the case of bipolar and neuroglial cells (BP and NGC), we excluded them from the analyses unless elsewhere specified.APPENDIX A describes the data set in detail and explains how we manipulated the original data set to extract unbiased sub-data sets for training and validating the results.A single-cell morphology of each m-type is displayed in Fig. 2, divided into inhibitory and excitatory neurons.Axons are not depicted because only their initial tract is included in the simulations.

Simulated Recordings
Extracellular action potential computation.Each of the multicompartment neuronal models was simulated separately, and extracellular potentials were calculated in two steps.First, transmembrane currents were computed by solving the cable equation (Holt and Koch 1999) with LFPy (Lindén et al. 2014) running on NEURON 7.4 (Carnevale and Hines 2006;Hines et al. 2009).A constant depolarizing current was applied to the soma to get the neuron firing at least 10 times (and not more than 30 times) in a simulation period of 1.2 s.Multiple spikes were simulated to account for spike-to-spike variations due to electrophysiological properties (e-types).All transmembrane currents for each compartment were stored within a time window t ϭ -2 ms and t ϭ 5 ms, where t ϭ 0 is the time of the intracellular membrane voltage peak considered as spike time.Simulations were run with a time resolution of dt ϭ 2 -5 ms, i.e., with a sampling frequency of 32 kHz, so that a single spike window of 7-ms duration (2 ms ϩ 5 ms) consists of 224 samples.
Second, transmembrane currents were used within LFPy to calculate the extracellular potential at the recording site.Each transmembrane current, including the somatic one, was distributed over a line source with the length of its corresponding neural segment.With quasi-static approximation (Nunez and Srinivasan 2006) and assuming a homogeneous and isotropic neural tissue with conductivity ϭ 0.3 S/m (Goto et al. 2010), the contribution of each compartment i at position r i with transmembrane current I i (t) to the electric potential on an electrode at position r j reads (Holt and Koch 1999;Lindén et al. 2014;Pettersen and Einevoll 2008) i (r j , t) ϭ 1 4 The simulated EAP was obtained by summing up the contributions of all compartments.For each recording site, the electric potential was computed at a single point corresponding to the center of the recording electrode.These EAPs are associated with the templates in Fig. 1B.
Only spikes generating a notable waveform with a peak-to-peak amplitude exceeding 30 V on at least one of the electrodes were included in the data set.The detection threshold of 30 V was chosen to mimic experimental settings, where noise in the recordings due to equipment electronics and background neural signals does not allow detection of low-amplitude action potentials.
The coordinate system was fixed in reference to the MEA plane.Each recording site (different MEA designs are explained in MEA designs) lay within the yz-plane, and neuron somata were located within the semispace of the positive x-axis (the x-coordinate is thus the distance from the MEA).For each neuron, 1,000 EAP recordings above the detection threshold were simulated by randomly choosing one of the generated spikes and by placing the soma at random locations with distances x between 10 m and 80 m.The y and z boundaries were different for each MEA, and a neuron could exceed the y and/or z boundary of the MEA by a maximum of 30 m.For the EAP, we considered the contributions of all dendritic compartments, including those crossing the MEA.This was done to force the sum of transmembrane currents to be zero (Pettersen and Einevoll 2008).In other words, we did not clip neurites entering the probe, but we made sure that their position did not coincide with a recording site within LFPy.
MEA designs.We investigated the performance of seven different MEA probes.Five of these were based on the prototype described in Schröder et al. (2015), in which the recording sites are arranged in a 16 ϫ 16 matrix on a penetrating shank.Instead of considering a single interelectrode distance, we investigated five different pitches (i.e., distance between the centers of adjacent electrodes), namely, 10, 15, 20, 25, and 30 m.The probe models were built in a way that they roughly covered the same area on the shank.
Moreover, we simulated recordings on the Poly3-25s probe (Neuronexus Technologies), which has 32 electrodes in three columns with a hexagonal arrangement, a y-pitch of 18 m, and a z-pitch of 22 m.Another probe becoming popular is the NeuroPixels probe (Jun et al. 2017), with recording sites arranged in a checkerboard pattern with a y-pitch of 32 m and a z-pitch of 20 m.Although the probe has 384 recording channels, for convenience we decided to trim it to 128 channels.Finally, we constructed a model of the NeuroSeeker probe (http://www.neuroseeker.eu;used in Neto et al. 2016), a MEA with 128 recording sites arranged in a 4 ϫ 32 configuration and a regular interelectrode distance of 22.5 m. Figure 3 shows all the probes in the yz-plane.
The CNNs we used required a rectangular shape of the input data.The two dimensions of the data array correspond to the number of electrode sites N y and N z in y-and z-directions, respectively.Therefore, we cut the Neuronexus-32 MEA probe to a Neuronexus-30, which is shown in the fourth position from the left in Fig. 3.
Neuron-MEA alignment.We investigated different neuron-MEA alignments (or rotations) of neurons.Some neurons, such as pyramidal cells (PC) or bipolar cells (BP), have morphologies that follow a specific orientation (see Fig. 2) that might affect the classification and localization performance.For this reason, we generated three rotational data sets: Norot: The orientation of the cells (e.g., the apical dendrite of PCs) was along the z-axis (same orientation as in Delgado Ruz and Schultz 2014 and Somogyvári et al. 2012).
Physrot: Neurons with a preferential orientation were randomly rotated such that after rotation their axis from white matter toward the pia pointed into a cone around the z-direction with an opening angle of 15°( the puncture point on the unit sphere is uniformly distributed in this spherical cap).Neurons without a preferential orientation were rotated randomly in the three-dimensional (3D) space.We considered all S q M E A -1 5 -1 0 S q M E A -1 0 -1 5 S q M E A -7 -2 0 S q M E A -6 -2 5 S q M E A -5 -3 0 TTPC1 NGC Fig. 3. Multielectrode array models used in the study.Right: plots show an excitatory cell [thick-tufted pyramidal cell (PC) with late bifurcating apical tuft (TTPC1)] and an inhibitory cell [neuroglial cell (NGC)] to compare probe and neuron sizes.PCs are on average much larger than inhibitory cells, and only a portion of the neuron is located directly in front of the probe (the apical dendrite is not fully shown, and it can be seen in Fig. 2).
The soma positions corresponded to the intersection point of rotation axes and were shifted randomly inside the observation volume in all cases.Figure 4 displays a sample orientation with respect to the MEA of a PC [thick-tufted PC with late bifurcating apical tuft (TTPC1)] in each of the three data sets, Norot (Fig. 4A), Physrot (Fig. 4B), and 3drot (Fig. 4C).

Classification and Localization Features
We extracted features from the EAPs as input variables to a CNN for training.Since classification and localization of neurons from extracellular recordings are quite different tasks, we used different sets of features from the simulated spikes.The pipeline for extracting feature images is described in Fig. 5. First, neurons with known cell type and position were simulated and the spike traces on the MEA probe were obtained.Then, for each spike, a set of features was computed and these features were rearranged in a 2D shape according to the MEA arrangement, i.e., the feature image.In the following paragraphs N, N y , and N z are the total number of electrodes, the number of electrodes in y-direction, and the number of electrodes in z-direction, respectively.
Localization features.The goal of localization is to estimate the soma position with respect to the probe.Therefore, we considered only the EAP negative peak and the positive peak time points, during which negative and positive transmembrane currents are larger in proximity of the soma (Delgado Ruz and Schultz 2014;Gold et al. 2007;Somogyvári et al. 2005Somogyvári et al. , 2012)).For simplicity, we refer to the EAP negative peak as Na peak because, close to the soma, it is mainly attributed to the sodium currents flowing into the soma.The positive peak is referred to as Rep because it is associated with the cell repolarization phase.The peak values were computed with respect to a reference of 0 V, i.e., the baseline, as follows: NA.For each spike recording on N electrodes, the spike with the largest negative peak amplitude was identified.At the time instant when the minimum peak occurred (t min ) the recorded potential on all N electrodes was used to build the Na image (the amplitude values are sampled at the same time instant t min ).
REP.The time instant of the repolarization peak (t max ) was extracted from the spike trace with the largest negative trough (same electrode as Na) and a Rep image was built by probing all N electrodes at t max .
Overall, the localization-specific (N y , N z )-dimensional sets of features are Na, Rep, and NaRep, where the last is a stacked version of both features having dimension (N y ,N z , 2).
Classification features.From each spike, we extracted features that are commonly used for cell classification (Barthó et al. 2004;Peyrache et al. 2012): peak-to-peak width (W), full-width half-maximum (F), and peak-to-peak amplitude (A).The peak-to-peak amplitude A, despite not being one of the commonly used features for classifying neurons from extracellular recordings, was selected as a feature in combination with F and W because spike widths increase with increasing recording distance (Anastassiou et al. 2015;Hagen et al. 2015;Pettersen and Einevoll 2008) and therefore with decreasing amplitude.
The following is a list including a detailed description of the features: Fig. 4. Neuron-multielectrode array (MEA) alignments: in each panel we show a sample orientation of a pyramidal cell [thick-tufted pyramidal cell with late bifurcating apical tuft (TTPC1)] placed at a fixed position of (50 m, 0, 0).The MEA is the SqMEA-10-15, and the colors of each recording site (displayed as an image at top left of each plot) show the qualitative sodium peak image (Na image, explained in Classification and Localization Features).A: Norot: the pyramidal cell main axis is aligned to the z-direction.B: Physrot: the TTPC1 is rotated around the z-axis, and the apical dendrite is tilted at maximum 15°inside a cone around the z-axis.C: 3drot: the neuron can be rotated along all axes and might end up with its apical dendrite entering the MEA probe, as depicted in the plot.While the Na image is similar for A and B, it can become quite different in C, when the neuron is 3D-rotated.A: For each recording site, the peak-to-peak amplitude was extracted as the absolute difference between the negative peak and the following positive repolarization peak.If the amplitude value of a recording site was less than a detection threshold of 5 V, then the amplitude for that electrode was set to zero.W: For each electrode site, the width was computed as the time difference between the negative peak and the following positive repolarization peak.If the amplitude of the corresponding electrode was below the detection threshold (i.e., when the amplitude feature was set to 0), the width was set to the duration of the spike window, which was 7 ms.F: For each electrode site, the full-width half-maximum was computed as the width at 50% of the negative maximum amplitude (refer to Fig. 5 for a graphical visualization and further explanation).In this case, the reference voltage was the initial voltage on the selected electrode site at beginning of the spike window.If the amplitude of the corresponding electrode was below the detection threshold (when the amplitude feature was set to 0), F was set to the duration of the spike window, which was 7 ms.
For classification, we considered the feature combinations AW (N y , N z , 2), FW (N y , N z , 2), and AFW (N y , N z , 3), where the shapes of input arrays to the CNN are indicated in parentheses.
Waveform.We also investigated the performance using the entire spike waveform as input to the CNNs for localization and classification.While localization and classification features focused on amplitudes at specific time points (e.g., Na, Rep) or on extracting significant spike shape parameters (A, F, W), here we took into account the evolution of the action potential in time.As the additional third dimension (2D ϩ time) increased the training time significantly, we downsampled the initial spike waveforms from 224 to 14 samples, i.e., with a downsampling factor of 16.We denoted this feature set, with a shape of (N y , N z , 14), as Waveform.

Convolutional Neural Network
CNNs are biologically inspired artificial neural networks, and they differ from standard artificial neural networks mainly by the use of convolutional layers.The biological inspiration originates from the information processing in the visual system (Krizhevsky et al. 2012;Zeiler and Fergus 2014).For our implementation, we used the opensource software TensorFlow to train and evaluate the network (see Abadi et al. 2016; software available from https://www.tensorflow.org).All computations were done on the HPC clusters NEMO in Freiburg and ABEL in Oslo.
Configuration.We investigated the performance of CNNs of different sizes, all having the same underlying configurations (except for Waveform input features, whose CNN morphology is explained at the end of this section).Five CNN sizes (XS, S, M, L, XL) were used, and they differ in the size k i of convolutional kernels (the index i ʦ {1,2} specifies the convolutional layer), convolutional layer depths d i , and the number of nodes in the fully connected layer n FC.The values used for different sizes are listed in Table B1 Feature images of dimension (N y , N z ) were input to a d 1 -deep convolutional layer with rectified-linear units that filter the input image with (k 1 , k 1 ) kernels and a stride of 1. Then max-pooling was applied, and the image was shrunk to a (n y , n z ) ϭ (N y /2, N z /2) footprint.Another d 2 -deep convolutional layer with rectified-linear units applied (k 2 , k 2 ) kernels, and a second max-pooling operation reduced the output image features to a (m y , m z ) ϭ (n y /2, n z /2) size.The (m y , m z ) features were input to a fully connected layer with n FC nodes.The fully connected nodes were connected to the nodes in the output layer (see Output layer and optimization).
The Waveform feature set differed from the classification-and localization-specific sets as it included time as a dimension.Although some feature images for localization and classification were concatenated and thus had a 3D shape [for example, NaRep had a shape of (N y , N z , 2) and AFW is (N y , N z , 3)-dimensional], the optimized kernels were the same for two or three dimensions.For the Waveform feature set, a 3D CNN was used, i.e., convolutional kernels were 3D with a shape of (k 1 , k 1 , k t ) and (k 2 , k 2 , k t ).Max-pooling was also applied in all three dimensions.For the Waveform feature set, we used a CNN with k t ϭ 4.
Output layer and optimization.The output layer of the network was different depending on whether localization or classification was performed.In case of localization, three output nodes linearly summed the fully connected node inputs and biases to output the x-, y-, and z-coordinates.Optimization minimized the mean squared error between the predicted x-, y-, and z-coordinates and the true soma positions of the training spikes.For classification, there were instead two output nodes in case of excitatory/inhibitory classification.For the m-type classification, we used 11 output nodes, 1 for each cell type under consideration (see Cell Models for a list of m-types).During training, softmax cross entropy was minimized (Goodfellow et al. 2016).
For both localization and classification we used an Adam optimizer (learning rate ϭ 0.0005) (see Kingma and Ba 2014), and we ran 2,000 training epochs.At each iteration, 10% of the training observations were randomly sampled and used to update network weights with backpropagation.To limit overfitting, we used dropout on the fully connected layer (Srivastava et al. 2014) with a dropout rate of 0.3 (during training 30% randomly chosen fully connected nodes were dropped and not considered for updating the network weights).
Validation strategy.To estimate the accuracy of the CNNs, we divided the input data into a training set, used to estimate the CNN parameters, and a validation set, upon which the trained CNN's accuracy was tested.Before the training-validation set division, we preprocessed the data set so that morphologies in the training and validation sets were unique (APPENDIX A).Then, we balanced the occurrence of observations of the same cell type (m-type) by random undersampling.For excitatory/inhibitory classification, we further subsampled the inhibitory neuron observations to match the excitatory ones (in the data set, there are 7 inhibitory cell types-not counting BP and NGC-and 4 excitatory types).The class balancing was performed for training and validation sets separately.For localization (and m-type classification), the training and validation data sets consisted of 44,000 and 11,000 instances, respectively.For classification, we used 32,000 observations for training and 8,000 for validation.

Comparison with Other Models
Localization.In previous work on neural localization, the underlying idea has been to solve the inverse problem by choosing a generative model and minimizing the error between the true extracellular potential and the one predicted by the chosen model.The soma position has been among the model parameters that have been optimized.Several models were used in previous studies: monopole and dipole current-source models (Blanche et al. 2005), line-source models (Somogyvári et al. 2012), and ball-and-stick models (Delgado Ruz and Schultz 2014).We compared our localization approach to inverse problem solutions solved with the EAP negative peak (i.e., Na image) with the following models ( denotes the extracellular conductivity): MONOPOLAR CURRENT SOURCE.A negative monopolar currentsource I s placed at position r s evokes a potential (r) at position r according to The somatic current and soma position are the only parameters to be optimized.The predicted soma position is r s .
BIPOLAR CURRENT SOURCE.Placing a negative current-source I s at position r neg and its positive counterpart (absolute value of I s ) at r pos , the potential at position r reads (r) ϭ (3) In this case, the estimated soma position corresponds to the negative current-source location, which is r neg .This model is equivalent to the two-monopole model in Pettersen and Einevoll (2008).
BALL AND STICK.The ball-and-stick model combines a somatic point-like constant current-source I s at position r s with a dendritic stick of length d len pointing in direction d.We used a modified version of the ball-and-stick model described in Pettersen and Einevoll (2008), since we do not assume net currents to be zero (Delgado Ruz and Schultz 2014).The current along the stick I(x) is assumed to decay exponentially, as confirmed by experimental data (Foust et al. 2010;Goldberg and Yuste 2005;Golding et al. 2001;Gulledge and Stuart 2003;Migliore et al. 1999;Sasaki et al. 2012;Waters et al. 2005).With initial negative value I dend at r s , the current distribution along the stick reads where d denotes the decay constant and x d ʦ [0, d len ] is the position along the dendritic segment (discretized in N seg ϭ 50 uniformly distributed points along the stick of length d len ).The predicted soma location corresponds to r s .The potential at position r is given by the summation of the somatic and dendritic contributions: where each segment is modeled as a line current source (see Eq. 1).
Table 1 summarizes the parameters to be estimated for each described model.
GENETIC OPTIMIZATION.To estimate the model parameters, we minimized the sum of squared errors at each recording site between the extracellular potential predicted by the model and the extracted Na feature image of the true simulated extracellular potential.Optimization was performed with a genetic algorithm implemented with the Distributed Evolutionary Algorithms in Python (DEAP) package (Fortin et al. 2012).We used the ( ϩ ) evolution strategy, which selects the next parents from the common set of the current parents ( individuals) and the offspring ( individuals).More precisely, the algorithm was implemented with the deap.algorithms.eaMuPlus-Lambdafunction.We used ϭ 100, ϭ 200, crossover probability p cx ϭ 0.8, and mutation probability p mut ϭ 0.2.Furthermore, tournament selection (deap.tools.selTournament)and blend crossover (deap.tools.cxBlend)were used for selecting and mating individuals, respectively.Mutation was performed with a random Gaussian mutation (deap.tools.mutGaussian).When an individual was selected for mutation with probability p cx, each of its elements was individually mutated with an individual probability of p ind ϭ 0.3.Gaussian means for all parameters were set to zero, and standard deviations (SDs) were different depending on the parameter.The parameter values are summarized in Table 2, and we constrained the optimization to solutions within biophysically acceptable boundaries (shown in Table 2).
Classification.Besides applying a CNN, the problem of classifying neurons according to their EAP can be done by several other methods (Barthó et al. 2004;Delgado Ruz and Schultz 2014;Peyrache et al. 2012).It is a well-established observation that pyramidal excitatory cells present a broad spike waveform, while inhibitory cells have a narrow one (Barthó et al. 2004).Therefore, a standard way of classifying between the two classes is to plot spike width W and full-width half-maximum F (see Classification and Localization Features for feature extraction details) of the EAP with the maximum amplitude and then cluster the data points in this 2D space (Barthó et al. 2004;Peyrache et al. 2012).Once F and W were computed for the electrode with the maximum peak-to-peak amplitude, we applied two different clustering techniques to the point cloud: k-means and a mixture of Gaussians (MoG) clustering (Pedregosa et al. 2011).While k-means clustering iteratively assigns points to K clusters based on their distances to the cluster means and then recomputes the cluster means with new assignments until convergence, the MoG estimates K Gaussians to fit the data and then labels the data points based on the Gaussian shape.In this case, since the goal is excitatory/inhibitory classification, we set K ϭ 2.

Statistical Analysis
For localization errors, statistical tests were run on the 11,000 validation observations: since all distributions did not satisfy the normality assumption, nonparametric tests were run (1-sided Mann-Whitney U-test; Mann and Whitney 1947).When sample sizes are large, statistical tests are prone to indicate that there is a significant difference (effect) between distributions, resulting in low P values.To characterize whether such difference is relevant, a measure of its magnitude, or effect size, should be included (Sullivan and Feinn 2012).To quantify the effect size, we used Cohen's d coefficient (Cohen 1992), i.e., the difference between population means normalized by the pooled SD.We considered significant differences (low P values) to be negligible (effect size Ͻ 0.2), small (effect size Ӎ 0.2), medium (effect size Ӎ 0.5), or large (effect size Ӎ 0.8).Test results are shown in APPENDIX B, divided by group (data set rotation, Table B3; CNN size, Table B4; feature type, Table B5; probe type, Table B6; and localization method, Table B7).Each entry of the tables shows the Cohen's d coefficient (rounded to 2 decimals) and the significance of the Mann-Whitney U-test with the alternative hypothesis that column group Ͻ row group.

RESULTS
In the following sections, we show localization and excitatory/inhibitory classification results only of m-type cells included in the training data set.Therefore, unless otherwise specified, BP and NGC are excluded from the analysis.The performance measures were different for localization and clas-  sification.In case of localization we used the average total error and for classification the average classification accuracy (ratio between correctly classified observations and total number of observations).Moreover, we analyzed the cell-specific accuracy to get more insight on the classification performance.The average localization error in the x-, y-, and z-directions can be interpreted in the following way.Assuming normally distributed errors, the true soma location is with 32% probability inside a box centered at the predicted soma position with edge lengths of twice the average localization error in the corresponding dimension.The probability rises to 87% with a box edge length of four times the average localization error in the corresponding dimension.

Dependence on Neuron-MEA Alignment
The first question we investigate is how the neuron-MEA alignment affects the localization and classification performance.Three data sets were built, Norot, Physrot, and 3drot.To focus on the effects of alignments, we use fixed feature sets (NaRep for localization, FW for classification; see Classification and Localization Features for definitions), MEA probe (SqMEA-15-10), and CNN size L.
Localization.First, we show the mean and SD of the localization errors along the three axes as well as the total error in Table B2.Each row displays the performance of a rotational data set.Average errors and SDs are 7.3 Ϯ 5.7 m for Norot, 7.8 Ϯ 6.3 m for Physrot, and 8.9 Ϯ 8.2 m for 3drot.
The Norot data set results in significantly lower errors with respect to the 3drot data set (effect size ϭ 0.21; Table B3).Negligible differences are found in the comparison between the Norot errors and the Physrot errors (effect size ϭ 0.09) and between the Physrot and 3drot data sets (effect size ϭ 0.13).Taking into account the finite size of the soma (ϳ10 -15 m of diameter), we consider the resulting error values to be a sufficient performance for most applications.The errors along the three axes appear to be isotropically distributed, as they show similar values in all directions (but the observations in the x-direction are not uniformly distributed-as shown below in Fig. 9C-since we only considered spikes above 30 V and spike amplitude decreases with distance).
In Fig. 6, A-C, we show the errors along x-, y-, and z-axes with respect to the x-, y-, and z-coordinates for the three rotational data sets.In these plots, we bin the true x, y, and z neuron positions in seven bins and treat them as categorical data (i.e., the positions can have discrete values depending on the bin they belong to).The data points are the mean of the error grouped by bin and rotation type and the error bar is the SD. Figure 6A shows that errors in the distance estimation (x-direction) tend to increase as the distance of the neuron increases for all three data sets, similarly to Delgado Ruz and Schultz (2014).Regarding the y-and z-dimensions (Fig. 6, B and C, respectively), it is interesting to note how the errors have a convex shape, meaning that errors tend to increase when the neuron is at the border of the probe, in which case only partial information about the spike is available.When looking at the distribution of the predicted x-, y-, and z-coordinates with respect to the true coordinates in Fig. 6, D-F, one can note how the errors observed in Fig. 6, A-C, are caused by an underestimation of the soma distance in all dimensions: at large distances (x-direction) from the probe, neurons are predicted to be closer to the MEA; when they are close to the y and z borders of the probe, the predicted position is closer to the center of the MEA.
Next, we consider the variability of localization performance depending on cell types and alignment.In Fig. 7A the bar plots show the average total errors and their SDs grouped by neuron morphology type (11 training morphologies ϩ BP and NGC; see Fig. 2 for representative cells) for the Norot, Physrot, and 3drot data sets.The range and distribution of distances taken into consideration are the same as in Fig. 6.Focusing on the Physrot data set, the minimum error of 4.7 Ϯ 3.8 m is obtained for the SBC morphology, while the worst performance is 15.9 Ϯ 9.1 m for slender-tufted PC (STPC) morphology.The difference in prediction performance with respect to cell type does not seem to be depending on excitatory/ inhibitory morphologies (i.e., pyramidal and nonpyramidal cells), nor do they look to be clustered depending on morphological subclasses; for instance, among the different basket cells (names ending with BC) there is some variability among large basket cells (LBC), NBC, and SBC, and the same holds for pyramidal cells [STPC, TTPC1, thick-tufted PC with early bifurcating apical tuft (TTPC2), and untufted PC (UTPC)].The performance of BP and NGC (Fig. 7A), which were not used for training, is in line with other cell types, with errors of 8.8 Ϯ 2.2 m and 9.4 Ϯ 5.1 m, respectively.This result confirms that the method is capable of dealing with diverse morphologies, as long as the training set contains a large representation of cell types.
Classification.The accuracy analysis of excitatory/inhibitory classification is based on the FW feature set.Table B8 in APPENDIX B shows the classification accuracies for each cell morphology plus the average accuracy and the SD for the different data sets.The cell-specific accuracies are also visualized by color coding in Fig. 7B.
The average classification accuracy is equally high for the Norot and Physrot data sets (98.1 Ϯ 2.4% and 98.0 Ϯ 3.9%, respectively) and lower for the 3drot data set (97.6 Ϯ 3.9%).This is because neurons are rotated with more degrees of freedom; nevertheless, on average the accuracy remains very high in all cases.A closer examination of this result reveals that the main reason for the drop in classification accuracy was misclassification of the chandelier cells (ChCs).The lowest value is the ChC accuracy in the 3drot data set (84.1%).In Fig. 7C, we show the spike shapes in the FW-plane of the electrode site with largest amplitude.Inhibitory neurons mainly lie in the lower left part (narrow spikes).Excitatory neurons are almost perfectly classified as excitatory cells, as shown in Table B8 and Fig. 7B.The spike shapes of ChC in the FW-plane mainly lie at the interface with the excitatory neurons.This might explain why they are harder to classify correctly with respect to the other cell types.

Effect of CNN Size
We investigated how localization and classification performances vary with network size.The results shown in this section were obtained with the SqMEA-10-15 probe, NaRep features for localization, and FW for classifications.For the remaining analyses, boxplots and cumulative distribution functions (cdfs) are used to represent the performance of the localization models.In all boxplots, the box is the interquartile range (IQR), i.e., the 25th and 75th percentiles, the horizontal lines inside the box show the medians, and the red diamonds display the means.The whiskers (horizontal black lines) represent the highest and lowest data values within 1.5 times the IQR.Data points outside the whiskers are plotted as black dots and are regarded as outliers.We obtained the cdfs by sorting the sample and pairing each data point with its normalized rank (percentile).Hence, the point where the cdf crosses 0.5 represents the median of the localization error.Localization.Figure 8A shows the localization errors grouped by CNN size (XS, S, M, L, and XL).Increasing the size of the network improves the performance significantly (Table B4), but for sizes L and XL the average localization error is almost the same [7.8Ϯ 6.3 m for L and 7.3 Ϯ 5.8 m for XL (effect size ϭ 0.09)].If not stated otherwise, networks of size L have been chosen, as they provide a good compromise between performance and time required for training.
Classification.Table B9 in APPENDIX B shows the performance in classification into excitatory and inhibitory neuron types.The highest accuracy (98.6 Ϯ 1.1%) is reached with a network of size M, while all others show a slightly lower performance.A possible explanation for the lower score of the XL network is overfitting to the training set because of the large number of parameters.

Feature Selection
In the previous sections, we have presented results with fixed feature sets (NaRep for localization and FW for classification), eliminating the effects caused by other factors, such as alignment, cell type and CNN size.The following results were obtained on SqMEA-10-15 probes using CNNs of size M (because of the long training time required by 3D CNNs for Waveform feature).
Localization.In Fig. 8, C and D, we display the boxplots and cdf of the errors with varying feature sets for localization.In other studies, either the sodium peak is the only feature used (Blanche et al. 2005;Delgado Ruz and Schultz 2014;Mechler et al. 2011;Mechler and Victor 2012;Somogyvári et al. 2005) or the entire spike time course is modeled (Somogyvári et al. 2012).Here we show that all CNNs relying on peak input show roughly the same performance: average errors Ϯ SDs are 8.8 Ϯ 7.1 m for Na, 8.8 Ϯ 7.9 m for Rep, and 8.8 Ϯ 7.7 m for NaRep.Negligible differences are found when comparing Na, Rep, and NaRep, with effect sizes close to zero (Table B5).
The Waveform CNN results in a lower average prediction error of 6.9 Ϯ 6.5 m, which is significantly better in comparison with Na (effect size ϭ 0.28), Rep (effect size ϭ 0.26), and NaRep (effect size ϭ 0.27; Table B5).We speculate that the performance of the Waveform approach is only slightly increased (by ϳ2 m) for the following reason: when considering the peaks only, transmembrane currents are mainly concentrated around the soma (Delgado Ruz and Schultz 2014; Gold et al. 2007;Somogyvári et al. 2005Somogyvári et al. , 2012)); therefore, the peak features contain almost all information the CNN needs for soma location.
Classification.Classification performances are listed in Table B10 in APPENDIX B. The AFW feature set, with an accuracy of 98.6%, performs better than AW and FW, with accuracies of 98.1% and 97.0%, respectively.The Waveform feature set, which uses a downsampled version of the entire spike, performs almost perfectly on the classification task (accuracy 99.7%).Given these results, the Waveform feature set is better than the other approaches, at the expense of more computationally demanding training procedures.

Performance with Different MEA Probes
We built simulated spikes using eight different MEA models: five of them are square arrays with varying pitch, and the  (for A and B), Table B5 (for C and D), and Table B6 (for E and F).
other three are the NeuroSeeker (Neto et al. 2016), NeuroPixels (Jun et al. 2017) (trimmed to 128 channels), and Neuronexus-32 (clipped to 30 electrodes to make it rectangular) probes.In the following paragraphs, we present the capabilities in terms of neuron localization and classification for the different probes.All simulations shown in this section make use of CNNs of size L, NaRep features for localization, and FW for classification.
Localization. Figure 8, E and F, show localization errors for the eight different probes (boxplots and cdf).Although an error reduction can be observed from square MEA with 30-m pitch to 10-m pitch, as expected, even with a relatively low density (30-m pitch) a CNN can learn localization models with an average error as low as 8.4 Ϯ 6.4 m for the SqMEA-5-30.As a comparison, the average error for the probe with 10-m pitch (SqMEA-15-10) is 7.6 Ϯ 6.4 m.The errors are in the same order also for the Neuronexus probe (mean of 8.5 Ϯ 7.2 m) and for the NeuroSeeker probe (mean of 9.3 Ϯ 7.7 m).When evaluating the performance on 128 sites with the arrangement of the NeuroPixels probe, the average error is 10.8 Ϯ 8.5 m.One may note that even though the NeuroSeeker and Neuronexus probes have lower pitch (NeuroSeeker: 22.5 m in yand z-axes; Neuronexus: 18 m in y-axis and 25 m in z-axis) compared with SqMEA-5-30, their localization error is higher.The reason for this discrepancy might be in the arrangement of the electrodes: while the SqMEA-5-30 has an effective width (considering point electrode contacts) of 120 m, for the NeuroSeeker the effective width (considering point electrode contacts) is 67.5 m and for the Neuronexus it is 36 m.Hence on the NeuroSeeker and Neuronexus probes there is less spatial information in the y-direction, which may explain the reduced localization accuracy.
In general, most comparisons show negligible differences (effect size Ͻ 0.2), except for the NeuroSeeker and NeuroPixels probes.The NeuroSeeker probe performs worse than the high-density square MEAs (SqMEA-15-10: effect size ϭ 0.25, SqMEA-10-15: effect size ϭ 0.22), while the NeuroPixels errors show effect sizes above 0.2 in all comparisons (ranging from 0.43 compared with SqMEA-15-10 to 0.3 compared with Neuronexus) except for the comparison with the NeuroSeeker probe (effect size ϭ 0.19).In case of the NeuroPixels probe the checkerboard arrangement might pose additional difficulties, resulting in even lower performance.
Classification.Table B11 in APPENDIX B shows accuracies for classification with different probes.The average accuracies are very high and almost the same for all probes, from a minimum of 96.6% (SqMEA-6-25, SqMEA-15-10) to a maximum of 98.6% (NeuroPixels-128).

Comparison with Other Approaches
In this section, we compare the CNN approach to other state-of-the-art methods.For localization, we used the monopolar, bipolar, and ball-and-stick models described in Comparison with Other Models to solve the inverse problem on our simulated data sets.Hence the results obtained for other methods might be different with respect to ones in the literature because the number of cell models, the utilized probes, and the neuron-MEA alignments vary.For characterization of excitatory and inhibitory neurons, we compared with commonly used clustering techniques.
Localization.For localization, we use the validation data set on the SqMEA-10-15 probe.The CNN errors displayed in the plots are obtained with the NaRep feature set and a network of size L. In Fig. 9, we show the errors of the simplified models described in Comparison with Other Models and for the CNN method.
We found that the CNN performs significantly better than the inverse approach in all cases, with an average error and SD of 7.8 Ϯ 6.3 m.The large differences between the CNN and the other methods' error distributions are confirmed by the effect sizes: 0.9 for the monopolar approach, 0.68 for the bipolar approach, and 0.87 compared with the ball-and-stick approach (Table B7).Among the models used to solve the inverse problem, the monopolar has a mean error and SD of 21.7 Ϯ 20.9 m, the bipolar model of 15.6 Ϯ 15.2 m, and the ball-and-stick model of 22.6 Ϯ 23.2 m.The better performance of the bipolar model with respect to the monopolar model (and ball-and-stick model) can be due to the fact that it is the only model capable of predicting negative and positive potential values on the MEA.Dendritic branches act as current sources when the soma is depolarized, causing positive deflections in the extracellular potential (Pettersen and Einevoll 2008).
Studying the probability density function (pdf) of the predicted coordinates by different models in Fig. 9, C-E, the monopolar model tends to underestimate the distance (x-coordinate) from the MEA (note sharp peak in the distribution in Fig. 9C).In the y-and z-axes, instead the predictions are closer to the center of the MEA when observations lie outside the boundary of the probe (note the different steepness and shape of the monopolar pdf with respect to the true pdf in Fig. 9, D and E, close to Ϫ100 m and 100 m).Similarly, the bipolar model also underestimates distances in the x-direction, but the underestimation is less severe.In the y-and z-directions it nicely follows the true distribution.The ball-and-stick model has distributions very similar to the monopolar model in all three directions.The CNN approach, on the other hand, is the closest match to the true distribution in all three dimensions.Note that the distribution in the x-direction is not as uniform as in the y-and z-directions (the density decreases with increased distances) because we discarded spikes with a peak-to-peak amplitude below 30 V here.
Classification.For excitatory/inhibitory classification we compared the performance of our CNN approach to standard clustering techniques in the FW space (F: full-width halfmaximum, W: width).In Fig. 10A, we show the validation data with the SqMEA-10-15 probe and the excitatory/inhibitory balanced data sets (8,000 observations).Each point is computed from the recording site with largest amplitude.Although it is true that inhibitory cells cover the bottom left part of the cloud (narrower width and full-width half-maximum) and excitatory cells the top right (wider spike shape), we can observe that there is some overlap between the two groups.When we apply k-means clustering (Fig. 10B), the algorithm correctly assigns the bottom left part to inhibitory neurons and the top right part to excitatory neurons, but the overlap is mainly assigned to the excitatory class.This yields an accuracy of 99.9% for the excitatory class but only 60.7% for the inhibitory one, with an average of 80.3%.When it comes to the MoG, the data are fit to two multivariate Gaussians and labels are assigned based on the probability of an observation to belong to the two distributions.Figure 10C shows the estimated Gaussians (ellipses) and the labeling of the points.Although the MoG is capable of describing the diagonal shape of the excitatory cloud, the overlap between the observations cannot be untangled, resulting in an accuracy of 98.8% for the excitatory class and 54.2% for the inhibitory one, with an average of 76.5%.The CNN method, instead, is able to discern the overlap in the FW space.This is certainly due to its higher complexity, due both to the method itself and to the use of all electrodes' information, not only those with highest amplitude.The CNN result shown here (FW feature set, size L) allows us to correctly predict excitatory cells in 98.9% of the cases and inhibitory cells with an accuracy of 97.1%.This makes it the best-performing method among those compared here, with an overall accuracy of 98.0%.It could be argued that the comparison was somewhat unfair, as our CNN approach considers F and W images (computed on all recording sites), while the clustering is performed with values computed from the electrode with highest amplitude only.Nevertheless, it is not common practice to consider waveforms on all electrodes but only on the one with highest amplitude (Barthó et al. 2004;Peyrache et al. 2012).

m-Type Classification
In addition to separating excitatory cells from inhibitory ones by trained CNNs, we tried to make a finer subdivision and classify cells into morphology classes (m-type) based on the EAP.The approach is similar to that for excitatory/inhibitory classification, but instead of only 2 output classes we take the 11 m-type classes (cells of m-type BP and NGC are excluded, since only 1 morphology is available in the data set).We use a CNN of size L and consider Waveform features (in this case with a downsampling factor of 8, i.e., a sampling frequency of 4 kHz) on the SqMEA-10-15 with the Physrot data set.The resulting confusion matrix E, in which each entry E ij represents the amount of observations of the true cell type i predicted as cell type j, is depicted in Fig. 11.We do not observe a striking diagonal, indicating that full identification of all cell types is not feasible from EAPs.But it is noteworthy that there is some block structure dividing excitatory neurons from inhibitory neurons.This division is learned intrinsically by the network, and inhibitory cells are classified within the inhibitory block in 100.0% of the cases and excitatory cells within the excitatory block in 95.7%.Concerning the mixing of TTPC1 and TTPC2, we do not expect to be able to differentiate between these two types because their only difference is the distance of the bifurcation point of the apical dendrite to the soma.Since the MEA is located close to the somatic region, recordings might not be sensitive to this delicate difference.Disregarding this mixing, the m-type classification performs well (chance would be 9.1%) on excitatory cells and inhibitory Martinotti cells (MC) (80.5%).Note that these well-classified cells make up a large proportion of cortical cells.The overall accuracy of 34.0% illustrates that the morphological details are partially resolved by the CNN.In cases in which the CNN is not able to extract the information about the morphological details, it is unclear whether the information is present at all in the EAP or an increased number of cell models could solve the problem.In  B7.
conclusion, the results show promise for a more refined classification than only distinguishing excitatory cells from inhibitory cells.

Validation on Different Models
To investigate how general the trained CNN models are, we tested the performance of localization and classification on simulated EAPs from other neuronal models, namely, the cell model from Hay et al. (2011) and the models from the Allen Brain Institute (ABI) cell type database (Gouwens et al. 2018; http://celltypes.brain-map.org).For the following results, we used the SqMEA-10-15 probe, CNNs of size L, and NaRep and FW feature sets for localization and classification, respectively.
Hay model.The Hay cell models a neocortical pyramidal cell from L5b, and the techniques used to build the models were similar to the models from the NMC Portal.Therefore, we expect a relatively good performance in localization and classification with the CNNs trained on our standard NMC data sets.We built a Physrot data set of Hay cells consisting of 1,000 observations at random locations around the probe as described in Simulated Recordings, and we then evalu-ated the performance of the CNNs in localization and classification.
For localization, the average error on the Hay data set is 8.7 Ϯ 6.6 m, perfectly in line with the average errors of TTPC models in the NMC validation data set (Fig. 7A).The average error over all cell types in the NMC validation data set is 7.8 Ϯ 6.3 m.For classification, we obtain an average accuracy of 76.4%, while the accuracy on the NMC validation set is 98.0%.The lower accuracy could be due to the fact that the Hay model includes other types of mechanisms, such as active calcium channels in the apical dendrites, that are not modeled in the NMC cell models.
Allen Brain Institute models.The cell models from the ABI that we selected are quite different from the NMC cell models at least for two reasons.First, the ABI neurons are from mice, whereas the NMC cells are from juvenile rats.Second, they are from visual cortex (19 cells) and postrhinal area (1 cell), whereas the NMC models are from somatosensory cortex.With CNNs trained on NMC data are expected to have lower accuracy when applied to the ABI data.We generated 1,000 EAPs for each of the 20 ABI cell models, according to the description in Simulated Record- Fig. 10.Comparison with standard clustering approaches for excitatory/inhibitory classification.Each point is 1 spike in the FW-plane; red dots are inhibitory cells, and blue dots are excitatory cells.A: true excitatory/inhibitory classes: it is evident that there is some overlap between the groups in the center of the plot.B: k-means prediction: k-means clustering splits the data in 2 groups and cannot untangle the overlap (accuracy: 80.3%).C: mixture of Gaussians (MoG) prediction: MoG clustering correctly finds the diagonal shape of the excitatory population, but it cannot separate the overlap either (accuracy: 76.5%).D: convolutional neural network (CNN) prediction: the CNN output is able to correctly predict almost all the observations, with an accuracy of 98.0%.FWHM, full-width half-maximum.
ings.We used the 3drot alignment because of the variability in the ABI cell models' orientation (for details about selection of cell models see APPENDIX A).We ran the 3drot CNN for localization on the ABI data set, obtaining an average error of 19.3 Ϯ 11.5 m, larger than the 8.9 Ϯ 8.2 m obtained on the NMC validation set as expected.For classification, we distinguished excitatory and inhibitory cells in the ABI data set based on mouse transgenic lines (details in APPENDIX A).With the CNN for classification trained on NMC models, the average accuracy is 76.9%, while it is 97.6% on the NMC validation data set.
Since the cell models of the ABI come from a different species and are from a different cortical region, we trained a CNN on this data set only-16 models are used for training and 4 for validation (APPENDIX A).We used a CNN of size L and NaRep features, obtaining a localization average error of 5.9 Ϯ 4.5 m, which is in line with the performance we obtained on NMC models only.We did not run classification with so few models (only 20 cell models in total), because the CNNs need a larger diversity to find general features related to excitatory-inhibitory types (using the NMC data we trained on 192 cell models).

Test on Experimental Data
Although the method proposed in this report is at a proofof-concept stage, we tested some CNNs trained with simulated data on experiments at least for plausibility.
We decided to use data (publicly available at http:// www.kampff-lab.org/validating-electrodes)from paired juxtacellular and extracellular recordings (Neto et al. 2016) where, to a certain extent, the ground-truth location is known.The extracellular signals are measured with either the Neuronexus or the NeuroSeeker probe.Taking the amplitude threshold of our CNN training simulation (peak-to-peak amplitude of 30 V on at least 1 electrode) into account and considering only cells in front of the MEA, we were left with 10 data sets (see APPENDIX A for further details).After performing juxtacellulartriggered averaging, we fed the average EAP waveform into the CNN and predicted the soma position.The CNNs were trained with simulated data having the appropriate geometric alignment (MEA probes are rotated by Ϫ48.2°along the y-axis).On average the prediction error is 42.2 Ϯ 16.8 m, assuming the true soma position is the tip of the juxtacellular probe.The experimentally determined positions for the x-, y-, and z-coordinates range from 27 m to 129 m, Ϫ48 m to 6 m, and -121 m to 21 m, respectively.Neto et al. (2016) report a distance uncertainty of 10.5 Ϯ 5.2 m.This uncertainty only applies to the tip position of the juxtacellular probe with respect to the MEA, but it is a drastic assumption to consider this position equal to the soma position (center of the soma).Without neglecting the soma diameter, one might favor a larger distance error by adding a soma radius uncertainty of 10 -20 m.Furthermore, the uncertainty of 10.5 Ϯ 5.2 m reflects misalignment errors caused by the manipulators used in the experiment and is investigated under free conditions, i.e., without entering any brain tissue.Additional misalignment originating from entering brain tissue with the probes or from the brain's pulsation due to breathing of the mouse are not taken into account.
Figure 12 shows predicted vs. true coordinates.Figure 12A demonstrates that the predicted soma distances (x-coordinate) are in a plausible range.Overall, the distance x is predicted with a mean error of 20.3 Ϯ 16.1 m.The y-and z-positions are in the same range of precision (22.3 Ϯ 13.1 m and 22.9 Ϯ 15.7 m, respectively, on average).Note that the horizontal error bars in the plots only represent the uncertainty due to the misalignment of the manipulators as reported by Neto et al. (2016) and all other previously mentioned uncertainties (which are not quantified) are not considered.

DISCUSSION
This work provides a deep learning approach for neuron localization and classification based on MEA recordings.We simulated in vivo-equivalent EAPs and built data sets for various probe designs, using a multitude of cell models from the NMC Portal (205 cell models from Ramaswamy et al. 2015).CNN models trained on these simulated spikes predict the soma position of the neuron and characterize whether it is excitatory or inhibitory.The accuracy depends on the neuron-MEA alignment, the specific cell types, the CNN size itself, and the input feature sets.For completeness, we compared the proposed method with existing strategies regarding both localization and classification of recorded spikes, we validated on cell models from other databases, and we tested the models on publicly available experimental data.

Localization
We showed that the CNN method is robust and accurate in predicting the 3D soma location from spikes generated by neurons with a physiological neuron-MEA alignment (Physrot, defined in Neuron-MEA alignment).The average errors are on the order of 7.6 -11.7 m for all probes involved in the study (Fig. 8E).We demonstrated the CNN approach to be robust with different cell models and to be able to generalize among cell types not used for training (BP and NGC).Finally, local- ization performances achieved with our approach are significantly better than solving the inverse problem with various generative models.With a SqMEA-10-15 data set the total error in three dimensions was 21.7 Ϯ 20.9 m for the monopolar current source, 15.6 Ϯ 15.2 m for the bipolar current source, and 22.6 Ϯ 23.2 m for the ball-and-stick model, whereas with our CNN approach we obtained an error of 7.8 Ϯ 6.3 m (as shown in Fig. 9).
In a recent study (Delgado Ruz and Schultz 2014), a Neuronexus-32 probe was used (shown in Fig. 3) with a modified ball-and-stick model to solve the inverse problem.For the five cell types considered in Delgado Ruz and Schultz (2014), they reached average errors of 6.26 Ϯ 6.10 m, 6.03 Ϯ 7.68 m, and 2.58 Ϯ 4.75 m along the x-, y-, and z-axes, respectively.Using the same probe on our Physrot data set, we obtained with CNNs average errors of 4.1 Ϯ 4.5 m, 4.3 Ϯ 4.7 m, and 4.3 Ϯ 5.3 m for the x-, y-, and z-axes, respectively (with NaRep feature set and size L).

Classification
The deep learning method was applied to excitatory/inhibitory classification with accuracies above 96.6% for all employed MEA models using the FW feature set and a CNN size L.An almost perfect outcome of 99.7% was obtained with the Waveform features on the SqMEA-10-15 probe.Compared with standard strategies using spike widths extracted from the spike shape, it showed a significant improvement (k-means clustering: 80.3%, MoG: 76.5%; Comparison with Other Approaches).We also attempted to distinguish among 11 cell morphologies (m-type classification).The overall accuracy of 34.0% is substantially better than the chance level of 9.1%.It is interesting to see that m-type classification performs a sort of unsupervised learning, as inhibitory cells were classified as inhibitory in 100.0% of the cases and excitatory cells as excitatory 95.7% of the time.

Overfitting and Stability
When evaluating the predictions of our CNNs on the validation data set, we observed a drop in accuracy compared with training accuracy.The drop is in an acceptable range for excitatory/inhibitory classification (0 -3% with respect to the training accuracy) and localization (up to 3.7 m prediction error increase).In case of m-type classification, the validation accuracy drops ~65% compared with training accuracy, clearly indicating overfitting.Since we do not have enough diversity in cell model data to build a third data set for implementing early-stopping regularization (i.e., stop training as soon as the generalization error increases), we tracked the evaluation accuracies depending on the number of training epochs.In most cases, they reached a plateau after roughly 2,000 training epochs and did not decrease significantly afterwards, while training accuracies still increased.Therefore, we decided to stop training after 2,000 training epochs, assuming that the CNN has extracted most of the generalizable information provided by the EAPs at that point.Moreover, we tried to quantify the stability of the performance depending on different initial weights before the optimization process.To do so we ran the CNN training for localization and classification (on the SqMEA-10-15 probe, CNN size L, and with NaRep and FW features, respectively) six times with different random seeds.We obtained an average mean error of 7.6 Ϯ 0.1 m with an average SD of 6.3 Ϯ 0.2 m for localization (including the BP and NGC models) and an average mean accuracy of 97.9% with a SD of 0.2% for classification, indicating that performance is not dependent on the initial conditions of network weights and the convergence is robust.

Model-Based Approach
The findings presented in this study are based on simulations.Although this might be regarded as a limitation, we want to stress that the proposed method makes use of highly detailed cell models (Markram et al. 2015) and the complexity of such models is maintained and learned by CNNs.Previous approaches to localization and/or classification relied on simple forward models to solve the inverse models-monopolar, bipolar, ball-and-stick models, etc. (Blanche et al. 2005;Delgado Ruz and Schultz 2014;Somogyvári et al. 2012).We showed that CNNs outperform these models in estimating the soma positions.Another point that plays in favor of the use of neural simulations is the difficulty in gathering ground-truth data experimentally.Localizing and classifying neurons in real recordings requires advanced and highly accurate equipment, and the recorded labeled data would most likely still not be sufficient to train data-hungry machine learning algorithms such as CNNs.Nevertheless, validation on experimental data is definitely a required step and will be based on combined approaches with paired electrophysiological recordings and standard microscopy (Neto et al. 2016), or even involving more sophisticated and precise imaging techniques, such as twophoton imaging (Göbel and Helmchen 2007), which was paired with electrophysiological recordings in vivo in Shew et al. (2010).Paired electrophysiology and two-photon microscopy data, possibly in combination with intracellular voltage monitoring through patch clamping or voltage-sensitive dyes, could also represent a valuable tool to further validate and improve the forward modeling schemes, providing morphological, intracellular, and extracellular recordings simultaneously.Another advantage of using forward modeling is that the performance of the machine learning algorithm could be improved by building case-specific data sets that better match the real experimental scenarios.In this work, we assumed that the simulated probe was inserted in L5 of somatosensory cortex of a juvenile rat with a vertical insertion angle.However, somatosensory cortex can present large differences with respect to other brain regions (e.g., hippocampus, cerebellum, or other cortical regions) but also among animal species.Therefore, we do not envision a single universal model to localize and classify neurons but species-and brain area-specific CNNs to accurately deal with variability in neuronal types and functions.For example, when we fed mouse data from the ABI database, the localization CNN trained on rat data performed relatively poorly (19.3 Ϯ 11.5 m; Allen Brain Institute models), but trained on mouse models the performance is in line with what we obtained on rat data (5.9Ϯ 4.5 m).

Effect of Probe Design
Regarding neural probes, a forward modeling-based approach can give important insights for the design and manufacturing of next-generation probes.For example, our results showed that even relatively low-density probes, such as the SqMEA-5-30, despite performing slightly worse than higherdensity probes, still yield high accuracy in localizing and classifying neurons.Potentially, the pursuit of extremely highdensity probes, which makes the design complicated and the data throughput very high, is not required for classification and localization tasks [although it might still be important for spike sorting (Franke et al. 2012;Rossant et al. 2016)].However, for such simulation-driven MEA design, the simulations lack a more accurate electrode model considering finite size recording sites (in this work we used an ideal point electrode), electrode impedances, and transfer functions.

Future Extensions
The generative model for spike simulations could be improved in various ways.A straightforward improvement to obtain more accurate simulations could be including the MEA scar in the data generation, by clipping or bending neuronal branches in the proximity of the probe before simulating the recordings.Another refinement might be to take into account the finite size effects of the electrode contacts by means of the disk-electrode approximation (Lindén et al. 2014), which was shown to be appropriate for current sources positioned at distances larger than the contact radius (Ness et al. 2015).Moreover, here we assumed a tissue with homogeneous and isotropic electrical properties, but experimental findings suggest that in the cortex the conductive properties of the extracellular space are anisotropic (Goto et al. 2010).Anisotropy could be easily taken into account for the simulation of spikes (Ness et al. 2015;Pettersen et al. 2012).As the proposed approach strongly relies on high-fidelity simulations that reliably describe the neuron dynamics and volume conduction, another strategy could be using finite element method-based models, as in Agudelo-Toro and Neef (2013), Pods et al. (2013), andTveito et al. (2017), which would result in more detailed simulations at the cost of a much higher computational cost for data generation.Another layer of modeling is the electrode-tissue interface.The generated data should include electrical properties of the electrodes, such as the impedance, and account for their variability in experimental scenarios.In this work, we used polytrodes with a relevant size with respect to the neuron: although we assumed a homogeneous medium, the presence of the probe itself represents an obstacle for electrical signal propagation and can be modeled with either finite element method or analytical simplifications, such as the method of images (Ness et al. 2015).
In this work, we did not include any noise in the simulated recordings.The rationale behind this choice is that sorted spikes can be cleaned by applying spike-triggered averaging.With spike-triggered averaging, additive random noise is reduced by a factor of ͙ N, where N is the number of occur- rences of the sorted unit.Moreover, a common problem in spike sorting is electrode drift, in which the relative position between a neuron and the recording electrodes changes during the experiment.If drifting is detected from the spike sorting algorithm, one could feed different averaged EAPs computed in separate time windows and evaluate the drift over time, similarly to Delgado Ruz and Schultz (2014), in which windows of 5 min were used to compute the mean EAP.
Furthermore, the recording site area affects the amount of noise in the recordings, as the recording area is related to the impedance of the electrode.Here we assumed perfectly sorted spikes, from which a clean EAP can be computed.Clearly, with experimental data errors in spike sorting would affect the performance of localization and classification due to distorted waveforms from wrong assignments.

Outlook
Precise neural localization and classification from in vivo extracellular recordings has the potential of making electrophysiology an even more powerful technique to interact with neural tissue.Rather than only extracting spike trains, we could build a 3D representation of the recorded units and perform functional electrical imaging to study the spatial interactions among different cell types in neural microcircuits.On top of this, a precise localization of neuronal somata might enable the use of highly selective electrical stimulation patterns (Buccino et al. 2016) and represent an advancement in single-neuron stimulation from extracellular probes.
We strongly believe that computational approaches must go hand in hand with experimental ones, and an extension of this work might include the simulations of the entire pipeline from simulated MEA recordings, for example, with VISAPy (Hagen et al. 2015), to electrical stimulation including spike sorting, localization, classification, electrical stimulation, and evaluation of its effect on detailed neural morphologies.

Neocortical Microcircuit Collaboration Portal Data Set
In this appendix we discuss the data set and the modifications that we applied to make sure that that training and validation set are completely disjointed.
In Markram et al. (2015), to extend the number of reconstructed models, an algorithm is used to clone morphologies: neural compartments are randomly scaled and rotated with respect to each other.Moreover, morphologies are also stretched and shrunk to make up new morphologies.We identified 54 different morphologies in the data set, listed in the Supplemental Material for this article.Although the cloned and/or scaled morphologies are indeed different than the original ones, their shape is quite similar.
The use of CNNs, which are among the most powerful machine learning algorithms, pushed us to pay particular attention in the training-validation splitting so that no information of the validation set is present in the training set (leakage).Hence, the presence of a cloned/scaled version of the same morphology in both training and validation has been avoided.We selected training and validation cell models so that all morphologies in the validation set are unique.In doing so, we had to remove all instances of BP and NGC from the training set, as all the models are derived from the same reconstructed morphology.For localization and excitatory/inhibitory classification, we kept a BP and an NGC model in the additional validation set.
After the manipulation, the training set consists of 192 cell models, while the validation set only contains 11 cell models, one for each m-type.Moreover, we use one BP and one NGC model, not used for training, as further validation.In total, we included 205 neuronal models out of the available 260.The cell models are listed in the Supplemental Material.

Allen Brain Institute Data Set
From the Allen Brain Institute cell type portal (http://celltypes.brain-map.org/data),we selected cell models according to three criteria: 1) cells were from mice, 2) cells were from L5 (to maintain consistency with the data from the NMC Portal), and 3) cells had an all-active model.This search reduced the number of cell models available to 42.During the simulation process, we further discarded 22 models based on two extra rules: 1) if adjusting the current-clamp amplitude to the soma could not induce a number of spikes between 10 and 30 in 10 iterations (in which the weight was multiplied by 0.75 if the number of intracellular spikes was Ͼ30 and by 1.25 when Ͻ10 spikes were detected) and 2) if Ͻ5 EAP peaks had a peak-to-peak threshold of 30 V in 500 random positioning of the neuron around the probe (meaning that the EAPs were mainly below the defined detection threshold).After this pruning, 20 cell models are left.To distinguish between excitatory and inhibitory cells, we used the transgenic line information: Pvalb, Sst, Htr3, and Gad2 lines were considered inhibitory; Rbp4, Scnn, and Rorb were considered excitatory (Gouwens et al. 2018).After this division there were 11 inhibitory and 9 excitatory cell types.
To avoid overfitting, we randomly selected 4 models, 2 excitatory and 2 inhibitory, and we set them aside for validation, while we used the remaining 16 neuronal models for training.
The cell models are listed in the Supplemental Material.

Kampff Laboratory Data Set
Accompanying their article on paired juxtacellular-extracellular recordings (Neto et al. 2016), the laboratory of Adam Kampff publicly offers the data on http://www.kampff-lab.org/validating-electrodes.To extract an averaged extracellular waveform for each cell that can be fed into a trained CNN, some data processing was necessary.First, we detected spikes in the juxtacellular probe by thresholding the signal to get the cell's spike times.Second, we high-pass filtered the extracellular MEA recording with a third-order Butterworth filter in forward-backward mode with a band pass of 100 -14,250 Hz.Afterwards, we averaged the EAP in windows of 7 ms around the spike times (2 ms preceding and 5 ms after the peak).This average waveform was then referred to as the juxtacellular-triggered average and was used as input for CNN predictions.After this preprocessing, 10 of the 29 available data sets fulfilled the criteria of having a peak-to-peak amplitude of 30 V on at least one electrode and being in front of the extracellular probe (2014_03_26: Pair 2.

APPENDIX B: ADDITIONAL INFORMATION
This appendix contains additional information on parameters and results.
Table B1 shows the specific CNN parameters for different network sizes.
The average localization errors and SDs for different rotational data sets are contained in Table B2, and the corresponding statistical analysis is depicted in Table B3.
Further results on significant differences and effect sizes of localization performances for different CNN sizes, features, MEA probes, and localization methods are listed in Tables B4, B5, B6, and B7, respectively.
The excitatory/inhibitory classification accuracies grouped by cell type for different rotational data sets, CNN sizes, feature sets, and MEA probes are shown in Tables B8, B9, B10, and B11, respectively.Values (in m) are average Ϯ SD errors along x, y, and z dimensions and total errors grouped by rotational data sets.The average of total error is computed over the 3-dimensional distances and is not derived from the mean x, y, and z errors.

Fig. 5 .
Fig. 5. Feature extraction pipeline: first, neuronal models (a pyramidal cell here) are simulated (A) and extracellular action potentials (EAPs; B) are computed on the multielectrode array (MEA) probe (SqMEA-10-15 here).Features (D) are then extracted from EAPs.Localization features are sodium negative peak (Na) and repolarization (positive) peak (Rep), and classification features are peak-to-peak amplitude (A), spike width (W), and full-width half-maximum width (F).The feature images (C) are finally used as input for convolutional neural networks.

F
Fig.6.A: x errors with respect to the x-coordinate (distance).B: y errors with respect to the y-coordinate.C: z errors with respect to the z-coordinate.D: x predictions with respect to the true x-coordinate (distance).E: absolute value of the y predictions with respect to the true y-coordinate.F: absolute value of the z predictions with respect to the true z-coordinate.All values are in m.Orange lines indicate the Norot data set, green lines the Physrot data set, and purple lines the 3drot data set.Gray lines correspond to a perfect prediction.Data are binned in 7 bins along x-, y-, and z -directions: points and error bars display the average errors and their SDs for each bin and each data set.

Fig. 7 .
Fig. 7. A: localization error grouped by cell type for all rotational data sets.Bars are the average errors, and error bars show the SDs in m.Orange bars display the Norot data set, green bars display the Physrot data set, and purple bars display the 3drot data set.Ticks on the x-axis show the cell types: red ticks are inhibitory cells, blue ticks are excitatory, and yellow ticks are bipolar (BP) and neuroglial (NGC) cells (not used for training).B: excitatory/inhibitory classification accuracy in color code grouped by rotation and cell type.Each element of the matrix is the accuracy of a specific cell type (red, inhibitory neurons; blue, excitatory neurons) in the different rotational data sets (rows).For explicit values see TableB8.C: spike shapes for maximum peak electrode sites are plotted in the FW-plane.Red dots are inhibitory neurons, and they lie in the lower left part of the plot.Blue dots show excitatory cells, in the upper right part of the plane.The magenta dots are chandelier cells (ChC), and they lie at the intersection between excitatory and inhibitory cells.The opacity represents the number of occurrences; more opaque dots correspond to more frequent observations.The ellipses are built by following the principal axis of each distribution, and the lengths of their major and minor axes are proportional to the eigenvalues of the covariance matrix.FWHM, full-width half-maximum.

FFig. 8 .
Fig. 8. A, C, and E: error boxplots grouped by convolutional neural network (CNN) size (A), feature type (C), and probe type (E).Red diamonds represent means, and black diamonds denote outliers.B, D, and F: cumulative distribution function (cdf) grouped by CNN size (B), feature type (D), and probe type (F).Gray dashed horizontal line at 0.5 defines the median.Diamonds on each curve show the means (their x values are the average error; the y values are the percentile of their occurrence).The error values are clipped to 20 m to zoom in the distributions.Statistical analyses are shown in TableB4(for A and B), TableB5(for C and D), and TableB6(for E and F).

Fig. 9 .
Fig. 9. A: error boxplots grouped by model used to solve the inverse problem.Red diamonds are means; black diamonds are outliers.B-A-S, ball-and-stick; CNN, convolutional neural network.B: cumulative distribution function (cdf) grouped by localization method.Gray dashed horizontal line at 0.5 defines the median.Diamonds on each curve show the means (their x values are the average error; the y values are the percentile of their occurrence).The error values are clipped to 50 m to zoom in the distributions.probability density functions (pdfs) of predicted values by different models and true values in the x (C)-, y (D)-, and z (E)-directions.Statistical analysis is shown in TableB7.

Fig. 12 .
Fig. 12. Soma position predictions of a convolutional neural network (CNN) based on experimental extracellular action potential (EAP) recordings.Experimental data are from paired juxtacellular-extracellular recordings (Neto et al. 2016) where the position of the soma is associated with the tip position of the juxtacellular probe.The CNN (size L, NaRep feature) is trained on simulated (3drot) EAP signals.Error bars are CNN prediction uncertainties for the predicted coordinates and 4.2 m, 2.8 m, and 8.5 m (misalignment uncertainties reported by the experimenters of Neto et al. 2016) for true x-, y-, and z-coordinates, respectively.

Table 1 .
Inverse model parameters s , r s x , r s y r s z 4 Bipolar I s , r pos x r pos y r pos z , r neg x , r neg y , r neg z 7 Ball and stick I s , r s x , r s y r s z , d x , d y , d z , I dend , d len , d 10 Summary of parameters for the different inverse models involved in the study.I s , current source; r s , predicted soma position; r pos , position of positive I s ; r neg , position of negative I s ; I dend , dendritic current; d len , dendritic length; d , decay constant.

Table 2 .
Model parameter summaryRange for initialization and constraint and standard deviation for mutation Gaussian for the different parameters.I s , current source; I dend , dendritic current; d len , dendritic length; d , decay constant.

Table B1 .
CNN sizesConvolutional neural network (CNN) parameters of the different network sizes: layer 1 convolutional kernel size k 1 , layer 1 convolutional kernel depth d 1 , layer 2 convolutional kernel size k 2 , layer 2 convolutional kernel depth d 2 , and nodes in fully connected layer n FC.

Table B2 .
Localization errors grouped by rotational data set

Table B3 .
Localization by data set rotation: statistical analysis Statistical analysis for localization errors grouped by rotational data set.Each entry shows Cohen's d and significance of the test column group Ͻ row group.***P Ͻ 0.001.ns, Not significant.

Table B4 .
Localization by CNN size: statistical analysis Statistical analysis for localization errors grouped by convolutional neural network (CNN) size.Each entry shows Cohen's d and significance of the test column group Ͻ row group.***P Ͻ 0.001.ns, Not significant.

Table B5 .
Localization by feature type: statistical analysis Statistical analysis for localization errors grouped by feature type.Each entry shows Cohen's d and significance of the test column group Ͻ row group.Na, extracellular action potential (EAP) negative peak (mainly attributed to sodium currents flowing into soma); Rep, EAP positive peak (associated with cell repolarization phase); NaRep, stacked version of Na and Rep; Waveform, downsampled EAP waveforms.***P Ͻ 0.001, **P Ͻ 0.01.ns, Not significant.

Table B7 .
Localization with different methods: statistical analysis Statistical analysis for localization errors grouped by localization method.Each entry shows Cohen's d and significance of the test column group Ͻ row group.B-A-S, ball and stick; CNN, convolutional neural network.***P Ͻ 0.001, *P Ͻ 0.05.ns, Not significant.

Table B6 .
Localization by MEA probe: statistical analysis Statistical analysis for localization errors grouped by probe type.Each entry shows Cohen's d and significance of the test column group Ͻ row group.MEA, multielectrode array.***P Ͻ 0.001, **P Ͻ 0.01.ns, Not significant.