Pittsburgh Brain Competition

In 2009 I achieved second place in challenge 1 of the Pittsburgh Brain Competition (PBC), Fall 2009. Here I outline my approach.

Challenge Primer

There are three PBC data sets, one for each of three brains. The brains were scanned in an MRI scanner, and, using a technique called diffusion tensor MRI millions of individual axon fibres in the brain can be distinguished and located. These fibres make up the white matter of the brain and act as conduits connecting up the grey matter on the surface layers of the brain where most processing is thought to occur.

Here is a short video that shows a small fraction of the scanned fibres, the fibre colouring is a simple scheme based on fibre midpoint position and orientation and allows distinct bundles of fibres to be seen.

Each brain data set then consists of approximately 250,000 fibres, and for each one a series of 3D points describes the path of the fibre. These 250k fibres represent only a small fraction of the total number, the data set is thinned out to make it more manageable while still remaining representative of the fibre paths, bundles and relative densities fop each bundle. Also note that another level of thinning out is performed to generate the visualizations to allow fast(er) rendering and clearer visualization. Hence the above video shows only a very small fraction of the total number of fibres in the scanned brain.

Brain #1 Data Set Statistics

Statistic	Value
# of fibres (aka tracks).	250,000
Total number of points in all fibres.	19,296,916
Average # of points per fibre.	77.187664
Fewest points in a track.	30
Most points in a track.	251
Combined length of all fibres.	16.21 km
Shortest fibre.	24.62 mm
Longest fibre.	213.19 mm
Shortest distance between adjacent fibre points.	0.8488 mm
Longest distance between adjacent fibre points.	0.8488 mm

So there are 16.21km of fibres in the data. A typical brain will contain a total of around 80,000 - 180,000km of fibres [White matter] depending on subject, gender and age, hence our data set represents somewhere in the region of just 0.01% of all fibres in the scanned brain.

Challenge Goal

The above video visualisation has coloured fibres based on some simple factors such as the position of a fibre's middle and the direction of the fibre at the middle. This is a fairly crude colouring technique that nonetheless is effective at highlighting distinct bundles of fibres, that is, fibres are arranged in bundles with bundle endpoints connecting up two regions of the brain. However, this method of identifying bundles is crude and no match for a human expert. Challenge #1 of the PBC is to find an algorithm that can group fibres into bundles. The metric that determines the quality of a bundling algorithm is based on how similar the bundles it creates are to those determined by a human expert.

Approach

Comparing generated bundles with those chosen by a human expert is simple to implement but clearly leaves the question of what is it that the expert is doing that is better than our best algorithms, and, what what biases are being introduced by a human expert. Do all experts bundle the fibres in the same way? And if not would they each agree that all of their solutions were equally good or would they disagree?

A better approach might be to define a mathematical definition of bundling quality based on an inter-fibre distance metric and minimizing inter-fibre distance variance in each bundle. In fact this was my approach to automating the bundling process, bundles are essentially clusters of fibres that are close together and as known clustering algorithms can be applied. I chose to use k-means clustering as it is simple to implement whilst typically yielding results that are similar in quality to more complex methods.

To use k-means we need a distance metric, that is, for any given two fibres we need a means of measuring the distance between them. Some approaches are:

Distance between fibre mid-points. This ignores fibre orientation, e.g. the fibres may follow entirely different paths but happen to cross over in the middle.
Combined distance between endpoints. This poses the problem of which endpoints should be compared, e.g. if we have fibres A and B with ends 1 and two then should we compare A1,B1 + A2,B2 or A1,B2 + A2,B1. This is easily resolved by performing both comparisons and selecting the shortest distance. An additional problem here is that bundle ends to splay out, like the end of a frayed piece of string. Thus two fibres may be very closely aligned for the majority of their lengths and then diverge significantly at the ends.
Loop over each point in fibre 1 and for each point find the closest point on fibre 2. Take the average distance between points. This approach reasonably well but is slow. However it can be sped up without significant loss of quality by comparing only a subset of representative points along the length of fibre 1.
I further refined (3) as follows. Translate both fibres so that their mid points are co-located (e.g. at position 0,0,0). Comparing the fibres as in (3) now gives us a fibre orientation similarity measure. And the distance between their mid points (prior to translation) gives us a distance measure. We have effectively divided (3) into two separate measured that can now be weighted. By trial and error a weighting was found that performed better than (3) at funding bundles that similar to the expertly chosen ones (the challenge scoring method). Motivation for this approach was the observation that some bundles are elongate surfaces/manifolds rather than compact sphere shapes, e.g. one bundle was made up of horseshoe shaped fibres arranged as a long half-tube shape; Thus fibres at either end of the tube are actually distant from each other but have very similar shape and orientation. This is the classic problem in clustering of non-spherical clusters, e.g. sausage shaped clusters are common, and our half-tube bundle does in fact describe a sausage-like shape in the space described by our distance metric.

Approach (4) still contains some crude elements most notably translating the fibres based on their mid points (as determined by point position in list of the fibre points), however this works reasonably well, partly because points are more or less equally distanced along the fibres, thus the mid point in the list of points will generally be very close to the actual mid point along a fibre's length.

K-medoids

k-medoids is a variant on k-means in which the cluster centres (centroids) are restricted to be one of the data points in the cluster, as opposed to the coordinate that represents the centre of all the data points in the cluster. This is useful if no convenient coordinate system is available for representing such a mid-point. My approach falls into this category, I have a distance metric that defines a distance between fibres but no actual coordinate system to represent the positions of fibres in that space. As such we can define one of the fibres as the centroid and this then becomes known as a medoid, thus my algorithm is actually k-medoids. The centroid fibre is defined simply as the one that has the shortest average squared distance to all other fibres in the cluster, this selects the fibre that results in the lowest variance for the cluster (the lowest squared distance from each fibre in the bundle to the bundle centre/medoid).

Challenge Scoring

For challenge #1 scoring the identified bundles of fibres are compared to those identified by an expert. For comparison bundles were numbered and initially you had to be careful to assign the right number to the submitted bundles. Later the scoring was changed to automatically select bundles for comparison by selecting the submitted bundle closest to each of the expert's bundles (with the caveat that a submitted bundle can't be selected more than once).

Scoring for each bundle was based on how many fibres you had correctly classified and misclassified. So in a submitted bundle there would be H correctly classified fibres (bundle 'hits') and M misclassified fibres (bundle 'misses'). The score for the bundle is then given by:

score = (H - M) / T

Where T is the total number of fibres in the expert's bundle. Hence max score per bundle is 1.0. Scores below 0 are truncated to 0. The challenge score was the average score for all bundles, of which there were eight.

Very possibly this scoring system can be improved, but for the purposes of the competition it worked well enough.

Results

I achieved 2nd place with a score of 0.3347 (ICDM PBC Fall 2009 Results - Summary Tables). This was some way behind the winner Vladimir Nikulin with a score of 0.5037. Nikulin detailed his approach in "Identifying fibre bundles with regularised k-means clustering applied to the grid-based data", Vladimir Nikulin, Department of Mathematics, University of Queensland.

Interestingly Nikulin also used k-means clustering but fairly crucially used a version with a form of regularization that attempts to create more stable clusters by avoiding very small unstable clusters and also very large clusters. Nikulin...

...There are two major problems here: stability of clustering and meaningfulness of centroids as cluster representatives. On the one hand, big clusters impose strong smoothing and possible loss of very essential information. On the other hand, small clusters are, usually, very unstable and noisy. Accordingly, they can not be treated as equal and independent represen- tatives. To address the above problems, we applied regularisation to prevent the creation of super big clusters, and to attract data to existing small clusters.

There appears to be good motivation for use of a regularized form of k-means in high dimensional data and the performance in this challenge is promising. For me this was probably the main lesson to be taken from this challenge.

Regularized k-means

Consider applying k-means to points in Euclidean space. Typically we take the Euclidean distance between two points, so for point p and q we have:

D = || p - q ||

What Nikulin did was to add a regularization term to the distance, so that we get:

D = || p - q || + R_c
R _c = (α·L·#c) / m

Where:

α - Regularization constant.
L - Maximum distance between any point and any of the current centroids.
#c - Membership size of cluster c.
m - Total number of all points.

So there is some degree of punishment against cluster size which would e.g. result in points that lie equidistant between clusters become assigned to the smaller cluster. L appears to be constant during any one iteration thus it seems to act as a means of reducing the regularisation term as clusters become more stable; perhaps a similar effect could be achieved by gradually reducing α?

From REGULARISED k-MEANS CLUSTERING FOR DIMENSION REDUCTION APPLIED TO SUPERVISED CLASSIFICATION (pdf), Vladimir Nikulin, Geoffrey J. McLachlan:

Regularised k-means clustering Stability in cluster analysis is strongly dependent on the data set, especially, on how well separated and how homogeneous the clusters are. Stability is a very important aspect in cluster analysis. Stability means that a meaningful valid cluster should not disappear easily if the data set is changed in a non-essential way [12]. On the one hand, big clusters impose strong smoothing and possible loss of very essential information. On the other hand, small clusters are, usually, very unstable and noisy. Accordingly, they can not be treated as equal and independent representatives. The target of the following below regularisation is to prevent creation of super big clusters, and to attract data to existing small clusters.

Also, this is somewhat related to fuzzy k-means and Expectation Maximization, etc. but with more efficient algorithmic time complexity.

Colin,
(some time in 2009)