Last updated: 2018-07-25

Code version: 2e33ca2

2018-01-11-mp-Cluster-Analysis

The main purpose of this analysis is to determine the optimum number of clusters.

In [2]:
%matplotlib inline

from os import getcwd, chdir
wd = getcwd()
chdir(wd + "/../py")

import perform_kmeans as pkm
import matplotlib.pyplot as plt
import os
import pandas as pd
import plot_segment
import find_opt_nclusters_seg_length
import find_n_neighbours as n_neigh
import cluster_analysis
import produce_labelled_cluster_files as pcf
from IPython.display import HTML
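The cell above changes the working directory so that the modules in ../py can be imported. A minimal alternative sketch, assuming the same relative layout, is to add that directory to sys.path and leave the working directory untouched. The helper name add_module_dir is hypothetical and not part of this project.

```python
import sys
from pathlib import Path


def add_module_dir(relative_dir, base=None):
    """Put a directory at the front of sys.path (once) and return its absolute path.

    relative_dir is resolved against base, or against the current working
    directory when base is None. The working directory itself is not changed.
    """
    base = Path.cwd() if base is None else Path(base)
    module_dir = str((base / relative_dir).resolve())
    if module_dir not in sys.path:
        sys.path.insert(0, module_dir)
    return module_dir
```

Calling add_module_dir("../py") before the import statements would stand in for the chdir call. Relative paths passed to later calls (such as path = "../../") would then resolve against the notebook's own directory, so they would need adjusting.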

First, let's calculate the silhouette scores for each segment length and n_clusters = 2, 3, ..., 10. (Please ignore that I am using import in the code to do this. I know this is not how it should be done; this is my first Jupyter notebook!)

In [4]:
find_opt_nclusters_seg_length.perform_silhouette_analysis(path = "../../",
                            segment_lengths = [100,150,200,250,300],
                            range_nclusters = range(2,11),
                            plot_silhouette = False)
Segment Length = 100
n_clusters = 2   Avg silhouette_score = 0.317796729966
n_clusters = 3   Avg silhouette_score = 0.341687140808
n_clusters = 4   Avg silhouette_score = 0.346224832219
n_clusters = 5   Avg silhouette_score = 0.357602967248
n_clusters = 6   Avg silhouette_score = 0.348248690983
n_clusters = 7   Avg silhouette_score = 0.333365674623
n_clusters = 8   Avg silhouette_score = 0.329332246855
n_clusters = 9   Avg silhouette_score = 0.329672774908
n_clusters = 10   Avg silhouette_score = 0.316093813797
Segment Length = 150
n_clusters = 2   Avg silhouette_score = 0.30115083885
n_clusters = 3   Avg silhouette_score = 0.323050538313
n_clusters = 4   Avg silhouette_score = 0.327693533818
n_clusters = 5   Avg silhouette_score = 0.351443626241
n_clusters = 6   Avg silhouette_score = 0.318577604574
n_clusters = 7   Avg silhouette_score = 0.31176763658
n_clusters = 8   Avg silhouette_score = 0.314112432766
n_clusters = 9   Avg silhouette_score = 0.29649945135
n_clusters = 10   Avg silhouette_score = 0.295541162504
Segment Length = 200
n_clusters = 2   Avg silhouette_score = 0.292174348124
n_clusters = 3   Avg silhouette_score = 0.331356647433
n_clusters = 4   Avg silhouette_score = 0.336193954096
n_clusters = 5   Avg silhouette_score = 0.356969107398
n_clusters = 6   Avg silhouette_score = 0.298492772502
n_clusters = 7   Avg silhouette_score = 0.306013368967
n_clusters = 8   Avg silhouette_score = 0.299447797526
n_clusters = 9   Avg silhouette_score = 0.296718574845
n_clusters = 10   Avg silhouette_score = 0.295691417276
Segment Length = 250
n_clusters = 2   Avg silhouette_score = 0.338817022163
n_clusters = 3   Avg silhouette_score = 0.340287721084
n_clusters = 4   Avg silhouette_score = 0.345053475954
n_clusters = 5   Avg silhouette_score = 0.291442119645
n_clusters = 6   Avg silhouette_score = 0.308546452992
n_clusters = 7   Avg silhouette_score = 0.301440664635
n_clusters = 8   Avg silhouette_score = 0.295342765367
n_clusters = 9   Avg silhouette_score = 0.259745996504
n_clusters = 10   Avg silhouette_score = 0.259865231627
Segment Length = 300
n_clusters = 2   Avg silhouette_score = 0.370702541521
n_clusters = 3   Avg silhouette_score = 0.333636137157
n_clusters = 4   Avg silhouette_score = 0.337334341625
n_clusters = 5   Avg silhouette_score = 0.324925840586
n_clusters = 6   Avg silhouette_score = 0.305482805773
n_clusters = 7   Avg silhouette_score = 0.269850826056
n_clusters = 8   Avg silhouette_score = 0.269353656552
n_clusters = 9   Avg silhouette_score = 0.272970477034
n_clusters = 10   Avg silhouette_score = 0.273360390918
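
For reference, the average silhouette score can be sketched in plain Python as below. For each point, a is the mean distance to the other members of its own cluster and b is the smallest mean distance to the members of any other cluster. The point's score is (b - a) / max(a, b), and the average is taken over all points. This is the standard definition with Euclidean distance, not necessarily the exact code inside perform_silhouette_analysis. The function name and the toy points are illustrative assumptions.

```python
from math import dist


def average_silhouette(points, labels):
    """Mean silhouette coefficient over all points, using Euclidean distance."""
    # Group point indices by cluster label.
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("the silhouette score needs at least 2 clusters")

    total = 0.0
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # common convention: a singleton cluster contributes 0
        # a: mean distance to the other members of the point's own cluster
        a = sum(dist(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(points)


# Toy data (an assumption for illustration): two tight, well separated groups.
toy_points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = average_silhouette(toy_points, [0, 0, 1, 1])  # close to 1
bad = average_silhouette(toy_points, [0, 1, 0, 1])   # negative
```

Scores near 1 indicate compact, well separated clusters, while scores near 0 or below indicate overlapping or misassigned points. That is why values around 0.3 to 0.37, as reported above, suggest only moderate separation.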

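Something intriguing is that the silhouette score is not unimodal in the number of clusters. At segment length 250, for example, it falls from 0.345 at 4 clusters to 0.291 at 5 and then rises again to 0.309 at 6.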
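
As a small sketch of how such modes could be located automatically, the snippet below lists the values of n_clusters at which the score is higher than at both neighbours (end points only need to beat their single neighbour). The scores are the segment length 250 values from the output above, rounded to four decimal places. The function name local_maxima is an illustrative assumption.

```python
def local_maxima(k_values, scores):
    """Return the k values whose score is strictly higher than every adjacent score."""
    peaks = []
    for i, score in enumerate(scores):
        left_ok = i == 0 or score > scores[i - 1]
        right_ok = i == len(scores) - 1 or score > scores[i + 1]
        if left_ok and right_ok:
            peaks.append(k_values[i])
    return peaks


k_values = list(range(2, 11))
# Segment length 250, n_clusters = 2..10, rounded from the output above.
scores_250 = [0.3388, 0.3403, 0.3451, 0.2914, 0.3085, 0.3014, 0.2953, 0.2597, 0.2599]

peaks_250 = local_maxima(k_values, scores_250)
print(peaks_250)  # [4, 6, 10]
```

The peak at 10 is only a very small uptick at the end of the range, so in practice a minimum rise could be required before counting a mode.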

We will now plot the 6 nearest segments to each cluster centre for the optimum silhouette score of each segment length.

In [5]:
n_nearest_neighbours = 6
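
The idea of picking the segments nearest to a cluster centre can be sketched as follows: sort the segments by Euclidean distance between their feature vector and the centre, and keep the first n. This is only an illustration under assumed inputs. The actual features and the internals of cluster_analysis.plot_n_nearest are not shown in this notebook, and the function name and toy vectors are hypothetical.

```python
from math import dist


def n_nearest_to_centre(features, centre, n):
    """Return the indices of the n feature vectors closest to centre."""
    order = sorted(range(len(features)), key=lambda i: dist(features[i], centre))
    return order[:n]


# Toy 2-D feature vectors (assumed for illustration only).
toy_features = [(0.0, 0.0), (5.0, 5.0), (1.0, 1.0), (0.5, 0.0)]
nearest = n_nearest_to_centre(toy_features, centre=(0.0, 0.0), n=2)
print(nearest)  # [0, 3]
```

Plotting the segments at these indices gives a view of the most typical members of each cluster, which is what the figures below rely on.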

100mm segments and 5 clusters

In [6]:
plt.figure(1, figsize=(12,10))
cluster_analysis.plot_n_nearest(seg_length = 100, n_clusters = 5, n_nearest_neighbours = n_nearest_neighbours)

This looks reasonably good. There are distinct differences between each of the 5 clusters.

100mm segments and 6 clusters

In [24]:
plt.figure(1, figsize=(12,12))
cluster_analysis.plot_n_nearest(seg_length = 100, n_clusters = 6, n_nearest_neighbours = n_nearest_neighbours)

Still not too bad, although perhaps there is some ambiguity between ClusterID 2 and ClusterID 4. ClusterID 4 is perhaps a little more attached to the edge of the arena and a little less spread out than ClusterID 2.

Let's look at some longer segment lengths now.

Before we do, though, it's probably worth making the point that, given the way the experiment was conducted, I think it's unlikely that we will see consistent repetitive patterns that occur over longer periods, at least patterns such as those that move around the edge, because the bee is reacting to a stimulus at regular intervals that disrupts the pattern.

150mm segments and 5 clusters

In [7]:
plt.figure(1, figsize=(12,10))
cluster_analysis.plot_n_nearest(seg_length = 150, n_clusters = 5, n_nearest_neighbours = n_nearest_neighbours)

This is actually not too bad. Visually, all clusters are different from one another, but there is good consistency within each cluster.

Let's try for another cluster.

150mm segments and 6 clusters

In [13]:
plt.figure(1, figsize=(12,12))
cluster_analysis.plot_n_nearest(seg_length = 150, n_clusters = 6, n_nearest_neighbours = n_nearest_neighbours)

For me, it's starting to get difficult to be clear about whether the additional cluster is helping. Let's try even longer segments. In fact, let's take a look at the longest, 300, and see if we can see a contrast.

300mm segments and 2 clusters

In [14]:
plt.figure(1, figsize=(12,4))
cluster_analysis.plot_n_nearest(seg_length = 300, n_clusters = 2, n_nearest_neighbours = n_nearest_neighbours)

I think it's very interesting to see the path consistently tracking the edge for clusterID 1. What is happening in clusterID 0 though? I would describe this as wandering behaviour. Let's have a look at an additional cluster for this length.

300mm segments and 3 clusters

In [15]:
plt.figure(1, figsize=(12,6))
cluster_analysis.plot_n_nearest(seg_length = 300, n_clusters = 3, n_nearest_neighbours = n_nearest_neighbours)

There does seem to be some difference between cluster ID 0 and cluster ID 2: ID 0 covers a greater area. Other than that, though, it's hard to be clear what the differences are.

Now let's look at length 200, which has the second-best silhouette score among the five segment lengths when k = 5 (0.357, just behind 0.358 for length 100).

200mm segments and 5 clusters

In [16]:
plt.figure(1, figsize=(12,10))
cluster_analysis.plot_n_nearest(seg_length = 200, n_clusters = 5, n_nearest_neighbours = n_nearest_neighbours)

Although there is less hugging of the boundary by cluster ID 0 than cluster ID 4, the two are not clearly different from one another. I think perhaps one fewer cluster would be better here and might possibly provide more distinct clusters. Let's try it and see what we get.

200mm segments and 4 clusters

In [17]:
plt.figure(1, figsize=(12,8))
cluster_analysis.plot_n_nearest(seg_length = 200, n_clusters = 4, n_nearest_neighbours = n_nearest_neighbours)

I think this is a pretty good clustering, and it looks to have merged the two similar clusters we found previously.

For completeness I think we should also look at the length 250. Here 4 clusters had the optimum silhouette value.

250mm segments and 4 clusters

In [18]:
plt.figure(1, figsize=(12,8))
cluster_analysis.plot_n_nearest(seg_length = 250, n_clusters = 4, n_nearest_neighbours = n_nearest_neighbours)

Conclusion

I believe there are several candidates for a good length and number of clusters. These are:

  • 100mm & 5 clusters
  • 150mm & 5 clusters
  • 200mm & 4 clusters
  • 250mm & 4 clusters
  • 300mm & 2 clusters

Although there are clear distinctions between the clusters for the shorter segment lengths, I'm unsure that such short segments contain enough information to deem them strategies.

The 300mm & 2 clusters option might not initially seem like a great choice, since we aren't identifying many types of behaviour. That being said, I think that bee behaviour is going to be simpler than the rodent behaviour, so perhaps this is the best we can do. There's also a chance that when we find more clusters for other segment lengths, what we are finding is just regular movement patterns that don't indicate a strategy. This line of thinking, however, is far from conclusive.

While the silhouette score has helped to guide the process, we can see here that using it blindly, particularly if we wanted to automate the process, might lead to a less than optimal clustering regime. This might be because the features have been unable to capture the behaviour accurately.
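
To make the comparison concrete, the sketch below takes the silhouette scores reported earlier (rounded to three decimal places), picks the best number of clusters per segment length automatically, and compares it with the candidates chosen after visual inspection. The variable and function names are illustrative assumptions.

```python
K_VALUES = list(range(2, 11))

# Average silhouette scores for n_clusters = 2..10, rounded from the output above.
SCORES = {
    100: [0.318, 0.342, 0.346, 0.358, 0.348, 0.333, 0.329, 0.330, 0.316],
    150: [0.301, 0.323, 0.328, 0.351, 0.319, 0.312, 0.314, 0.296, 0.296],
    200: [0.292, 0.331, 0.336, 0.357, 0.298, 0.306, 0.299, 0.297, 0.296],
    250: [0.339, 0.340, 0.345, 0.291, 0.309, 0.301, 0.295, 0.260, 0.260],
    300: [0.371, 0.334, 0.337, 0.325, 0.305, 0.270, 0.269, 0.273, 0.273],
}

# Candidates listed in the conclusion, chosen after looking at the plots.
CHOSEN = {100: 5, 150: 5, 200: 4, 250: 4, 300: 2}


def best_k(scores):
    """Return the number of clusters with the highest silhouette score."""
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return K_VALUES[best_index]


best = {length: best_k(scores) for length, scores in SCORES.items()}
disagreements = [length for length in SCORES if best[length] != CHOSEN[length]]

print(best)           # {100: 5, 150: 5, 200: 5, 250: 4, 300: 2}
print(disagreements)  # [200]
```

The only disagreement is at length 200, where the score favours 5 clusters but the plots suggested that 4 clusters gave more distinct groups.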

Produce the files for further analysis

To plot the proportion of the trajectory that exhibited the behaviour associated with the above clustering regimes, we need additional files. These files are then fed into a couple of R scripts to produce the plots.

In [30]:
pcf.create_file_with_cluster_membership(seg_length = 100, n_clusters = 5)
pcf.create_file_with_cluster_membership(seg_length = 150, n_clusters = 5)
pcf.create_file_with_cluster_membership(seg_length = 200, n_clusters = 4)
pcf.create_file_with_cluster_membership(seg_length = 250, n_clusters = 4)
pcf.create_file_with_cluster_membership(seg_length = 300, n_clusters = 2)
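
The plots themselves are produced in R, and the format of the files written above is not shown here. Purely as an illustration of the quantity being plotted, the sketch below computes the proportion of a trajectory spent in each cluster from a list of per-segment cluster labels. It assumes equal-length, non-overlapping segments, so that the share of segments equals the share of the trajectory. The function name and the toy labels are hypothetical.

```python
from collections import Counter


def cluster_proportions(segment_labels):
    """Return {cluster label: fraction of segments assigned to that cluster}."""
    counts = Counter(segment_labels)
    total = len(segment_labels)
    return {label: counts[label] / total for label in sorted(counts)}


# Toy labels for one trajectory split into 8 segments (assumed data).
toy_labels = [0, 1, 1, 0, 1, 1, 1, 0]
proportions = cluster_proportions(toy_labels)
print(proportions)  # {0: 0.375, 1: 0.625}
```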

Session information

sessionInfo()
R version 3.4.3 (2017-11-30)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

Matrix products: default

locale:
[1] LC_COLLATE=English_United Kingdom.1252 
[2] LC_CTYPE=English_United Kingdom.1252   
[3] LC_MONETARY=English_United Kingdom.1252
[4] LC_NUMERIC=C                           
[5] LC_TIME=English_United Kingdom.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] compiler_3.4.3  backports_1.1.1 magrittr_1.5    rprojroot_1.3-2
 [5] tools_3.4.3     htmltools_0.3.6 yaml_2.1.16     Rcpp_0.12.13   
 [9] stringi_1.1.6   rmarkdown_1.8   knitr_1.17      git2r_0.21.0   
[13] stringr_1.3.0   digest_0.6.12   evaluate_0.10.1
