Last updated: 2018-07-25
Code version: 2e33ca2
The main purpose of this analysis is to determine the optimum number of clusters.
%matplotlib inline
from os import getcwd, chdir
wd = getcwd()
chdir(wd + "/../py")
import perform_kmeans as pkm
import matplotlib.pyplot as plt
import os
import pandas as pd
import plot_segment
import find_opt_nclusters_seg_length
import find_n_neighbours as n_neigh
import cluster_analysis
import produce_labelled_cluster_files as pcf
from IPython.display import HTML
HTML('''<script>
code_show=true;
function code_toggle() {
if (code_show){
$('div.input').hide();
} else {
$('div.input').show();
}
code_show = !code_show
}
$( document ).ready(code_toggle);
</script>
The raw code for this IPython notebook is by default hidden for easier reading.
To toggle on/off the raw code, click <a href="javascript:code_toggle()">here</a>.''')
First, let's calculate the silhouette scores for each segment length and n clusters = 2, 3, ..., 10. (Please ignore that I am using import in the code here to do this. I know this is not how I should do it; this is my first Jupyter notebook!)
find_opt_nclusters_seg_length.perform_silhouette_analysis(path = "../../",
segment_lengths = [100,150,200,250,300],
range_nclusters = range(2,11),
plot_silhouette = False)
Something intriguing is that the silhouette value has multiple modes.
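For reference, here is a minimal sketch of the silhouette score itself, written in plain Python so the quantity being compared is explicit. It is not the code inside `find_opt_nclusters_seg_length`; the function name `silhouette_score`, the use of Euclidean distance, and the toy points are all assumptions made for illustration. The score is only defined for two or more clusters, which is why `range_nclusters` starts at 2.

```python
from math import dist  # Euclidean distance, Python 3.8+


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (illustrative sketch)."""
    # Group point indices by cluster label.
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("the silhouette score needs at least 2 clusters")

    total = 0.0
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # a singleton cluster contributes 0 by convention
        # a: mean distance to the other members of the point's own cluster
        a = sum(dist(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(points)


# Toy data (made up): two tight, well-separated groups.
toy_points = [(0, 0), (0, 1), (10, 10), (10, 11)]
good = silhouette_score(toy_points, [0, 0, 1, 1])  # labels match the groups
bad = silhouette_score(toy_points, [0, 1, 0, 1])   # labels cut across the groups
```

Values near 1 mean each point sits much closer to its own cluster than to the next nearest one, and values near or below 0 mean the clusters overlap.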
We will now plot the 6 nearest segments to each cluster centre for the optimum silhouette score of each segment length.
n_nearest_neighbours = 6
plt.figure(1, figsize=(12,10))
cluster_analysis.plot_n_nearest(seg_length = 100, n_clusters = 5, n_nearest_neighbours = n_nearest_neighbours)
This looks reasonably good. There are distinct differences between each of the 5 clusters.
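The selection step behind these plots can be sketched as follows. This is a guess at the idea rather than the implementation in `cluster_analysis.plot_n_nearest`: the function name `n_nearest_to_centres`, its arguments, and the toy feature vectors are hypothetical, and Euclidean distance in feature space is assumed.

```python
from math import dist  # Euclidean distance, Python 3.8+


def n_nearest_to_centres(features, labels, centres, n):
    """For each cluster, return the indices of the n members closest to its centre."""
    nearest = {}
    for cluster_id, centre in enumerate(centres):
        # Only segments assigned to this cluster are candidates.
        members = [i for i, lab in enumerate(labels) if lab == cluster_id]
        # Sort the members by distance to the cluster centre, closest first.
        members.sort(key=lambda i: dist(features[i], centre))
        nearest[cluster_id] = members[:n]
    return nearest


# Toy example (made-up feature vectors, not the bee data).
toy_features = [(0, 0), (1, 0), (5, 0), (10, 0), (11, 0)]
toy_labels = [0, 0, 0, 1, 1]
toy_centres = [(2, 0), (10.5, 0)]
picked = n_nearest_to_centres(toy_features, toy_labels, toy_centres, n=2)
```

Plotting the segments closest to each centre shows the most typical members of a cluster, so it says little about how ragged the cluster is at its edges.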
plt.figure(1, figsize=(12,12))
cluster_analysis.plot_n_nearest(seg_length = 100, n_clusters = 6, n_nearest_neighbours = n_nearest_neighbours)
Still not too bad, though perhaps there is some ambiguity between ClusterID 2 and ClusterID 4. ClusterID 4 is perhaps a little more attached to the edge of the arena and a little less spread out than ClusterID 2.
Let's look at some longer segment lengths now.
Before we do, though, it's probably worth making the point that, given the way the experiment was conducted, I think it's unlikely that we will see consistent repetitive patterns that occur over longer periods, at least patterns such as those that move around the edge, because the bee is reacting to a stimulus at regular intervals, which disrupts the pattern.
plt.figure(1, figsize=(12,10))
cluster_analysis.plot_n_nearest(seg_length = 150, n_clusters = 5, n_nearest_neighbours = n_nearest_neighbours)
This is actually not too bad. Visually, all clusters are different from one another, but there is good consistency within each cluster.
Let's try for another cluster.
plt.figure(1, figsize=(12,12))
cluster_analysis.plot_n_nearest(seg_length = 150, n_clusters = 6, n_nearest_neighbours = n_nearest_neighbours)
For me, it's starting to get difficult to be clear about whether the additional cluster is helping. Let's try even longer segments. In fact, let's take a look at the longest, 300, and see if we can see a contrast.
plt.figure(1, figsize=(12,4))
cluster_analysis.plot_n_nearest(seg_length = 300, n_clusters = 2, n_nearest_neighbours = n_nearest_neighbours)
I think it's very interesting to see the path consistently tracking the edge for clusterID 1. What is happening in clusterID 0, though? I would describe this as wandering behaviour. Let's have a look at an additional cluster for this length.
plt.figure(1, figsize=(12,6))
cluster_analysis.plot_n_nearest(seg_length = 300, n_clusters = 3, n_nearest_neighbours = n_nearest_neighbours)
There does seem to be some difference between cluster ID 0 and 2: ID 0 covers a greater area. Other than that, though, it's hard to be clear what the differences are.
Now let's look at length 200, which has the second-best silhouette value of all clusters when k = 5.
plt.figure(1, figsize=(12,10))
cluster_analysis.plot_n_nearest(seg_length = 200, n_clusters = 5, n_nearest_neighbours = n_nearest_neighbours)
Although there is less hugging of the boundary by cluster ID 0 than by cluster ID 4, the two are not clearly different from one another. I think perhaps one fewer cluster would be better here and might possibly provide more distinct clusters. Let's try it and see what we get.
plt.figure(1, figsize=(12,8))
cluster_analysis.plot_n_nearest(seg_length = 200, n_clusters = 4, n_nearest_neighbours = n_nearest_neighbours)
I think this is a pretty good clustering, and it looks to have merged the two similar clusters we found previously.
For completeness I think we should also look at the length 250. Here 4 clusters had the optimum silhouette value.
plt.figure(1, figsize=(12,8))
cluster_analysis.plot_n_nearest(seg_length = 250, n_clusters = 4, n_nearest_neighbours = n_nearest_neighbours)
I believe there are several candidates for a good length and number of clusters. These are:
Although there are clear distinctions between segments in the shorter length segments, I'm unsure that they contain enough information to deem them strategies.
The 300mm and 2 clusters combination might not initially seem like a great choice, since we aren't identifying many types of behaviour. That being said, I think that bee behaviour is going to be simpler than the rodent behaviour, and so perhaps this is the best we can do. There's also a chance that when we find more clusters for other segment lengths, what we are finding is just regular movement patterns that don't indicate a strategy. This line of thinking, however, is far from conclusive.
While the silhouette score has helped to guide the process, here we can see that using it blindly, particularly if we wanted to automate the process, might lead to a less optimal clustering regime. This might be because the features have been unable to capture the behaviour accurately.
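If the process were automated, one way to avoid following the silhouette score blindly would be to keep every k whose score is close to the best and inspect those clusterings visually, as was done by hand above. The sketch below is only an illustration of that idea: the helper `shortlist_k`, the `tolerance` value, and the scores in `toy_scores` are all made up and are not results from this analysis.

```python
def shortlist_k(scores, tolerance=0.05):
    """Return every k whose silhouette is within `tolerance` of the best, best first."""
    best = max(scores.values())
    close = [k for k, score in scores.items() if best - score <= tolerance]
    return sorted(close, key=lambda k: scores[k], reverse=True)


# Made-up silhouette scores for one segment length (NOT values from this analysis).
# Note the two modes, at k = 2 and around k = 5.
toy_scores = {2: 0.41, 3: 0.30, 4: 0.39, 5: 0.40, 6: 0.28}
candidates = shortlist_k(toy_scores)
```

With a multi-modal score curve like the toy one, the shortlist contains candidates from both modes, which is the situation where a visual check matters most.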
To plot the proportion of the trajectory that exhibited the behaviour associated with the above clustering regimes, we need additional files. These files are then fed into a couple of R scripts to produce the plots.
pcf.create_file_with_cluster_membership(seg_length = 100, n_clusters = 5)
pcf.create_file_with_cluster_membership(seg_length = 150, n_clusters = 5)
pcf.create_file_with_cluster_membership(seg_length = 200, n_clusters = 4)
pcf.create_file_with_cluster_membership(seg_length = 250, n_clusters = 4)
pcf.create_file_with_cluster_membership(seg_length = 300, n_clusters = 2)
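The quantity the R scripts plot can be sketched in Python as below. The layout of the files written by `pcf.create_file_with_cluster_membership` is not shown here, so this assumes, hypothetically, that each trajectory reduces to a list with one cluster label per segment, and that the segments are of equal length so that the share of segments approximates the share of the trajectory. The name `cluster_proportions` is also an assumption.

```python
from collections import Counter


def cluster_proportions(segment_labels):
    """Fraction of a trajectory's segments assigned to each cluster."""
    counts = Counter(segment_labels)
    total = len(segment_labels)
    # Sort by cluster ID so the output order is stable across trajectories.
    return {cluster_id: counts[cluster_id] / total for cluster_id in sorted(counts)}


# Toy trajectory (made up): four segments with cluster labels 0, 0, 1 and 3.
toy_proportions = cluster_proportions([0, 0, 1, 3])
```

Clusters that never occur in a trajectory are absent from the result, so they would need to be filled in with zero before comparing trajectories side by side.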
sessionInfo()
R version 3.4.3 (2017-11-30)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
Matrix products: default
locale:
[1] LC_COLLATE=English_United Kingdom.1252
[2] LC_CTYPE=English_United Kingdom.1252
[3] LC_MONETARY=English_United Kingdom.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United Kingdom.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] compiler_3.4.3 backports_1.1.1 magrittr_1.5 rprojroot_1.3-2
[5] tools_3.4.3 htmltools_0.3.6 yaml_2.1.16 Rcpp_0.12.13
[9] stringi_1.1.6 rmarkdown_1.8 knitr_1.17 git2r_0.21.0
[13] stringr_1.3.0 digest_0.6.12 evaluate_0.10.1
This R Markdown site was created with workflowr