Creating the Toolkit

Creating a machine learning toolkit [5/21 - 6/9, 3 weeks]

Study the dataset to determine statistical information. Generate a 2D plot for each pair of features, color-coded by cluster.
Implement Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC) as a performance measure.
Test out different kernels and tune parameters.
Test the algorithm on 2 datasets: SAS (high-latitude) and a midlatitude radar. Document and debug problems.
Create a poster for the SuperDARN workshop (6/3 thru 6/8). Use lots of graphs with good descriptions. Due 5/30.
Clean up code.
Create a GitHub toolkit with the code.
Produce online documentation.

Work Log

Study the dataset to determine statistical information (Python statsmodels). [done]

See the poster and slides for graphs, or contact me for more. Velocity and spectral width appear Gaussian, but other features such as phi0, power, and beam are not at all Gaussian. GMM does a decent job regardless, but it always pulls the outlier points into one large, high-variance cluster. We need to figure out how to deal with the outliers: either by using a transformation such as Box-Cox to squash the data, or by finding a way to remove the noise. PCA has not yet done the trick (the noise is still grouped into one cluster), but it might with the right parameters and features.
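As a rough illustration of the Box-Cox idea, the sketch below transforms one skewed feature and compares its skewness before and after. The data is synthetic (a lognormal stand-in for a feature like power, not real radar data), and `boxcox_feature` is a hypothetical helper, not part of the toolkit. Box-Cox is only defined for strictly positive values, so the helper shifts the data first if needed.

```python
import numpy as np
from scipy import stats


def boxcox_feature(values):
    """Box-Cox transform a 1-D feature, shifting it to be strictly positive first."""
    values = np.asarray(values, dtype=float)
    # Box-Cox requires positive inputs, so shift the data when it has values <= 0.
    shift = 1.0 - values.min() if values.min() <= 0 else 0.0
    # With no lambda given, scipy picks the lambda that maximizes the log-likelihood.
    transformed, lam = stats.boxcox(values + shift)
    return transformed, lam, shift


# Synthetic heavy-tailed feature (assumption: stands in for something like power).
rng = np.random.default_rng(0)
power = rng.lognormal(mean=1.0, sigma=0.8, size=2000)

transformed, lam, shift = boxcox_feature(power)
skew_before = stats.skew(power)
skew_after = stats.skew(transformed)
print(f"lambda={lam:.3f}, skew before={skew_before:.2f}, skew after={skew_after:.2f}")
```

A skewness closer to zero after the transform suggests the feature is closer to symmetric, which is a necessary (though not sufficient) condition for a Gaussian fit to be reasonable.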

Implement Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC) as a performance measure. [done]

With all 7 features and all 16 beams, AIC and BIC keep decreasing (a lower score means a better model) up to 100 clusters, at which point fitting becomes too computationally expensive to be practical. However, with just the 1-2 most Gaussian-looking features, AIC and BIC show the best model fit at 5-10 clusters.

Some of the features look very Gaussian, like velocity and spectral width, and some do not, like beam and power.

5-10 clusters seems to make the most physical sense, whereas 100 looks like overfitting. To keep the Gaussian model sensible, rather than using a huge number of clusters to fit data that a Gaussian was never meant to fit, we are considering dropping the features that do not look Gaussian. Alternatively, they could be transformed using a method like Box-Cox.
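The AIC/BIC sweep over cluster counts described above can be sketched with scikit-learn's `GaussianMixture`, which provides `aic` and `bic` methods. This is a minimal sketch on synthetic data (three well-separated Gaussian blobs), not the toolkit's own code; `score_cluster_counts` is a hypothetical helper name.

```python
import numpy as np
from sklearn.mixture import GaussianMixture


def score_cluster_counts(X, cluster_counts, seed=0):
    """Fit one GMM per cluster count and return a list of (n, AIC, BIC) tuples."""
    scores = []
    for n in cluster_counts:
        gmm = GaussianMixture(n_components=n, covariance_type="full", random_state=seed)
        gmm.fit(X)
        # Lower AIC/BIC means a better trade-off between fit and model complexity.
        scores.append((n, gmm.aic(X), gmm.bic(X)))
    return scores


# Synthetic two-feature data: three well-separated Gaussian blobs (not radar data).
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(scale=1.0, size=(300, 2)) for c in centers])

scores = score_cluster_counts(X, range(1, 7))
best_n_bic = min(scores, key=lambda s: s[2])[0]
for n, aic, bic in scores:
    print(f"n={n}: AIC={aic:.1f}, BIC={bic:.1f}")
print("cluster count with lowest BIC:", best_n_bic)
```

On truly Gaussian data like this, BIC tends to bottom out at the true number of components. On non-Gaussian features, the scores can keep dropping as more components are added, which matches the behavior seen with all 7 features.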

Model Selection [will need more work]

Rather than testing kernels (more relevant to Gaussian processes than to GMM), I implemented forward selection to do model selection. However, the results are hard to interpret, so this will require more time. The features that make the most 'physical sense' are not the ones chosen by forward selection, probably because it simply chooses the features that are best fit by a Gaussian rather than the ones most likely to correspond to the IS and GS labels. Backward selection may tell us more, as it has some advantages over forward selection.
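For reference, greedy forward selection can be sketched as below. The scoring function is an assumption: the log does not say which score was used, so this sketch swaps in the silhouette score of the GMM labels, which is comparable across feature subsets. The names `forward_select` and `gmm_silhouette` are hypothetical, and the data is synthetic.

```python
import numpy as np
from sklearn.metrics import silhouette_score
from sklearn.mixture import GaussianMixture


def gmm_silhouette(X, n_components=2, seed=0):
    """Assumed scoring function: silhouette of GMM cluster labels (higher is better)."""
    labels = GaussianMixture(n_components=n_components, random_state=seed).fit_predict(X)
    if len(np.unique(labels)) < 2:
        return -1.0  # silhouette is undefined when only one cluster is used
    return silhouette_score(X, labels)


def forward_select(X, names, score_fn, max_features=None):
    """Greedy forward selection: keep adding the feature that most improves score_fn."""
    remaining = list(range(X.shape[1]))
    chosen, best_score = [], -np.inf
    while remaining and (max_features is None or len(chosen) < max_features):
        # Score every candidate feature together with the already chosen ones.
        trials = [(score_fn(X[:, chosen + [j]]), j) for j in remaining]
        score, j = max(trials)
        if score <= best_score:
            break  # no remaining feature improves the score
        chosen.append(j)
        remaining.remove(j)
        best_score = score
    return [names[j] for j in chosen], best_score


# Synthetic data: one bimodal feature plus two pure-noise features.
rng = np.random.default_rng(0)
velocity = np.concatenate([rng.normal(-8, 1, 200), rng.normal(8, 1, 200)])
X = np.column_stack([velocity, rng.normal(size=400), rng.normal(size=400)])
names = ["velocity", "noise_a", "noise_b"]

selected, score = forward_select(X, names, gmm_silhouette)
print(selected, round(float(score), 3))
```

The choice of `score_fn` determines what "best" means. A likelihood-based score rewards features that are easy to fit with a Gaussian, which is one plausible reason the selected features may not line up with the IS/GS distinction.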

Test the algorithm on 2 datasets: SAS (high-latitude) and a midlatitude radar. Document and debug problems. [done]

The poster above shows a comparison of SAS (high-latitude) vs. CVW (mid-latitude) on Feb 7 2018 (good data, no dual-frequency on this day). Both do better than the traditional model, but for CVW on that day some IS is misclassified as GS on other beams that are not shown there (evidence of this appears in Figure 10 at range gate 20: there should not be a bump in both IS and GS at the same range gate).

The next step here is to study a few days of data and find a better evaluation criterion than the one we currently use (median |velocity| > 15 m/s). Some options are to use a ratio of high to low velocity, as in Ribeiro et al., or to base it on velocity and spectral width, as the traditional method does.
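The current criterion can be sketched as follows, assuming the threshold is applied per cluster to decide whether that cluster is labeled IS or GS. The function name and the example values are hypothetical, not the toolkit's actual code or real measurements.

```python
import numpy as np


def label_clusters_by_median_velocity(velocity, cluster_ids, threshold=15.0):
    """Map each cluster id to 'IS' if its median |velocity| exceeds threshold (m/s), else 'GS'."""
    velocity = np.asarray(velocity, dtype=float)
    cluster_ids = np.asarray(cluster_ids)
    labels = {}
    for cid in np.unique(cluster_ids):
        # Median of the absolute velocity over all points assigned to this cluster.
        median_speed = np.median(np.abs(velocity[cluster_ids == cid]))
        labels[int(cid)] = "IS" if median_speed > threshold else "GS"
    return labels


# Made-up velocities (m/s) for three clusters of three points each.
velocity = np.array([-3.0, 5.0, 2.0, -250.0, 310.0, 180.0, 12.0, -40.0, 14.0])
cluster_ids = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])

labels = label_clusters_by_median_velocity(velocity, cluster_ids)
print(labels)  # {0: 'GS', 1: 'IS', 2: 'GS'}
```

Cluster 2 shows a weakness of a single median threshold: one fast point (40 m/s) does not change the label because the median is 14 m/s. A criterion based on the ratio of high to low velocity points, or on velocity and spectral width together, would use more of the distribution.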

Create a poster for the SuperDARN workshop (6/3 thru 6/8). Use lots of graphs with good descriptions. Due 5/30. [done]

Poster for SuperDARN workshop 2018

Clean up code. Put in a virtual environment. [done / will need more work]

A Python 3 virtual environment is now set up, and installation is easy within it. The code needs more work, but that will have to wait until the end of the summer, once the project is more complete and I know what structure everything will take.

Create a GitHub toolkit with the code. [done]

See main page for a link.

Produce online documentation. [done]

Documentation is available here and on the GitHub page (link on front page). All functions within the superdarn_cluster module are documented with docstrings, but the plotters are not well documented because they are likely to change.