New t-SNE package for R
I finished up a rough version of my t-SNE package for R. If you haven’t heard of t-SNE (t-distributed stochastic neighbor embedding) and are into dimensionality reduction and/or visualization, you should check it out. The “tsne” package for R is available on CRAN.
t-SNE can be thought of as a kind of non-metric multidimensional scaling. It builds an embedding by gradient descent, much like isoMDS and a number of other approaches. However, it does some interesting things to the high-dimensional data before it proceeds.
First, it translates the matrix of raw numeric values into probabilities. Each probability is an estimate of how likely it is that another given datapoint is a neighbor. How many neighbors are effectively considered is controlled by a “perplexity” argument, which can be read as the effective number of neighbors for a given datapoint. The gradient descent then tries to arrange the low-dimensional points so that their neighbor probabilities match the high-dimensional ones as closely as possible, by minimizing the Kullback-Leibler divergence between the two distributions.
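To make the perplexity idea concrete, here is a minimal base-R sketch of how neighbor probabilities for a single point could be calibrated. It follows the standard t-SNE formulation (a Gaussian kernel whose width is found by bisection) and is not taken from the package source; the function names neighbor_probs, perplexity_of and calibrate_point are made up for illustration.

```r
# Illustrative sketch only: these helpers are hypothetical, not part of "tsne".

# Conditional neighbor probabilities for one point, given the squared
# distances to all other points and a precision beta = 1 / (2 * sigma^2).
neighbor_probs <- function(sq_dists, beta) {
  w <- exp(-(sq_dists - min(sq_dists)) * beta)  # shift for numerical stability
  w / sum(w)
}

# Perplexity is 2 raised to the Shannon entropy (in bits) of the distribution.
perplexity_of <- function(p) {
  p <- p[p > 0]
  2^(-sum(p * log2(p)))
}

# Bisection on log(beta): a larger beta gives a narrower kernel and therefore
# a lower perplexity, so the search moves up whenever perplexity is too high.
calibrate_point <- function(sq_dists, target, tol = 1e-5, max_iter = 200) {
  lo <- -20
  hi <- 20
  p <- NULL
  for (iter in seq_len(max_iter)) {
    mid <- (lo + hi) / 2
    p <- neighbor_probs(sq_dists, exp(mid))
    perp <- perplexity_of(p)
    if (abs(perp - target) < tol) break
    if (perp > target) lo <- mid else hi <- mid
  }
  p
}

X <- as.matrix(iris[, 1:4])
D2 <- as.matrix(dist(X))^2            # squared Euclidean distances
i <- 1
p_i <- calibrate_point(D2[i, -i], target = 30)
perplexity_of(p_i)                    # close to the target of 30
```

In the full algorithm this calibration is repeated for every point, and the resulting conditional probabilities are symmetrized before the gradient descent starts.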
The technique copes fairly well with large differences in variance, whether across the dimensions or across the datapoints themselves, and it is fairly quick in the implementations that van der Maaten, one of the technique's authors, provides. My R version is unfortunately much slower.
In the meantime, I’ve been testing it out against various datasets. It does seem to provide a more natural clustering in many cases, such as the classic “iris” flower measurement dataset available in R. I’ve made a simple animation of how the gradient descent process sorts out the clusters and gradually arranges them to minimize its error function.
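For anyone who wants to try something similar, here is a rough sketch of running the package on iris and redrawing the embedding as it evolves. It assumes the “tsne” package is installed and that tsne() takes the k, perplexity, max_iter and epoch_callback arguments described on its help page; the color choices and the plot_epoch helper are my own illustration, and the perplexity and iteration count are arbitrary.

```r
# Sketch under assumptions: requires the "tsne" package from CRAN.
# species_colors and plot_epoch are illustrative names, not package functions.
species_colors <- c(setosa = "red", versicolor = "darkgreen", virginica = "blue")

# Redraw the current embedding; intended to be called once per epoch.
plot_epoch <- function(y, ...) {
  plot(y, type = "n", xlab = "", ylab = "")
  text(y, labels = as.integer(iris$Species),
       col = species_colors[as.character(iris$Species)])
}

if (requireNamespace("tsne", quietly = TRUE)) {
  set.seed(1)  # the starting configuration is random
  embedding <- tsne::tsne(as.matrix(iris[, 1:4]), k = 2, perplexity = 30,
                          max_iter = 500, epoch_callback = plot_epoch)
}
```

Saving each redraw to a file instead of the screen would give the frames for an animation.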
The t-SNE embedding separates the clusters more cleanly than a comparable PCA projection, which is based on simple covariance.
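A PCA baseline for the same data can be produced with base R alone. This is a minimal sketch rather than the exact code behind the comparison; it uses prcomp() on the centered, unscaled measurements, which corresponds to a decomposition of the covariance matrix.

```r
# PCA on the four iris measurements (centered, not rescaled).
pca <- prcomp(iris[, 1:4], center = TRUE, scale. = FALSE)

# Keep the first two principal components as the 2-D projection.
pc_scores <- pca$x[, 1:2]

# Share of total variance captured by each component.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

plot(pc_scores, col = as.integer(iris$Species), pch = 19,
     xlab = "PC1", ylab = "PC2")
legend("topright", legend = levels(iris$Species), col = 1:3, pch = 19)
```

Plotting this next to a t-SNE embedding of the same rows makes the difference in how the species separate easy to judge by eye.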
I’ve tried it on some other datasets, and it works even better with larger sets of more complex data. I’ll update this post with a link to the package once it goes through the verification process at CRAN.