How to remove noise from a set of data points assuming the data is normal but the noise is uniform? - machine-learning

I have a large set of points inside a bounded rectangle of the plane. Most of them follow one of n bivariate normal distributions (the number n is unknown), but a small number of the remaining points instead follow a single uniform distribution across the entire rectangle. I'm willing to consider the case where I have an estimate of how many of the points are noise, but I'd prefer a solution that is agnostic to this.
In this image there are two gaussians and the red points are the uniform noise I want to filter out. Note that I’ve drawn this by hand so the good points might not look properly Gaussian. But in a real instance they will be!
I want to filter out that uniform noise so that I only have a mixture of gaussians left. With my assumption of normality, is there a fairly robust solution?
I have been thinking of using DBSCAN as a cleanup step to remove all of the noise, but that obviously raises the problem of picking its parameters.
I currently use GMMs to cluster my data, and some of the uniform noise ends up in its own clusters with massive covariance matrices that extend way outside the rectangle. But I don't know a robust way of deciding which clusters are the noisy ones and which are the true Gaussians.
It seems I want a measure of the density of the detected clusters, or to relate the number of points to the area of the confidence region, since this ratio will be far lower for the uniform clusters.
Are there any papers on similar problems?

I think there are many ways to solve this problem. Here is a simple approach that just uses k-means followed by a normality test on the points in each cluster. The parameters that need to be set are the maximum number of clusters you ever expect to see and a minimum p-value cutoff for the normality test. For each cluster, we run a normal test to see whether its data points roughly follow a normal distribution. Then we keep only the clusters (and their data points) whose p-value is above the cutoff, i.e. the ones where normality is not rejected.
One risk of this approach is that some of the uniformly distributed noise points can really skew the distribution of the real clusters. Therefore, you may want to do some preprocessing to remove points that are too far away from a cluster center.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans
from scipy.stats.mstats import normaltest

def gen_2d_normal(n):
    # sample n points from a bivariate normal with a random mean in the unit square
    mean = np.random.uniform(0, 1, 2)
    cov = 1e-3 * np.eye(2)
    return np.random.multivariate_normal(mean, cov, n)

# used for generation
num_clusters = 3
num_points = 100
num_noise_points = 300

# hyperparameters to set
max_clusters = 8

# generate data: num_clusters Gaussian blobs plus uniform noise
x = []
for cluster_idx in range(num_clusters):
    x.append(gen_2d_normal(num_points))
x.append(np.random.uniform(0, 1, 2 * num_noise_points).reshape(-1, 2))
x = np.vstack(x)

# cluster with k-means, then annotate each cluster with its mean normaltest p-value
clf = KMeans(max_clusters)
y_pred = clf.fit_predict(x)
for idx in range(max_clusters):
    plt.scatter(x[y_pred == idx][:, 0], x[y_pred == idx][:, 1], s=4)
    _, p_vals = normaltest(x[y_pred == idx])
    plt.text(clf.cluster_centers_[idx, 0] - 0.04,
             clf.cluster_centers_[idx, 1] - 0.07,
             '{:.2f}'.format(p_vals.mean()), fontsize=14)
plt.scatter(clf.cluster_centers_[:, 0], clf.cluster_centers_[:, 1],
            s=34, marker='X', color='k')
plt.show()
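The preprocessing mentioned above (dropping points far from their assigned cluster center before running the normality test) might look like this sketch on synthetic data; the 90th-percentile cutoff is an arbitrary choice for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# two tight clusters plus uniform noise, mimicking the setup above
x = np.vstack([rng.normal(0.3, 0.03, (100, 2)),
               rng.normal(0.7, 0.03, (100, 2)),
               rng.uniform(0, 1, (40, 2))])

km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(x)
# distance from each point to its own cluster center
d = np.linalg.norm(x - km.cluster_centers_[km.labels_], axis=1)
# drop the farthest 10% of points before running the normality test
keep = d < np.percentile(d, 90)
x_clean = x[keep]
print(x.shape, x_clean.shape)
```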

Since Gaussians have infinite support, you should expect them to extend outside the rectangle. On such data, I have seen EM work well with an extra 'noise' Gaussian that collects all the noise points, which you then treat as outliers and ignore in the model. You should be able to detect these components easily because of their high variance and low density.
But the DBSCAN approach sounds very reasonable to try.
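Detecting such high-variance, low-density components might look like the following sketch with scikit-learn's GaussianMixture; the peak-density heuristic (mixture weight over the Gaussian normalizing constant) and the 0.1 threshold are arbitrary illustrative choices, not a standard recipe:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# two tight Gaussian blobs plus uniform noise over the unit square
blobs = [rng.multivariate_normal(rng.uniform(0, 1, 2), 1e-3 * np.eye(2), 100)
         for _ in range(2)]
noise = rng.uniform(0, 1, (60, 2))
x = np.vstack(blobs + [noise])

gmm = GaussianMixture(n_components=4, random_state=0).fit(x)
# peak density of each component: weight / (2*pi*sqrt(det(cov)))
dets = np.linalg.det(gmm.covariances_)
peak_density = gmm.weights_ / (2 * np.pi * np.sqrt(dets))
# components whose peak is far below the densest one are likely noise collectors
noise_components = np.where(peak_density < 0.1 * peak_density.max())[0]
print(noise_components)
```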


Query about SVM mapping of input vector? And SVM optimization equation

I have read through a lot of papers and understand the basic concept of a support vector machine at a very high level. You give it a training input vector with a set of features (call it x), and based on how the "optimization function" evaluates x, the text associated with it (let's say we're talking about text classification) is classified into one of two pre-defined classes; this is the binary classification case.
So my first question: in the procedure described above, all the papers say that the training input vector x is first mapped to a higher (maybe infinite) dimensional space. What does this mapping achieve, and why is it required? Let's say the input vector x has 5 features; who decides which "higher dimension" x is mapped to?
Second question is about the following optimization equation:
min (1/2) w^T w + C Σ_{i=1..n} ξ_i
So I understand that w has something to do with the margin of the hyperplane from the support vectors, and I know that C is some sort of penalty, but I don't know what it is a penalty for. Also, what does ξ_i represent in this case?
A simple explanation of the second question would be much appreciated as I have not had much luck understanding it by reading technical papers.
When they talk about mapping to a higher-dimensional space, they mean that the kernel accomplishes the same thing as mapping the points to a higher-dimensional space and then taking dot products there. SVMs are fundamentally a linear classifier, but if you use kernels, they're linear in a space that's different from the original data space.
To be concrete, let's talk about the kernel
K(x, y) = (xy + 1)^2 = (xy)^2 + 2xy + 1,
where x and y are each real numbers (one-dimensional). Note that
(x^2, sqrt(2) x, 1) • (y^2, sqrt(2) y, 1) = x^2 y^2 + 2 x y + 1
has the same value. So K(x, y) = phi(x) • phi(y), where phi(a) = (a^2, sqrt(2) a, 1), and doing an SVM with this kernel (the inhomogeneous polynomial kernel of degree 2) is the same as if you first mapped your 1d points into this 3d space and then used a linear kernel.
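You can check this identity numerically; K, phi, and the sample values below are just the ones from the derivation above:

```python
import numpy as np

def K(x, y):
    # the inhomogeneous polynomial kernel of degree 2, on scalars
    return (x * y + 1) ** 2

def phi(a):
    # the explicit 3-d feature map for that kernel
    return np.array([a ** 2, np.sqrt(2) * a, 1.0])

# the kernel value equals the dot product of the mapped points
x, y = 1.5, -0.7
print(K(x, y), phi(x) @ phi(y))  # the two numbers agree
```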
The popular Gaussian RBF kernel function is equivalent to mapping your points into an infinite-dimensional Hilbert space.
You're the one who decides what feature space it's mapped into, when you pick a kernel. You don't necessarily need to think about the explicit mapping when you do that, though, and it's important to note that the data is never actually transformed into that high-dimensional space explicitly - then infinite-dimensional points would be hard to represent. :)
The ξ_i are the "slack variables". Without them, SVMs would never be able to account for training sets that aren't linearly separable -- which most real-world datasets aren't. The ξ in some sense are the amount you need to push data points on the wrong side of the margin over to the correct side. C is a parameter that determines how much it costs you to increase the ξ (that's why it's multiplied there).
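To make the slack/C trade-off concrete, here is a small sketch using scikit-learn's SVC on deliberately overlapping classes; the class means and C values are arbitrary choices for illustration:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# two overlapping classes, so perfect separation is impossible and slack is needed
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(1.5, 1.0, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

# small C: slack is cheap, giving a wide margin with many tolerated violations;
# large C: slack is expensive, so the fit tries hard to classify every point
n_sv = {C: SVC(kernel='linear', C=C).fit(X, y).n_support_.sum()
        for C in (0.01, 100.0)}
print(n_sv)  # the small-C model keeps many more support vectors
```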
1) The higher dimension space happens through the kernel mechanism. However, when evaluating the test sample, the higher dimension space need not be explicitly computed. (Clearly this must be the case because we cannot represent infinite dimensions on a computer.) For instance, radial basis function kernels imply infinite dimensional spaces, yet we don't need to map into this infinite dimension space explicitly. We only need to compute, K(x_sv,x_test), where x_sv is one of the support vectors and x_test is the test sample.
The specific higher-dimensional space is determined by the kernel you choose; the training procedure and parameters then choose a particular function in that space, via a set of support vectors and their corresponding weights.
2) C is the weight associated with the cost of not being able to classify the training set perfectly. The optimization equation says to trade-off between the two undesirable cases of non-perfect classification and low margin. The ξi variables represent by how much we're unable to classify instance i of the training set, i.e., the training error of instance i.
See Chris Burges' tutorial on SVMs for about the most intuitive explanation of this stuff you're going to get anywhere (IMO).

Scikit_learn's PolynomialFeatures with logistic regression resulting in lower scores

I have a dataset X whose shape is (1741, 61). Using logistic regression with cross-validation (cv=5), I was getting around 62-65% accuracy for each split.
I thought that if I made the features polynomial, the accuracy was supposed to increase. However, I'm getting the opposite effect (each cross-validation split is now in the 40s, percentage-wise), so I'm presuming I'm doing something wrong when transforming the data?
Here is the code I'm using,
from sklearn import preprocessing
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LogisticRegression
from sklearn.cross_validation import cross_val_score

X_scaled = preprocessing.scale(X)
poly = PolynomialFeatures(3)
poly_x = poly.fit_transform(X_scaled)
classifier = LogisticRegression(penalty='l2', max_iter=200)
cross_val_score(classifier, poly_x, y, cv=5)
array([ 0.46418338, 0.4269341 , 0.49425287, 0.58908046, 0.60518732])
Which makes me suspect I'm doing something wrong.
I tried transforming the raw data into polynomial features first and then using preprocessing.scale to scale the result, but that produced an error:
UserWarning: Numerical issues were encountered when centering the data and might not be solved. Dataset may contain too large values. You may need to prescale your features.
warnings.warn("Numerical issues were encountered "
So I didn't bother going this route.
The other thing that's bothering me is the speed of the polynomial computations: cross_val_score takes around a couple of hours to produce scores when using polynomial features. Is there any way to speed this up? I have an Intel i5-6500 CPU with 16 GB of RAM, on Windows 7.
Thank you.
Have you tried using MinMaxScaler instead of scale? scale outputs values both above and below 0, so you run into a situation where a value scaled to -0.1 and one scaled to 0.1 have the same squared value, despite not being similar at all. Intuitively, this seems like something that would lower the score of a polynomial fit. That being said, I haven't tested this; it's just my intuition. Furthermore, be careful with polynomial fits. I suggest reading the answer to "Why use regularization in polynomial regression instead of lowering the degree?". It's a great explanation and will likely introduce you to some new techniques. As an aside, @MatthewDrury is an excellent teacher and I recommend reading all of his answers and blog posts.
There is a statement that "the accuracy is supposed to increase" with polynomial features. That is true only if the polynomial features bring the model closer to the true data-generating process. Making every feature interact and polynomial may instead move the model further from the data-generating process; hence worse results are to be expected.
By using a degree-3 polynomial in scikit-learn, the X matrix went from (1741, 61) to (1741, 41664), which is significantly more columns than rows.
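You can verify that count: with all terms up to degree d in n features (including the bias column), PolynomialFeatures produces C(n+d, d) output columns:

```python
from math import comb
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# number of monomials of degree <= d in n features (including the bias column)
n_features, degree = 61, 3
print(comb(n_features + degree, degree))  # 41664

# verify on a tiny sample rather than the full (1741, 61) matrix
X_small = np.zeros((2, n_features))
n_out = PolynomialFeatures(degree).fit_transform(X_small).shape[1]
print(n_out)  # 41664
```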
41k+ columns will take longer to solve. You should look at feature-selection methods. As Grr says, investigate lowering the polynomial degree. Try L1 regularization, grouped lasso, RFE, or Bayesian methods. Ask SMEs (subject-matter experts who may be able to identify the specific features that should be polynomial), and plot the data to see which features may interact or be best as polynomials.
I have not looked at it in a while, but I recall discussions of hierarchically well-formulated models (can you remove x1 but keep the x1 * x2 interaction?). That is probably worth investigating if your model behaves best as an ill-formulated hierarchical model.

Math: Is this k-means clustering?

Hello! I have some points on a line. These points have only an X dimension, not a Y dimension; I only spread them out in Y because I wanted to be able to place multiple dots on the same spot.
I would like to find n centroids (spots with the most density).
I placed example centroids (the green lines) to show what I mean. These example centroids were not calculated; I only placed them by guessing where they would be.
Before I dive into the math, I would like to know whether this can be solved with k-means clustering, or if I am going in the wrong direction.
Thank you.
K-means is rather sensitive to noise, and you seem to have a lot of noise. But yes, it may work to some extent. It also does not exploit the fact that your data is just 1-dimensional.
However, it sounds to me as if you want to do some very basic mode seeking. In 1D, the most appropriate approach for you is kernel density estimation (KDE): estimate the density, then choose the local density maxima.
"Cluster analysis" sure sounds much fancier, but classical-statistics KDE will likely produce much better results. In particular, you don't have to fix k beforehand, and it will be much more robust with respect to noise.
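A minimal sketch of that KDE-plus-local-maxima approach in Python; the cluster locations and evaluation grid are made up for illustration:

```python
import numpy as np
from scipy.stats import gaussian_kde
from scipy.signal import argrelextrema

rng = np.random.default_rng(0)
# 1-d data: two dense spots plus uniform background noise
data = np.concatenate([rng.normal(2.0, 0.1, 200),
                       rng.normal(5.0, 0.1, 200),
                       rng.uniform(0.0, 7.0, 50)])

kde = gaussian_kde(data)            # default (Scott's rule) bandwidth
grid = np.linspace(0.0, 7.0, 1000)
density = kde(grid)
# the local maxima of the estimated density are the sought "centroids"
modes = grid[argrelextrema(density, np.greater)[0]]
print(modes)  # includes modes near 2 and 5
```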
You can use K-means, and the implementation is actually quite easy:
1. Select the number of clusters k you want.
2. Select k points randomly as initial centers (you can repeat this to avoid local optima).
3. Find the distance of every point to each of the k centers.
4. Assign each point to its nearest center.
5. For each set of points, calculate the average.
6. If the averages changed, move the cluster centers to the new averages and go to step 3.
7. Otherwise, finish.
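The steps above can be sketched in a few lines of NumPy; this toy kmeans_1d helper is my own illustration, not a library function:

```python
import numpy as np

def kmeans_1d(points, k, n_iter=100, seed=0):
    """Plain k-means on 1-d data, following the steps listed above."""
    rng = np.random.default_rng(seed)
    centers = rng.choice(points, k, replace=False)     # step 2: random initial centers
    for _ in range(n_iter):
        # steps 3-4: assign every point to its nearest center
        labels = np.argmin(np.abs(points[:, None] - centers[None, :]), axis=1)
        # step 5: recompute each center as the average of its points
        new_centers = np.array([points[labels == j].mean() for j in range(k)])
        if np.allclose(new_centers, centers):          # step 7: converged, finish
            break
        centers = new_centers                          # step 6: move centers, repeat
    return centers, labels

rng = np.random.default_rng(42)
data = np.concatenate([rng.normal(0.0, 0.2, 50), rng.normal(4.0, 0.2, 50)])
centers, labels = kmeans_1d(data, 2)
print(np.sort(centers))  # roughly [0., 4.]
```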
Or you can use matlab to do this for you:
k = 2;
rng('default') % For reproducibility
% the second half of X was cut off in the original post;
% a mirrored second cluster is a natural (assumed) completion:
X = [randn(100,1)+ones(100,1); randn(100,1)-ones(100,1)];
opts = statset('Display','final');
[idx,ctrs] = kmeans(X,k,'Distance','city','Replicates',5,'Options',opts);
% plotting calls restored to match the legend below (value vs. index)
plot(X(idx==1),'r.','MarkerSize',12)
hold on
plot(X(idx==2),'b.','MarkerSize',12)
plot(ctrs,'kx','MarkerSize',12,'LineWidth',2)
legend('Cluster 1','Cluster 2','Centroids','Location','NW')
hold off
I put the result on a diagonal to show it better, but the real data is 1D:

Math Graph to Equation

Is there a tool which converts a graphical representation of an equation into that equation (graphical representation to an approximate math equation)?
This is a tricky problem, usually referred to as interpolation. For simple polynomial graphs it's an easy problem. (You can always find an "exact match".) Have a look at polynomial interpolation. But you could also have a graph that represents some trigonometric function. Or how about exponential functions or logarithmic functions. Or worse, combinations! Even for simple graphs there may be thousands of interesting potential equations.
Even if you do check all interesting equations, you should still be careful. Consider the equation y = A * sin(B*x), with extremely large values for A and B. How does that graph look? Well, it goes up and down between A and -A over and over again, really really fast, and "hits" or "almost hits" just about all points. It's a "simple" formula, which mathematically looks like a good approximation, but it's still most likely not something you would want in the end.
A common problem that might fit your description is called curve fitting: you have some data (that, in your case, you've read from a graph) and you have in mind a form of an equation, and you want to find what parameters you need to best fit the equation to the graph.
A useful approach to this is fit to least squares error. A least squares package will be available in most data analysis tool kits.
Here's an example: say the equation is A*sin(2*pi*100*x)*x^B, and I need to find the values of A and B that give me the best fit (A=10.0 and B=3.0 in this example).
Here's the code used to generate this fit. It uses Python and SciPy and is modified from the example here.
from numpy import *
from scipy.optimize import leastsq
import matplotlib.pyplot as plt

def my_func(x, p):  # the function to fit (and also used here to generate the data)
    return p[0]*sin(2*pi*100.*x)*x**p[1]

# First make some data to represent what would be read from the graph
p_true = 10., 3.0  # the parameters used to make the true data
x = arange(.5, .5+12e-2, 2e-2/60)
y_true = my_func(x, p_true)
y_meas = y_true + .08*random.randn(len(x))  # add some noise to make the data as read from a graph

# Here's where you'd start for reading data from a graph
def residuals(p, y, x):  # a function that returns my errors between fit and data
    err = y - my_func(x, p)
    return err

p0 = [8., 3.5]  # some starting parameters to my function (my initial guess)
plsq = leastsq(residuals, p0, args=(y_meas, x))  # do the least squares fit

# plot the results
plt.plot(x, my_func(x, plsq[0]), x, y_meas, '.', x, y_true)
plt.title('Least-squares fit to curve')
plt.legend(['Fit', 'Graph', 'True'])
plt.show()
I've seen some tools that fit equations to graphs in images but I can't recall their names right now. A quick Google search turned up this commercial application:

Geometric representation of Perceptrons (Artificial neural networks)

I am taking this course on Neural networks in Coursera by Geoffrey Hinton (not current).
I have a very basic doubt on weight spaces.
Page 18.
If I have a weight vector [w1=1, w2=2] (bias is 0) and training cases {1, 2, -1} and {2, 1, 1},
where I guess {1, 2} and {2, 1} are the input vectors and -1 and 1 the target outputs, how can this be represented geometrically?
I am unable to visualize it. Why does a training case give a plane which divides the weight space in two? Could somebody explain this using coordinate axes in 3 dimensions?
The following is the text from the ppt:
1. Weight-space has one dimension per weight.
2. A point in the space represents a particular setting of all the weights.
3. Assuming that we have eliminated the threshold, each training case can be represented as a hyperplane through the origin.
My doubt is in the third point above. Kindly help me understand.
It's probably easier to explain if you look deeper into the math. Basically, a single layer of a neural net performs some function on your input vector, transforming it into a different vector space.
You don't want to jump right into thinking of this in 3-dimensions. Start smaller, it's easy to make diagrams in 1-2 dimensions, and nearly impossible to draw anything worthwhile in 3 dimensions (unless you're a brilliant artist), and being able to sketch this stuff out is invaluable.
Let's take the simplest case: you're taking in an input vector of length 2, and you have a weight vector of dimension 2x1, which implies an output vector of length one (effectively a scalar).
In this case it's pretty easy to imagine that you've got something of the form:
input = [x, y]
weight = [a, b]
output = ax + by
If we assume that weight = [1, 3], we can see, and hopefully intuit that the response of our perceptron will be something like this:
With the behavior being largely unchanged for different values of the weight vector.
It's easy to imagine then, that if you're constraining your output to a binary space, there is a plane, maybe 0.5 units above the one shown above that constitutes your "decision boundary".
As you move into higher dimensions this becomes harder and harder to visualize, but if you imagine that that plane shown isn't merely a 2-d plane, but an n-d plane or a hyperplane, you can imagine that this same process happens.
Since actually creating the hyperplane requires either the input or output to be fixed, you can think of giving your perceptron a single training value as creating a "fixed" [x, y] value, which can then be used to create a hyperplane. Sadly, this cannot effectively be visualized, as 4-d drawings are not really feasible in a browser.
Hope that clears things up, let me know if you have more questions.
The "decision boundary" for a single-layer perceptron is a plane (hyperplane),
where n in the image is the weight vector w; in your case w = {w1=1, w2=2} = (1, 2), and the direction specifies which side is the right side. n is orthogonal (90 degrees) to the plane.
A plane always splits a space into 2 naturally (extend the plane to infinity in each direction).
You can also try to input different values into the perceptron and find where the response is zero (only on the decision boundary).
Recommend you read up on linear algebra to understand it better:
For a perceptron with 1 input & 1 output layer, there can only be 1 LINEAR hyperplane. And since there is no bias, the hyperplane cannot shift along an axis, so it will always pass through the origin. However, if there is a bias, the hyperplanes may no longer all share the same point.
I encountered this question on SO while preparing a large article on linear combinations (it's in Russian). It has a section on the weight space, and I would like to share some thoughts from it.
Let's take a simple case of linearly separable dataset with two classes, red and green:
The illustration above is in the data space X, where samples are represented by points and the weight coefficients constitute a line. It can be conveyed by the following formula:
w^T * x + b = 0
But we can rewrite it vice versa, making the x component a vector coefficient and w a vector variable:
x^T * w + b = 0
because the dot product is symmetric. Now it can be visualized in the weight space the following way:
where red and green lines are the samples and blue point is the weight.
Moreover, the possible weights are limited to the area below (shown in magenta):
which could be visualized in dataspace X as:
Hope it clarifies dataspace/weightspace correlation a bit. Feel free to ask questions, will be glad to explain in more detail.
I think the reason a training case can be represented as a hyperplane is the following.
Let's say
[j,k] is the weight vector and
[m,n] is the training-input
training-output = jm + kn
Given that in this perspective the training case is fixed and the weights vary, the training input (m, n) becomes the coefficients and the weights (j, k) become the variables.
Just as in any textbook where z = ax + by is a plane,
training-output = jm + kn is also a plane defined by training-output, m, and n.
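Using the numbers from the question, you can check which side of each training case's hyperplane the weight vector [1, 2] falls on:

```python
import numpy as np

w = np.array([1.0, 2.0])                 # the weight vector from the question
cases = [(np.array([1.0, 2.0]), -1),     # training case {1, 2, -1}
         (np.array([2.0, 1.0]),  1)]     # training case {2, 1, 1}

results = []
for x, label in cases:
    # in weight space, each case x defines the plane w . x = 0 through the origin;
    # w must lie on the side of that plane where sign(w . x) matches the label
    results.append(np.sign(w @ x) == label)

print(results)  # [False, True]: w satisfies the second case but not the first
```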