## Probability Q: You are given a coin that may or may not be biased. Specifically, you have three hypothesis about the coin: [on hold] - math

### Why the cost function of logistic regression has a logarithmic expression?

cost function for the logistic regression is
cost(h(theta)X,Y) = -log(h(theta)X) or -log(1-h(theta)X)
My question is what is the base of putting the logarithmic expression for cost function .Where does it come from? i believe you can't just put "-log" out of nowhere. If someone could explain derivation of the cost function i would be grateful. thank you.

Source: my own notes taken during Standford's Machine Learning course in Coursera, by Andrew Ng. All credits to him and this organization. The course is freely available for anybody to be taken at their own pace. The images are made by myself using LaTeX (formulas) and R (graphics).
Hypothesis function
Logistic regression is used when the variable y that is wanted to be predicted can only take discrete values (i.e.: classification).
Considering a binary classification problem (y can only take two values), then having a set of parameters θ and set of input features x, the hypothesis function could be defined so that is bounded between [0, 1], in which g() represents the sigmoid function:
This hypothesis function represents at the same time the estimated probability that y = 1 on input x parameterized by θ:
Cost function
The cost function represents the optimization objective.
Although a possible definition of the cost function could be the mean of the Euclidean distance between the hypothesis h_θ(x) and the actual value y among all the m samples in the training set, as long as the hypothesis function is formed with the sigmoid function, this definition would result in a non-convex cost function, which means that a local minimum could be easily found before reaching the global minimum. In order to ensure the cost function is convex (and therefore ensure convergence to the global minimum), the cost function is transformed using the logarithm of the sigmoid function.
This way the optimization objective function can be defined as the mean of the costs/errors in the training set:

This cost function is simply a reformulation of the maximum-(log-)likelihood criterion.
The model of the logistic regression is:
P(y=1 | x) = logistic(θ x)
P(y=0 | x) = 1 - P(y=1 | x) = 1 - logistic(θ x)
The likelihood is written as:
L = P(y_0, ..., y_n | x_0, ..., x_n) = \prod_i P(y_i | x_i)
The log-likelihood is:
l = log L = \sum_i log P(y_i | x_i)
We want to find θ which maximizes the likelihood:
max_θ \prod_i P(y_i | x_i)
This is the same as maximizing the log-likelihood:
max_θ \sum_i log P(y_i | x_i)
We can rewrite this as a minimization of the cost C=-l:
min_θ \sum_i - log P(y_i | x_i)
P(y_i | x_i) = logistic(θ x_i) when y_i = 1
P(y_i | x_i) = 1 - logistic(θ x_i) when y_i = 0

My understanding (not 100% expert here, I may be wrong) is that the log can be roughly explained as un-doing the exp that appears in the formula for a gaussian probability density. (Remember -log(x) = log(1/x).)
If I understand Bishop  correctly: When we assume that our positive and negative training samples come from two different gaussian clusters (different location but same covariance) then we can develop a perfect classifier. And this classifier looks just like logistic regression (e.g. linear decision boundary).
Of course, the next question is why should we use a classifier that is optimal to separate gaussian clusters, when our training data often looks different?
 Pattern Recognition and Machine Learning, Christopher M. Bishop, Chapter 4.2 (Probabilistic Generative Models)

### Log likelihood of a markov network

I am having trouble understanding the following figure from Coursera class:
From as far as I understand, the equation corresponds the factor table:
And therefore the likelihood of a sample data (a = 0, b=0, c=1) for example would be:
It doesn't look like the graph at any way. Can you please explain the graph for me?

I think you're confusing probability and likelihood.
You have a probability distribution p, parameterised by \theta, which has support on (A, B, C). The probability distribution is a function of A, B, C for fixed theta. The likelihood function, which is what's being graphed in the figure above, is a function of \theta for fixed A, B, C. It's a function which says how probable fixed observations are given different values for the parameters.
In popular usage likelihood and probability are synonymous. In technical use they are not.
With the likelihood/probability issue sorted, that likelihood function is telling you that the joint probability of (A, B, C) is the product of pairwise potentials between all connected pairs, in this case (A, B) and (B, C). I{a^1, b^1) is an indicator function which is 1 when a=1 and b=1 and zero otherwise. \theta_{a^1, b^1} is the parameter corresponding to this outcome.
If I had to guess (I can't see the whole class), I would say there are four \thetas for each pairwise relationship, representing the four possible states (both 1, both 0, or one of each), and we've just dropped the ones where the corresponding indicator function is zero and so the parameters are irrelevant.
Your derivation of the equation is not correct. The form of the MRF basically says add together the parameters corresponding to the correct state of each of the pairs, exponentiate, and normalise. The normalising constant is the sum of the joint probability over all possible configurations.

### Query regarding vector transpose in hypothesis function (Stanford Machine Learning Video Lecture 2)

I was looking at lecture 2 of the stanford machine learning lecture series taught by professor Andrew NG and had a question about something that I think might be quite rudimentary but is just not clicking in my head. So let us consider two vectors θ and x where both vectors contain real numbers.
Let h(x) be a function (in this specific case called the hypothesis) but let it be some function denoted by :
h(x) = "summation from i = 0 to i = n" of θ(i)*x(i) = θ(transpose)*x
I dont understand the last part where he says h(x) is also equal to θ(transpose)*x.
If someone could clarify this concept for my I would be very grateful.

It's just basic linear algebra, it follows from the definition of matrix vector multiplication:
So if θ and x are both n+1 x 1 matrices, then

### Algorithm for finding an equidistributed solution to a linear congruence system

I face the following problem in a cryptographical application: I have given a set of linear congruences
a*x+a*x+a*x == d (mod p)
b*x+b*x+b*x == d (mod p)
c*x+c*x+c*x == d (mod p)
Here, x is unknown an a,b,c,d are given
The system is most likely underdetermined, so I have a large solution space. I need an algorithm that finds an equidistributed solution (that means equidistributed in the solution space) to that problem using a pseudo-random number generator (or fails).
Most standard algorithms for linear equation systems that I know from my linear algebra courses are not directly applicable to congruences as far as I can see...
My current, "safe" algorithm works as follows: Find all variable that appear in only one equation, and assign a random value. Now if in each row, only one variable is unassigned, assign the value according to the congruence. Otherwise fail.
Can anyone give me a clue how to solve this problem in general?

You can use gaussian elimination and similar algorithms just like you learned in your linear algebra courses, but all arithmetic is performed mod p (p is a prime). The one important difference is in the definition of "division": to compute a / b you instead compute a * (1/b) (in words, "a times b inverse"). Consider the following changes to the math operations normally used
addition: a+b becomes a+b mod p
subtraction: a-b becomes a-b mod p
multiplication: a*b becomes a*b mod p
division: a/b becomes: if p divides b, then "error: divide by zero", else a * (1/b) mod p
To compute the inverse of b mod p you can use the extended euclidean algorithm or alternatively compute b**(p-2) mod p.
Rather than trying to roll this yourself, look for an existing library or package. I think maybe Sage can do this, and certainly Mathematica, and Maple, and similar commercial math tools can.

### what does this arg max notation mean in the scikit-learn docs for Naive Bayes?

I'm referring to the following page on Naive Bayes:
http://scikit-learn.org/stable/modules/naive_bayes.html
Specifically the equation beginning with y-hat. I think I generally understand the equations before that, but I don't understand the "arg max y" notation on that line. What does it mean?

Whereas the max of a function is the value of the output at the maximum, the argmax of a function is the value of the input ie the "argument" at the maximum.
In the equation in your example:
y_hat is the value of y, ie the the class label, that maximizes the right hand expression.
Here P(y) is typically the proportion of class y in the training set, also called the "prior", and P(x_i | y) is the probability of observing the feature value x_i if the true class is indeed y, also called the "likelihood".
To understand the product P(x_i | y) better, consider an example where you are trying to classify a sequence of coin flips as coming from either coin A which lands heads in 50% of training examples, or coin B which lands heads in 66.7% of training examples. Here each individual P(x_i | y_j) is the probability of coin y_j (where j is either a or b) landing x_i (where x_i is either heads or tails).
Training set:
THH A
HTT A
HTH A
TTH A
HHH B
HTH B
TTH B
Test set:
HHT ?
So the sequence HHT has a 0.667*0.667*0.333 = 0.148 likelihood given coin B, but only a 0.5*0.5*0.5 = 0.125 likelihood given coin A. However we estimate a 57% prior for coin A since A appears in 4/7 training examples, so we would end up predicting that this sequence came from coin A, since 0.57*0.125 > 0.43*0.148. This is because we are more likely to start with coin A, so coin A has more chances to produce some less-likely sequences.
If the prior for coins A and B were 50% each, then we would naturally predict coin B for HHT, since this sequence clearly has the highest likelihood given coin B.

From Wikipedia:
In mathematics, the arguments of the maxima (abbreviated arg max or argmax) are the points of the domain of some function at which the function values are maximized. In contrast to global maxima, referring to the largest outputs of a function, arg max refers to the inputs, or arguments, at which the function outputs are as large as possible.
In other words, argmax f(x) means the value of x (argument) that maximizes f(x); understandably, it is frequently encountered in optimization problems (which underlie most machine learning algorithms).
Informally speaking, numpy.argmax is a similar function for Numpy arrays (i.e. not functions); it gives the position for which the array value is maximum:
import numpy as np
x = np.array([3,1,8]) # maximum argument at position 2
np.argmax(x)
# 2