Neural Network
$$
\begin{align}
NrOfPaths &= layerSize^k \\
NrOfWeights &= layerSize_1 \cdot layerSize_2 + layerSize_2 \cdot layerSize_3 + \dots
\end{align}
$$
Where \(layerSize\) is the number of nodes per layer and \(k\) the number of layers.
Feed-Forward Neural Network
In a feed-forward neural network, all layers are fully connected (every node is connected to every node of the next layer), but there are no connections within a layer. Thus information can only flow forward.

A neuron calculates the weighted sum of all its inputs, subtracts a bias value and puts the result through the activation function.
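A minimal NumPy sketch of this computation; the input, weight and bias values are made up for illustration:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

x = np.array([0.5, -1.0, 2.0])   # inputs
w = np.array([0.1, 0.4, -0.3])   # weights
b = 0.2                          # bias

# weighted sum of the inputs, minus the bias, through the activation
output = sigmoid(np.dot(w, x) - b)
print(output)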

There are different activation functions that can be used; common choices are sigmoid, tanh and ReLU.
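A short NumPy sketch of these three activation functions (sigmoid and ReLU are the ones referred to later in these notes):

import numpy as np

def sigmoid(z):            # squashes values to (0, 1)
    return 1 / (1 + np.exp(-z))

def tanh(z):               # squashes values to (-1, 1)
    return np.tanh(z)

def relu(z):               # 0 for negative inputs, identity otherwise
    return np.maximum(0, z)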

Output Layer
Depending on the problem a different output layer is used.

(Softmax ensures that the outputs of the output nodes form a probability distribution, i.e. they lie between 0 and 1 and sum to 1.)
Instead of softmax one-vs-all can also be used.
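A small NumPy sketch of softmax; subtracting the maximum before exponentiating is a common trick for numerical stability:

import numpy as np

def softmax(z):
    # shift by the maximum for numerical stability, then normalise
    e = np.exp(z - np.max(z))
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))  # outputs sum to 1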
Cost Function

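The Keras example at the end of these notes uses 'mse' and 'categorical_crossentropy'; a minimal NumPy sketch of both (the epsilon clipping is only there to avoid log(0)):

import numpy as np

def mse(y_true, y_pred):
    # mean squared error, for numerical (regression) outputs
    return np.mean((y_true - y_pred) ** 2)

def categorical_crossentropy(y_true, y_pred, eps=1e-12):
    # cross-entropy between a one-hot label and a softmax output
    y_pred = np.clip(y_pred, eps, 1.0)
    return -np.sum(y_true * np.log(y_pred))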
How to Train your Feedforward Neural Network

The chain rule for partial derivatives holds for a composition \(f \circ g\):
$$
\begin{align}
z &= f(y), & y &= g(x) \\
\frac{\partial z}{\partial x} &= \frac{\partial z}{\partial y}\frac{\partial y}{\partial x}
\end{align}
$$
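A quick numerical check of the chain rule for a made-up composition \(z = f(g(x))\) with \(f(y) = y^2\) and \(g(x) = \sin x\):

import numpy as np

x = 0.7
y = np.sin(x)            # y = g(x)
z = y ** 2               # z = f(y)

dz_dy = 2 * y            # partial z / partial y
dy_dx = np.cos(x)        # partial y / partial x
chain = dz_dy * dy_dx    # chain rule

# compare with a finite-difference approximation of dz/dx
h = 1e-6
numeric = (np.sin(x + h) ** 2 - np.sin(x) ** 2) / h
print(chain, numeric)    # the two values agree closely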

Vanishing Gradient Problem
One problem with backpropagation in large models using the sigmoid function is that all partial derivatives have a magnitude smaller than \(1\). Multiplying many of them together makes the gradient smaller and smaller, so the model learns very slowly. ReLU solves this to a degree, since its partial derivative is either \(0\) or \(1\).
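A rough illustration of the effect over ten layers (pre-activation values chosen for illustration): multiplying many sigmoid derivatives shrinks the gradient, while ReLU derivatives of 1 leave it unchanged.

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1 - s)                             # at most 0.25

z_values = np.zeros(10)                            # ten layers, pre-activations at 0 (best case for sigmoid)
print(np.prod(sigmoid_derivative(z_values)))       # ~9.5e-07, gradient almost vanished
print(np.prod(np.ones(10)))                        # ReLU derivative of 1 per layer: stays 1.0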
Optimizing
The following techniques are ways to optimise a model and prevent it from overfitting.
Dropout
With dropout, some randomly chosen nodes are disabled during training (during testing all nodes are used). This forces other neurons to learn the same behaviour and makes the model more stable overall. Typical dropout rates are between 20% and 50%.
From experience, larger networks with dropout perform better than small networks without dropout.
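A rough sketch of what a dropout layer does during training, here with a 20% dropout rate; the surviving activations are rescaled so their expected sum stays the same ("inverted dropout"):

import numpy as np

rate = 0.2                                       # fraction of nodes to drop
activations = np.array([0.5, 1.2, -0.3, 0.8, 0.1])

mask = np.random.rand(activations.size) >= rate  # keep roughly 80% of the nodes
dropped = activations * mask / (1 - rate)        # rescale the survivors
print(dropped)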
Early Stopping

At a certain point during training, a model can start to overfit and learn noise instead of the patterns in the data. This causes the model to get worse on data other than the training data (as can be seen in the diagram above).
To prevent this, one can introduce checkpoints where the loss and other quality measurements are taken and evaluated. If the score gets worse, revert to the previous best checkpoint.
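Keras ships a callback that does this; a minimal sketch (the patience value is arbitrary):

from keras.callbacks import EarlyStopping

early_stop = EarlyStopping(monitor='val_loss',         # quality measurement to watch
                           patience=3,                 # allow a few bad checkpoints
                           restore_best_weights=True)  # revert to the best checkpoint

# passed to training together with a validation split, e.g.:
# model.fit(X, y, epochs=100, validation_split=0.2, callbacks=[early_stop])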
Data Augmentation

To generate more training data, one can create samples artificially by:
- adding noise
- combining or extrapolating training samples
- modifying existing training samples
This should make the model more robust and stable (see the sketch below).
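A minimal sketch of the first idea, adding Gaussian noise to existing samples (the noise level is chosen arbitrarily):

import numpy as np

X = np.random.random((100, 72))                    # original training samples
noise = np.random.normal(0, 0.01, size=X.shape)    # small Gaussian noise
X_augmented = np.concatenate([X, X + noise])       # twice as many samples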
This can also work for text:

In Case of Bad Performance
If a model doesn't predict to a satisfactory degree, one should analyse the learning curve when changing the following values:
- The number of training samples
- The number of hidden neurons
- The activation functions
- Regularisation
- Learning rate / Learning rate decay
- Batch size
- Optimisation algorithms
- Number of epochs
- Early stopping
- Dropout
- Data augmentation
Architecture of Neural Networks

Universality Theorem
A neural net with one hidden layer and an arbitrary number of neurons can approximate any given continuous function.

The main idea is to cut the function into small pieces and use two neurons to approximate each piece as a step. The left diagram shows two neurons approximating a step function.
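A sketch of the idea with two sigmoid neurons: a steep sigmoid approximates a step up, and subtracting a second, shifted one gives a "bump" that is non-zero only on a small piece of the input range (weights chosen for illustration):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

x = np.linspace(0, 1, 11)
w = 200                                     # large weight -> very steep sigmoid, almost a step
step_up   = sigmoid(w * (x - 0.4))          # steps from 0 to 1 near x = 0.4
step_down = sigmoid(w * (x - 0.6))          # steps from 0 to 1 near x = 0.6
bump = step_up - step_down                  # ~1 only between 0.4 and 0.6
print(np.round(bump, 2))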
Keras

import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout

model = Sequential()
# hidden layer with 5 neurons, expecting 72 input features
model.add(Dense(5, input_shape=(72,)))
model.add(Activation('relu'))
# the activation can also be passed directly to the layer
model.add(Dense(7, activation="relu"))
# Dropout disables random nodes during training
model.add(Dropout(rate=.2))
# output layer with 2 nodes
model.add(Dense(2))
# compile once, choosing the loss to match the task:
# for numerical (regression) models
model.compile(optimizer='sgd', loss='mse')
# for categorical models (use a softmax output layer and one-hot labels)
# model.compile(loss="categorical_crossentropy", optimizer="sgd")
# Data:
X = np.random.random((100, 72))
y = np.random.random((100, 2))
# Train:
model.fit(X, y, epochs=1, batch_size=10)
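After training, the model can be evaluated and used for predictions; a short sketch using the same random data:

loss = model.evaluate(X, y, batch_size=10)   # loss on the given data
predictions = model.predict(X[:5])           # outputs for the first five samples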
