Definition 1.1: Given a function $\varphi: \mathbb{R}^n \rightarrow \mathbb{R}^m$ which is extendable to the dual numbers, we say that its derivative $\varphi'(a)$ at a point $a \in \mathbb{R}^n$, if it exists, is the $m \times n$ matrix satisfying the following:
$$\varphi(a + h\varepsilon) = \varphi(a) + \varphi'(a)h\varepsilon$$for any $h \in T_a\mathbb{R}^n$. Furthermore, we define the differential $d\varphi: T_a\mathbb{R}^n \rightarrow T_{\varphi(a)}\mathbb{R}^m$ of $\varphi$ at the point $a$ as the linear map $d\varphi(h) = \varphi'(a)h$ represented by the derivative of $\varphi$ at that point.
Example 1.1: Given the function $\varphi: \mathbb{R} \rightarrow \mathbb{R}$ satisfying $\varphi(x) = x^2$:
$$ \begin{align*} \varphi(x + dx\,\varepsilon) &= (x + dx\,\varepsilon)^2\\ &= x^2 + 2x\,dx\,\varepsilon + dx^2\varepsilon^2\\ &= x^2 + 2x\,dx\,\varepsilon\\ &= \varphi(x) + 2x\,dx\,\varepsilon \end{align*} $$Here we write $h = dx$ for the nudge, and we use $\varepsilon^2 = 0$ to drop the quadratic term. Therefore $\varphi'(x)$ is the $1 \times 1$ matrix $2x$ and $d\varphi = 2x\,dx$ at $x$. Incidentally, this means that $d\varphi$ divided by $dx$ is $2x$.
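The computation above can be mechanized: if we teach a computer the rule $\varepsilon^2 = 0$, evaluating $\varphi$ at $x + 1\cdot\varepsilon$ makes the derivative fall out of the $\varepsilon$ part automatically. Below is a minimal sketch (the `Dual` class is our own illustration, not a standard library type), assuming only addition and multiplication are needed.

```python
# A dual number a + b*eps, where eps**2 == 0.
class Dual:
    def __init__(self, a, b=0.0):
        self.a = a  # real part
        self.b = b  # eps (infinitesimal) part

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.a + other.a, self.b + other.b)

    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # (a1 + b1 eps)(a2 + b2 eps) = a1*a2 + (a1*b2 + b1*a2) eps,
        # since the eps**2 term vanishes.
        return Dual(self.a * other.a, self.a * other.b + self.b * other.a)

    __rmul__ = __mul__

def phi(x):
    return x * x  # phi(x) = x^2

# Evaluate phi at x + 1*eps; the eps part is phi'(x).
result = phi(Dual(3.0, 1.0))
print(result.a, result.b)  # 9.0 6.0, i.e. phi(3) = 9 and phi'(3) = 2*3 = 6
```

Seeding the $\varepsilon$ part with $1$ corresponds to choosing $dx = 1$, so the output's $\varepsilon$ part is exactly $\varphi'(x)$.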
Example 1.2: Given the function $\varphi: \mathbb{R}^3 \rightarrow \mathbb{R}^2$ defined by $\varphi(x,y,z) = (xy, yz)$, we compute, writing $a = (x, y, z)$ and $h = (dx, dy, dz)$:
$$ \begin{align*} \varphi(a + h\varepsilon) &= \begin{bmatrix} (x + dx\varepsilon)(y + dy\varepsilon) \\ (y + dy\varepsilon)(z + dz\varepsilon) \end{bmatrix}\\ &= \begin{bmatrix} xy+x\,dy\varepsilon + y\,dx\varepsilon + dy\,dx\varepsilon^2 \\ yz + z\,dy\varepsilon + y\,dz\varepsilon + dy\,dz\varepsilon^2 \end{bmatrix}\\ &= \begin{bmatrix} xy \\ yz\end{bmatrix} + \begin{bmatrix} x\,dy\varepsilon + y\,dx\varepsilon \\ z\,dy\varepsilon + y\,dz\varepsilon \end{bmatrix}\\ &= \varphi(a) + \begin{bmatrix} y\,dx + x\,dy + 0\,dz \\ 0\,dx + z\,dy + y\,dz\end{bmatrix} \varepsilon\\ &= \varphi(a) + \begin{bmatrix} y & x & 0 \\ 0 & z & y \end{bmatrix}h\varepsilon\\ \end{align*} $$Therefore $\varphi'(a)$ is the $2 \times 3$ matrix shown above. Notice that this matrix has as many columns as $\varphi$'s domain has dimensions, and as many rows as $\varphi$'s codomain has dimensions. The same thing happened in Example 1.1, and this is a general fact about derivatives.
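The full derivative matrix can be recovered one column at a time: nudging only the $i^\text{th}$ coordinate (i.e. taking $h = e_i$) makes the $\varepsilon$ parts of the outputs read off the $i^\text{th}$ column. A sketch of this idea, again using our own illustrative `Dual` class:

```python
# Dual number a + b*eps with eps**2 == 0 (illustrative helper class).
class Dual:
    def __init__(self, a, b=0.0):
        self.a, self.b = a, b

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.a + other.a, self.b + other.b)

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.a * other.a, self.a * other.b + self.b * other.a)

def phi(x, y, z):
    return (x * y, y * z)  # phi: R^3 -> R^2

def jacobian(f, point):
    cols = []
    for i in range(len(point)):
        # Nudge only the i-th coordinate: h = e_i, so dx_j = 1 if j == i else 0.
        args = [Dual(v, 1.0 if j == i else 0.0) for j, v in enumerate(point)]
        # The eps parts of the outputs form the i-th column.
        cols.append([out.b for out in f(*args)])
    # Transpose so that rows index outputs and columns index inputs.
    return [list(row) for row in zip(*cols)]

print(jacobian(phi, (2.0, 3.0, 5.0)))  # [[3.0, 2.0, 0.0], [0.0, 5.0, 3.0]]
```

At $(x, y, z) = (2, 3, 5)$ this agrees with the matrix $\begin{bmatrix} y & x & 0 \\ 0 & z & y \end{bmatrix}$ computed above.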
An important special case is what happens when the codomain of $\varphi$ is just $\mathbb{R}$. We will provide one last example as a warmup.
Example 1.3: Given the function $u: \mathbb{R}^2 \rightarrow \mathbb{R}$ defined by $u(x,y) = \frac{y}{x}$, we compute:
$$ \begin{align*} u(a + h\varepsilon) &= \frac{(y + dy\varepsilon)}{(x + dx\varepsilon)}\\ &= (y + dy\varepsilon)\left(\frac{x - dx\varepsilon}{(x + dx\varepsilon)(x - dx\varepsilon)}\right)\\ &= (y + dy\varepsilon)(x - dx\varepsilon)/x^2\\ &= (y(x - dx\varepsilon) + dy\varepsilon(x - dx\varepsilon))/x^2\\ &= (xy - y\,dx\varepsilon + x\,dy\varepsilon - dy\,dx\varepsilon^2)/x^2\\ &= (xy - y\,dx\varepsilon + x\,dy\varepsilon)/x^2\\ &= \frac{y}{x} - \frac{y}{x^2}\,dx\varepsilon + \frac{1}{x}\,dy\varepsilon\\ &= u(a) + (\frac{-y}{x^2}\,dx + \frac{1}{x}\,dy)\varepsilon\\ &= u(a) + \begin{bmatrix} \frac{-y}{x^2} & \frac{1}{x}\end{bmatrix}h\varepsilon \end{align*} $$Thus, we can conclude that:
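The conjugate trick above amounts to a general division rule for dual numbers: $\frac{a_1 + b_1\varepsilon}{a_2 + b_2\varepsilon} = \frac{a_1}{a_2} + \frac{b_1 a_2 - a_1 b_2}{a_2^2}\varepsilon$. A short sketch of $u(x,y) = y/x$ computed this way (the `Dual` class is again our own illustration):

```python
# Dual number a + b*eps with eps**2 == 0, supporting division via the
# conjugate trick: multiply top and bottom by (a2 - b2*eps).
class Dual:
    def __init__(self, a, b=0.0):
        self.a, self.b = a, b

    def __truediv__(self, other):
        # (a1 + b1 eps)/(a2 + b2 eps) = a1/a2 + (b1*a2 - a1*b2)/a2**2 eps
        return Dual(self.a / other.a,
                    (self.b * other.a - self.a * other.b) / (other.a * other.a))

def u(x, y):
    return y / x

# Partial wrt x: seed dx = 1, dy = 0.  Expect -y/x^2 = -3/4.
ux = u(Dual(2.0, 1.0), Dual(3.0, 0.0)).b
# Partial wrt y: seed dx = 0, dy = 1.  Expect 1/x = 1/2.
uy = u(Dual(2.0, 0.0), Dual(3.0, 1.0)).b
print(ux, uy)  # -0.75 0.5
```

Note that the real part of the denominator must be nonzero, mirroring the requirement $x \neq 0$ in the example.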
$$du = \frac{-y}{x^2}\,dx + \frac{1}{x}\,dy$$ $$u'(a) = \begin{bmatrix} \frac{-y}{x^2} & \frac{1}{x}\end{bmatrix}$$Motivated by this last example, we are now ready to define partial derivatives. The definition below relies on the fact that a scalar-valued differentiable function always has a derivative matrix with only one row.
Definition 1.2: Given a scalar-valued $u: \mathbb{R}^n \rightarrow \mathbb{R}$, we define the partial derivative of $u$ with respect to $x_i$, where $1 \le i \le n$, to be the $i^\text{th}$ entry, counting from the left, of the $1 \times n$ derivative matrix. We denote this as $\partial u/\partial x_i$.
It immediately follows that in Example 1.3, we had:
$$\frac{\partial u}{\partial x} = \frac{-y}{x^2} ,\,\,\,\, \frac{\partial u}{\partial y} = \frac{1}{x}$$More generally, for any $u: \mathbb{R}^n \rightarrow \mathbb{R}$, it follows from row-column vector multiplication, equivalent to the dot product, that at some particular point of differentiation, with $dx_i$ as the coordinates of the vector $h$:
$$du(h) = \sum_{1 \le i \le n} \frac{\partial u}{\partial x_i}dx_i$$When computing differentials, we used little nudge vectors. These nudge vectors can be interpreted more rigorously as the inputs of the differential, which could equally well have been computed using limits instead of dual numbers.
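The sum above is just the row vector of partials multiplied against the column of nudge coordinates. A small sketch, using the partials of $u(x,y) = y/x$ at $(2, 3)$ from Example 1.3 as given inputs:

```python
def du(partials, h):
    # du(h) = sum_i (du/dx_i) * dx_i  -- a row vector times a column vector,
    # i.e. the dot product of the partials with the nudge coordinates.
    return sum(p * dxi for p, dxi in zip(partials, h))

# u(x, y) = y/x at (x, y) = (2, 3): partials are [-y/x^2, 1/x] = [-0.75, 0.5].
print(du([-0.75, 0.5], [1.0, 0.0]))  # -0.75, the partial wrt x
print(du([-0.75, 0.5], [2.0, 4.0]))  # -0.75*2 + 0.5*4 = 0.5
```

Feeding in a standard basis vector $e_i$ as $h$ recovers the single partial $\partial u/\partial x_i$, consistent with Definition 1.2.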
One thing we can ask about these nudge vectors is: where do they live? We can imagine that when we choose a point at which to differentiate, we automatically create a new set of coordinate axes, centered at that point, where the nudge vectors live.
We can name these axes $dx$ and $dy$, since they represent the little nudges in the $x$ and $y$ directions respectively. We can also consider all the possible nudge vectors in this new coordinate system. This space of nudge vectors centered at a point $a$ is called the tangent space at $a$, and is denoted $T_a S$, where $S$ is the original space, in this case $\mathbb{R}^n$. Since differentials take nudge vectors as inputs, we can interpret them as linear maps between tangent spaces.