# Antimatroid, The

thoughts on mathematics, computer science and business

## MMXIII

With a new year comes some new thinking on the direction I’d like to take my work and consequently, the site. Since the site’s inception in 2008, I’ve attempted to follow a monthly publication format throughout the year with the usual summer hiatus. This approach has worked well in the past since most of my work consisted of weekend projects. As I’ve refined my capabilities, I’ve sought out more sophisticated interests and challenges to take on as projects and research opportunities. Naturally, these more advanced projects require more time to finish. As I think about the next year and beyond, I’m certain this trend will continue.

Humans are lousy at multitasking, working on multiple projects simultaneously starts to rapidly produce diminishing returns in quality of work. Since my projects will continue to grow in complexity, a monthly format will soon require that I work on dozens of projects in parallel. With each additional project, the amount of time I can dedicate to a project each month will diminish and each project will take progressively longer to complete. Consequently, I’ve decided that I’m no longer going to try and finish a project each month along with the associated monthly post, rather I’d like to focus on developing my project and research, and publishing by the maxim “it’s done when it’s done”.

I know many people in my position opt to publish progress reports for long standing projects. I’ve done this in the past and have always felt that it ended up requiring too much duplication without any real benefit and reading them always felt like reading an arbitrary page out of a piece of fiction- the context simply isn’t there and it simply isn’t interesting to read. With that in mind, I want to talk about what projects I am currently working on some projects that I’m considering approaching over the next year and beyond.

Last year I began working on a couple of Android projects, Viderefit and an unannounced platformer. Viderefit got out the door, and the platformer was stalled in its inception phase and I’ve been contemplating how it will come together on and off since then. This will be a larger programming project than the rest of the projects I have planned, but it will be a continuation of the work I’ve done in developing arcade games in years past. Since these larger programming projects are better suited for winter, beginning next year I will be focusing on this project in more detail. Similarly, I’ll be spending more time on Viderefit this upcoming spring as the weather here in Colorado starts to warm back up and my outdoor season begins.

For the past several years I’ve been responsible for designing and overseeing distributed processing systems. Much of what I’ve observed can be characterized mathematically as stochastic dynamical systems. I view this as an opportunity to take what I’ve learned in industry and to apply that knowledge to studying these systems more formally. Since last fall, I’ve been delving into the theory of stochastic processes and seeing how that theory corresponds to what I’ve observed in the field. My goal is provide a thorough analysis of these systems and under what conditions they are stable.

In an attempt to broaden the purview of my work, I’ve been exploring mathematical finance and its role in business. I’ve started writing an overview of stock and stock options from the point of view of all the parties involved- shareholders, members, the exchange, traders and so on. There is some overlap in the stochastic elements of my distributed processing systems research which I hope to apply here as well, in particular options pricing. Overall, the goal is to put together an accessible overview of the instrument suitable for developers.

Since the beginning of the year, I’ve been working on a 3D printed toy robot. The opportunity to combine art, electronics, and mechanics together was too good an opportunity to pass up. The focus is on combining multiple engineering disciplines to produce a tangible result. In particular, to explore the possibilities of 3D printing, get some exposure to designing simple analog circuits, and utilize rudimentary mechanical constructs to bring a simple toy robot to life. Due to the inherent complexities of the project, and prolonged logistics involved in sourcing electronics and mechanical materials, this project is anticipated to be finished before the end of the year.

In parallel to my work on distributed processing systems, I’ve spent time being involved in applying natural language processing to problems in the medical industry. This is one area that I see a lot of potential commercially and I think it is worthwhile to pursue the subject in greater detail. A couple of years ago I wrote about an interpreter framework and its use for evaluating sentential logic. I’d like to extend that work to predicate logic and soundness checking, and then applying that solution to evaluating the soundness of English text. I’m also interested in building a speech synthesizer that can mimic the voice of any speaker given ample training data and a Russian to English statistical translator.

My work this past fall on image processing and category recognition was a rewarding process. I’m currently interested in learning more about how I can augment the mobile computing experience with augmented reality, and in what ways that marriage can result in meaningful, practical solutions for end users, and not just something that satisfies some intellectual curiosity. Likewise, I’m interested in exploring how image processing and machine learning can be applied to interpret body language and other non-verbal communication.

I see the next year full of interesting projects and possibilities that I believe will broaden my understanding of several subjects and give me an opportunity to deepen my knowledge in areas where I’m already knowledgeable. The underlying themes here are exploring the fascinating possibilities that arise from the intersection of mathematics, computer science and business, and the theme of growth and continued pursuit of bigger and better intellectual challenges. We’ll see how the new format works out and whether or not it produces noticeably improved results by this time next year when I plan to revisit my progress on these projects.

Written by lewellen

2013-02-01 at 8:00 am

Posted in Announcements

## Expected Maximum and Minimum of Real-Valued Continuous Random Variables

### Introduction

This is a quick paper exploring the expected maximum and minimum of real-valued continuous random variables for a project that I’m working on. This paper will be somewhat more formal than some of my previous writings, but should be an easy read beginning with some required definitions, problem statement, general solution and specific results for a small handful of continuous probability distributions.

### Definitions

Definition (1) : Given the probability space, $(\Omega, \mathcal{F}, \mathbb{P})$, consisting of a set representing the sample space, $\Omega$, a $\text{Borel }\sigma \text{-algebra}$, $\mathcal{F}$, and a Lebesgue measure, $\mathbb{P}$, the following properties hold true:

1. Non-negativity: $\mathbb{P}(F) \ge 0 \quad \forall F \in \mathcal{F}$
2. Null empty set: $\mathbb{P}(\emptyset) = 0$
3. Countable additivity of disjoint sets $\displaystyle \mathbb{P}\left( \bigcup_{i=0}^{\infty} F_i \right) = \sum_{i=0}^{\infty} \mathbb{P}(F_i) \quad F_i \subset \mathcal{F}$

Definition (2) : Given a real-valued continuous random variable such that $X : \Omega \to \mathbb{R}$, the event the random variable takes on a fixed value, $x \in \mathbb{R}$, is the event $\lbrace \omega : X(\omega) = x \rbrace \in \mathcal{F}$ measured by the probability distribution function $f_X(x) = \mathbb{P}(X = x)$. Similarly, the event that the random variable takes on a range of values less than some fixed value, $x \in \mathbb{R}$, is the event $\lbrace \omega : X(\omega) \le x \rbrace \in \mathcal{F}$ measured by the cumulative distribution function $F_X(x) = \mathbb{P}(X \le x)$. By Definition, the following properties hold true:

1. $\displaystyle F_X(x) = \int_{-\infty}^{x} f_X(t) \, dt$
2. $\displaystyle \lim_{x \to \infty} F_X(x) = 1$
3. $\displaystyle 1 - \int_{-\infty}^{x} t f_X(t) \, dt = \int_{x}^{\infty} t f_X(t) \, dt$

Defintion (3) : Given a second real-valued continuous random variable, $Y : \Omega \to \mathbb{R}$, The joint event $\lbrace \omega : X(\omega) = x, Y(\omega) = y \rbrace \in \mathcal{F}$ $(x,y) \in \mathbb{R}^2$ will be measured by the joint probability distribution $f_{X, Y}(x,y) = \mathbb{P}(X = x, Y = y)$. If $X$ and $Y$ are statistically independent, then $f_{X,Y}(x,y) = f_X(x) f_Y(y)$.

Definition (4) : Given a real-valued continuous random variable, $X : \Omega \to \mathbb{R}$, the expected value is $\displaystyle \mathbb{E}(X) = \int_{-\infty}^{\infty} x f_X(x) \, dx$.

Definition (5) : (Law of the unconscious statistician) Given a real-valued continuous random variable, $X$, and a function, $g : \mathbb{R} \to \mathbb{R}$, then $g(X)$ is also a real-valued continuous random variable and its expected value is $\displaystyle \mathbb{E}(g(X)) = \int_{-\infty}^{\infty} g(x) f_X(x) \, dx$ provided the integral converges. Given two real-valued continuous random variables, $X, Y$, and a function, $g : \mathbb{R}^2 \to \mathbb{R}$, then $g(X, Y)$ is also a real-valued continuous random variable and its expected value is $\displaystyle \mathbb{E}(g(X,Y)) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} g(x,y) f_{X,Y}(x,y) \, dx \, dy$. Under the independence assumption of Definition (3), the expected value becomes $\displaystyle \mathbb{E}(g(X,Y)) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} g(x,y) f_X(x) f_Y(y) \, dx \, dy$.

Remark (1) : For the remainder of this paper, all real-valued continuous random variables will be assumed to be independent.

### Problem Statement

Theorem (1) : Given two real-valued continuous random variables $X, Y \in \Omega \to \mathbb{R}$, then the expected value of the minimum of the two variables is $\mathbb{E} \left( \min{ \left( X, Y \right) } \right ) = \mathbb{E} \left( X \right ) + \mathbb{E} \left( X \right ) - \mathbb{E} \left( \max{ \left( X, Y \right) } \right )$.

Lemma (1) : Given two real-valued continuous random variables $X, Y \in \Omega \to \mathbb{R}$, then the expected value of the maximum of the two variables is $\displaystyle \mathbb{E} \left( \max{ \left ( X, Y \right) } \right ) = \int_{-\infty}^{\infty} x f_X(x) F_Y(x) \, dx + \int_{-\infty}^{\infty} y f_Y(y) F_X(y) \, dy$

Proof of Lemma (1) :

$\displaystyle \mathbb{E} \left( \max{ \left ( X, Y \right) } \right ) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \max{\left( x, y \right)} f_X(x) f_Y(y) \, dx \, dy$ (Definition (5))

$\displaystyle = \int_{-\infty}^{\infty} \int_{-\infty}^{x} x f_X(x) f_Y(y) \, dy \, dx + \int_{-\infty}^{\infty} \int_{-\infty}^{y} y f_X(x) f_Y(y) \, dx \, dy$ (Definition (1.iii))

$\displaystyle = \int_{-\infty}^{\infty} x f_X(x) \left ( \int_{-\infty}^{x} f_Y(y) \, dy \right ) \, dx + \int_{-\infty}^{\infty} y f_Y(y) \left ( \int_{-\infty}^{y} f_X(x) \, dx \right ) \, dy$ (Fubini’s theorem)

$\displaystyle = \int_{-\infty}^{\infty} x f_X(x) F_Y(x) \, dx + \int_{-\infty}^{\infty} y f_Y(y) F_X(y) \, dy \quad \square$ (Definition (2.i))

Proof of Theorem (1)

$\displaystyle \mathbb{E} \left( \min{ \left ( X, Y \right) } \right ) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \min{\left( x, y \right)} f_X(x) f_Y(y) \, dx \, dy$ (Definition (4))

$\displaystyle = \int_{-\infty}^{\infty} \int_{x}^{\infty} x f_X(x) f_Y(y) \, dy \, dx + \int_{-\infty}^{\infty} \int_{y}^{\infty} y f_X(x) f_Y(y) \, dx \, dy$ (Definition (1.iii))

$\displaystyle = \int_{-\infty}^{\infty} x f_X(x) \left ( \int_{x}^{\infty} f_Y(y) \, dy \right ) \, dx + \int_{-\infty}^{\infty} y f_Y(y) \left ( \int_{y}^{\infty} f_X(x) \, dx \right ) \, dy$ (Fubini’s theorem)

$\displaystyle = \int_{-\infty}^{\infty} x f_X(x) \left (1 - \int_{-\infty}^{x} f_Y(y) \, dy \right ) \, dx + \int_{-\infty}^{\infty} y f_Y(y) \left (1 - \int_{-\infty}^{y} f_X(x) \, dx \right ) \, dy$ (Definition (2.iii))

$\displaystyle = \int_{-\infty}^{\infty} x f_X(x) \left (1 - F_Y(x) \right ) \, dx + \int_{-\infty}^{\infty} y f_Y(y) \left( 1 - F_X(y) \right ) \, dy$ (Definition (2.i))

$\displaystyle = \int_{-\infty}^{\infty} x f_X(x) \, dx - \int_{-\infty}^{\infty} x f_X(x) F_Y(x) \, dx + \int_{-\infty}^{\infty} y f_Y(y) \, dy - \int_{-\infty}^{\infty} y f_Y(y) F_X(y) \, dy$

$\displaystyle = \int_{-\infty}^{\infty} x f_X(x) \, dx + \int_{-\infty}^{\infty} y f_Y(y) \, dy - \left ( \int_{-\infty}^{\infty} x f_X(x) F_Y(x) \, dx + \int_{-\infty}^{\infty} y f_Y(y) F_X(y) \, dy \right )$

$\displaystyle = \mathbb{E}(X) + \mathbb{E}(Y) - \mathbb{E} \left( \max{ \left( X, Y \right) } \right) \quad \blacksquare$ (Definition (4), Lemma (1))

Remark (2) : For real values $x, y \in \mathbb{R}$, $\min{\left(x,y\right)} = x + y - \max{ \left( x, y \right) }$.

Proof Remark (2) : If $x \ge y$, then $\min{\left(x,y\right)} = y$, otherwise $x$. If $x \ge y$, then $\max{\left(x,y\right)} = x$, otherwise $y$. If $x \ge y$, then $\min{\left(x,y\right)} = y + \left( x - \max{\left(x,y\right)} \right )$, otherwise, $\min{\left(x,y\right)} = x + \left( y - \max{\left(x,y\right)} \right )$. Therefore, $\min{\left(x,y\right)} = x + y - \max{\left(x,y\right)} \quad \square$

### Worked Continuous Probability Distributions

The following section of this paper derives the expected value of the maximum of real-valued continuous random variables for the exponential distribution, normal distribution and continuous uniform distribution. The derivation of the expected value of the minimum of real-valued continuous random variables is omitted as it can be found by applying Theorem (1).

#### Exponential Distribution

Definition (6) : Given a real-valued continuous exponentially distributed random variable, $X \sim \text{Exp}(\alpha)$, with rate parameter, $\alpha > 0$, the probability density function is $\displaystyle f_X(x) = \alpha e^{-\alpha x}$ for all $x \ge 0$ and zero everywhere else.

Corollary (6.i) The cumulative distribution function of a real-valued continuous exponentially distributed random variable, $X \sim \text{Exp}(\alpha)$, is therefore $\displaystyle F_X(x) = 1 - e^{-\alpha x}$ for all $x \ge 0$ and zero everywhere else.

Proof of Corollary (6.i)

$\displaystyle F_X(x) = \int_{-\infty}^{x} f_x(t) \, dt = \int_{0}^{x} \alpha e^{-\alpha t} \, dt = -\frac{\alpha}{\alpha} e^{-\alpha t} \bigg|_{0}^{x} = 1 - e^{- \alpha x} \quad \square$

Corollary (6.ii) : The expected value of a real-valued continuous exponentially distributed random variable, $X \sim \text{Exp}(\alpha)$, is therefore $\displaystyle \frac{1}{\alpha}$.

Proof of Corollary (6.ii)

The expected value is $\mathbb{E}(X) = \frac{1}{\alpha}$ by Definition (4) and Lemma (2) $\square$.

Lemma (2) : Given real values $\alpha, \gamma \in \mathbb{R} \quad \gamma \neq 0$, then $\displaystyle \int_{0}^{\infty} \alpha x e^{-\gamma x} \, dx = \frac{\alpha}{\gamma^2}$.

Proof of Lemma (2) :

$\displaystyle \int_{0}^{\infty} \alpha x e^{-\gamma x} \, dx = - x \frac{\alpha}{\gamma} e^{-\alpha x} \bigg|_{0}^{\infty} + \int_{0}^{\infty} \frac{\alpha}{\gamma} e^{-\gamma x} \, dx = - x \frac{\alpha}{\gamma} e^{-\alpha x} - \frac{\alpha}{\gamma}^2 e^{-\alpha x} \bigg|_{0}^{\infty}$

$\displaystyle = \lim_{x \to \infty} \left( - x \frac{\alpha}{\gamma} e^{-\alpha x} - \frac{\alpha}{\gamma^2} e^{-\alpha x} \right ) - \left( - \frac{\alpha}{\gamma^2} \right) = \frac{\alpha}{\gamma^2} \quad \square$

Theorem (2) : The expected value of the maximum of the real-valued continuous exponentially distributed random variables $X \sim \text{Exp}(\alpha)$, $Y \sim \text{Exp}(\beta)$ is $\displaystyle \frac{1}{\alpha} + \frac{1}{\beta} - \frac{1}{\alpha + \beta}$.

Proof of Theorem (2) :

$\displaystyle \mathbb{E} \left ( \max{\left( X, Y \right)} \right ) = \int_{-\infty}^{\infty} x f_X(x) F_Y(x) \, dx + \int_{-\infty}^{\infty} y f_Y(y) F_X(y) \, dy$ (Lemma (1))

$\displaystyle = \int_{0}^{\infty} x \alpha e^{-\alpha x} \left( 1 - e^{-\beta x} \right ) \, dx + \int_{0}^{\infty} y \beta e^{-\beta y} \left( 1 - e^{-\alpha y} \right ) \, dy$ (Corollary (6.i))

$\displaystyle = \left( \int_{0}^{\infty} x \alpha e^{-\alpha x} \, dx \right )- \left( \int_{0}^{\infty} x \alpha e^{-(\alpha + \beta) x} \, dx \right ) + \left( \int_{0}^{\infty} y \beta e^{-\beta y} \, dy \right ) - \left( \int_{0}^{\infty} y \beta e^{-(\alpha + \beta) y} \, dy \right)$ (Integral linearity)

$\displaystyle = \frac{1}{\alpha} - \frac{\alpha}{(\alpha + \beta)^2} + \frac{1}{\beta} - \frac{\beta}{(\alpha + \beta)^2}$ (Lemma (2), Corollary (6.ii))

$\displaystyle = \frac{1}{\alpha} + \frac{1}{\beta} - \frac{1}{\alpha+\beta} \quad \blacksquare$

#### Normal Distribution

Definition (7) : The following Gaussian integral is the error function $\displaystyle \text{erf}(x) = \frac{2}{ \sqrt{\pi} } \int_{0}^{x} e^{ - u^2 } \, du$ for which the following properties hold true:

1. Odd function: $\displaystyle \text{erf}(-x) = -\text{erf}(x)$
2. Limiting behavior: $\displaystyle \lim_{x \to \infty} \text{erf}(x) = 1$

Definition (8) : Given a real-valued continuous normally distributed random variable , $X \sim \mathcal{N}(\mu, \sigma)$, with mean parameter, $\mu$. and standard deviation parameter, $\sigma \neq 0$, the probability density function is $\displaystyle f_X(x) = \frac{1}{\sigma \sqrt{2 \pi} } e^{ -\frac{1}{2} \left ( \frac{x - \mu}{\sigma} \right )^2 }$ for all values on the real line.

Corollary (8.i) : The cumulative distribution function of a real-valued continuous normally distributed random variable, $X \sim \mathcal{N}(\mu, \sigma)$, is therefore $\displaystyle F_X(x) = \frac{1}{2} \left (1 + \text{erf} \left ( \frac{x-\mu}{\sqrt{2} \sigma} \right ) \right )$.

Proof of Corollary (8.i) :

$\displaystyle F_X(x) = \int^{x}_{-\infty} \frac{1}{\sigma \sqrt{2 \pi} } e^{ - \left ( \frac{t - \mu}{\sqrt{2} \sigma} \right )^2 } \, dt$ (Definition (2.i))

$\displaystyle = \frac{1}{ \sqrt{\pi} } \int_{-\infty}^{\frac{x - \mu}{\sqrt{2} \sigma}} e^{ - u^2 } \, du$ (U-substitution with $\displaystyle u = \frac{t - \mu}{\sqrt{2} \sigma} \implies du = \frac{1}{\sqrt{2} \sigma} dt$)

$\displaystyle = \frac{1}{ \sqrt{\pi} } \int_{-\infty}^{ 0 } e^{ - u^2 } \, du + \frac{1}{ \sqrt{\pi} } \int_{0}^{ \frac{x - \mu}{\sqrt{2} \sigma} } e^{ - u^2 } \, du$ (Definition (2.iii))

$\displaystyle = - \frac{1}{ \sqrt{\pi} } \int_{0}^{-\infty} e^{ - u^2 } \, du + \frac{1}{ \sqrt{\pi} } \int_{0}^{ \frac{x - \mu}{\sqrt{2} \sigma} } e^{ - u^2 } \, du$ (Reverse limits of integration)

$\displaystyle = \frac{1}{2} \lim_{u \to \infty} - \text{erf}(-u) + \frac{1}{2} \text{erf} \left ( \frac{x - \mu}{\sqrt{2} \sigma} \right )$ (Definition (7))

$\displaystyle = \frac{1}{2} \lim_{u \to \infty} \text{erf}(u) + \frac{1}{2} \text{erf} \left ( \frac{x - \mu}{\sqrt{2} \sigma} \right )$ (Definition (7.i))

$\displaystyle = \frac{1}{2} + \frac{1}{2} \text{erf} \left ( \frac{x - \mu}{\sqrt{2} \sigma} \right )$ (Definition (7.ii))

$\displaystyle = \frac{1}{2} \left (1 + \text{erf} \left ( \frac{x-\mu}{\sqrt{2} \sigma} \right ) \right ) \quad \square$

Corollary (8.ii) : The expected value of a real-valued continuous normally distributed random variable, $X \sim \mathcal{N}(\mu, \sigma)$, is therefore $\mathbb{E}(X) = \mu$.

Proof of Corollary (8.ii) :

$\displaystyle \mathbb{E}(X) = \int_{-\infty}^{\infty} x f_X(x) \, dx = \int_{-\infty}^{\infty} x \frac{1}{\sigma \sqrt{2 \pi} } e^{ -\frac{1}{2} \left ( \frac{x - \mu}{\sigma} \right )^2 } \, dx$ (Definition (4))

$\displaystyle = \int_{-\infty}^{\infty} (\sqrt{2}\sigma u + \mu) \frac{1}{\sqrt{\pi} } e^{ - u^2 } \, du$ (U-substitution with $\displaystyle u = \frac{x-\mu}{\sqrt{2} \sigma} \implies du = \frac{1}{\sqrt{2}\sigma} dx$ $\sqrt{2}\sigma u + \mu = x$)

$\displaystyle = \frac{\sqrt{2}\sigma}{\sqrt{\pi}} \int_{-\infty}^{\infty} u e^{ - u^2 } \, du + \frac{\mu}{\sqrt{\pi}} \int_{-\infty}^{\infty} e^{ - u^2 } \, du$ (Integral linearity)

$\displaystyle = \frac{\sqrt{2}\sigma}{\sqrt{\pi}} \left( \int_{-\infty}^{0} u e^{ - u^2 } \, du + \int_{0}^{\infty} u e^{ - u^2 } \, du \right ) + \frac{\mu}{\sqrt{\pi}} \int_{-\infty}^{\infty} e^{ - u^2 } \, du$ (Definition (1.iii))

$\displaystyle = \frac{\sqrt{2}\sigma}{\sqrt{\pi}} \left( - \int_{0}^{\infty} u e^{ - u^2 } \, du + \int_{0}^{\infty} u e^{ - u^2 } \, du \right ) + 2 \frac{\mu}{\sqrt{\pi}} \int_{0}^{\infty} e^{ - u^2 } \, du$ ($u e^{-u^2}$ is odd, $e^{-u^2}$ is even)

$\displaystyle = \mu \frac{2}{\sqrt{\pi}} \left ( \frac{\sqrt{\pi}}{2} \lim_{x \to \infty} \text{erf}(x) \right ) = \mu \quad \square$ (Definition (7), Definition (7.ii))

Definition (9) : Given a real-valued continuous normally distributed random variable, $X \sim \mathcal{N}(0, 1)$, the probability distribution function will be denoted as standard normal probability distribution function, $\phi(x)$, and the cumulative distribution function as the standard normal cumulative distribution function, $\Phi(x)$. By definition, the following properties hold true:

1. Non-standard probability density function: If $X \sim \mathcal{N}(\mu, \sigma)$, then $\displaystyle f_X(x) = \frac{1}{\sigma} \phi \left( \frac{x - \mu}{\sigma} \right )$
2. Non-standard cumulative distribution function: If $X \sim \mathcal{N}(\mu, \sigma)$, then $\displaystyle F_X(x) = \Phi\left( \frac{x - \mu}{\sigma} \right )$
3. Complement: $\Phi(-x) = 1 - \Phi(x)$

Definition (10) : [PaRe96] Given $\phi(x)$ and $\Phi(x)$, the following integrals hold true:

1. $\displaystyle \int_{-\infty}^\infty x\Phi(a+bx)\phi(x) \, dx = \frac{b}{\sqrt{1+b^2}} \phi \left( \frac{a}{\sqrt{1+b^2}} \right )$
2. $\displaystyle \int_{-\infty}^\infty \Phi(a+bx)\phi(x) \, dx = \Phi \left ( \frac{a}{\sqrt{1+b^2}} \right )$

Theorem (3) : The expected value of the maximum of the real-valued continuous normally distributed random variables $X \sim \mathcal{N}(\mu, \sigma)$, $Y \sim \mathcal{N}(\nu, \tau)$ is $\displaystyle \sqrt{ \sigma^2 + \tau^2 } \phi \left( \frac{\nu - \mu}{\sqrt{\sigma^2 + \tau^2 }} \right ) + (\nu - \mu) \Phi \left ( \frac{\nu - \mu}{\sqrt{\sigma^2 + \tau^2}} \right ) + \mu$.

Lemma (3) : Given real-valued continuous normally distributed random variables $X \sim \mathcal{N}(\mu, \sigma)$, $Y \sim \mathcal{N}(\nu, \tau)$, $\displaystyle \int_{-\infty}^{\infty} y f_Y(y) F_X(y) \, dy = \frac{\tau^2}{\sqrt{\sigma^2 + \tau^2 }} \phi \left( \frac{\nu - \mu}{\sqrt{\sigma^2 + \tau^2 }} \right ) + \nu \Phi \left ( \frac{\nu - \mu}{\sqrt{\sigma^2 + \tau^2}} \right )$.

Proof of Lemma (3) :

$\displaystyle \int_{-\infty}^{\infty} y f_Y(y) F_X(y) \, dy = \int_{-\infty}^{\infty} y \frac{1}{\tau} \phi \left ( \frac{y-\nu}{\tau} \right ) \Phi \left ( \frac{y-\mu}{\sigma} \right ) \, dy$ (Definition (9.i), Definition (9.ii))

$\displaystyle = \int_{-\infty}^{\infty} (u \tau + \nu) \phi(u) \Phi \left ( \frac{u \tau + \nu -\mu}{\sigma} \right ) \, du$ (U-substitution with $\displaystyle u = \frac{y-\nu}{\tau} \implies du = \frac{1}{\tau} dy$, $y = u \tau + \nu$)

$\displaystyle = \tau \int_{-\infty}^{\infty} u \phi(u) \Phi \left ( \frac{u \tau + \nu -\mu}{\sigma} \right ) \, du + \nu \int_{-\infty}^{\infty} \phi(u) \Phi \left ( \frac{u \tau + \nu -\mu}{\sigma} \right ) \, du$ (Integral linearity)

$\displaystyle = \frac{\tau^2}{\sqrt{\sigma^2 + \tau^2 }} \phi \left( \frac{\nu - \mu}{\sqrt{\sigma^2 + \tau^2 }} \right ) + \nu \Phi \left ( \frac{\nu - \mu}{\sqrt{\sigma^2 + \tau^2}} \right ) \, \square$ (Definition (10.i), Definition (10.ii))

Proof of Theorem (3) :

$\displaystyle \mathbb{E} \left( \max{ \left ( X, Y \right) } \right ) = \int_{-\infty}^{\infty} x f_X(x) F_Y(x) \, dx + \int_{-\infty}^{\infty} y f_Y(y) F_X(y) \, dy$ (Lemma (1))

$\displaystyle = \int_{-\infty}^{\infty} x \frac{1}{\sigma} \phi \left ( \frac{x-\mu}{\sigma} \right ) \Phi \left ( \frac{x-\nu}{\tau} \right ) \, dy + \int_{-\infty}^{\infty} y \frac{1}{\tau} \phi \left ( \frac{y-\nu}{\tau} \right ) \Phi \left ( \frac{y-\mu}{\sigma} \right ) \, dy$ (Definition (11.i), Definition (11.ii))

$\displaystyle = \frac{\sigma^2}{\sqrt{\sigma^2 + \tau^2}} \phi \left( \frac{\mu - \nu}{\sqrt{\sigma^2 + \tau^2 }} \right ) + \mu \Phi \left ( \frac{\mu - \nu}{\sqrt{\sigma^2 + \tau^2}} \right ) + \frac{\tau^2}{\sqrt{\sigma^2 + \tau^2 }} \phi \left( \frac{\nu - \mu}{\sqrt{\sigma^2 + \tau^2 }} \right ) + \nu \Phi \left ( \frac{\nu - \mu}{\sqrt{\sigma^2 + \tau^2}} \right )$ (Lemma (3))

$\displaystyle = \frac{\sigma^2}{\sqrt{\sigma^2 + \tau^2}} \phi \left( \frac{\nu - \mu}{\sqrt{\sigma^2 + \tau^2 }} \right ) + \mu \left ( 1 - \Phi \left ( \frac{\nu - \mu}{\sqrt{\sigma^2 + \tau^2}} \right ) \right ) + \frac{\tau^2}{\sqrt{\sigma^2 + \tau^2 }} \phi \left( \frac{\nu - \mu}{\sqrt{\sigma^2 + \tau^2 }} \right ) + \nu \Phi \left ( \frac{\nu - \mu}{\sqrt{\sigma^2 + \tau^2}} \right )$ (Definition (9.iii))

$\displaystyle = \frac{\sigma^2}{\sqrt{\sigma^2 + \tau^2}} \phi \left( \frac{\nu - \mu}{\sqrt{\sigma^2 + \tau^2 }} \right ) - \mu \Phi \left ( \frac{\nu - \mu}{\sqrt{\sigma^2 + \tau^2}} \right ) + \mu + \frac{\tau^2}{\sqrt{\sigma^2 + \tau^2 }} \phi \left( \frac{\nu - \mu}{\sqrt{\sigma^2 + \tau^2 }} \right ) + \nu \Phi \left ( \frac{\nu - \mu}{\sqrt{\sigma^2 + \tau^2}} \right )$

$\displaystyle = \sqrt{ \sigma^2 + \tau^2 } \phi \left( \frac{\nu - \mu}{\sqrt{\sigma^2 + \tau^2 }} \right ) + (\nu - \mu) \Phi \left ( \frac{\nu - \mu}{\sqrt{\sigma^2 + \tau^2}} \right ) + \mu \quad \blacksquare$

#### Continuous Uniform Distribution

Definition (11) : Given a real-valued continuous uniformly distributed random variable, $X \sim U(a,b)$, with inclusive boundaries $a, b$ such that $a < b$, the probability density function is $\displaystyle f_X(x) = \frac{1}{b-a}$ for all $x \in [a, b]$ and zero everywhere else.

Corollary (11.i) : The cumulative distribution function of a real-valued continuous uniformly distributed random variable, $X \sim U(a,b)$, is therefore $\displaystyle F_X(x) = \begin{cases} 0 & x < a \\ \frac{x-a}{b-a} & x \in [a,b] \\ 1 & x > b \end{cases}$.

Proof of Corollary (11.i) :

$\displaystyle F_X(x) = \int_{-\infty}^{\infty} f_X(t) \, dt = \int_{a}^{b} \frac{1}{b-a} \, dt = \frac{1}{b-a} x \bigg|_{a}^{x} = \begin{cases} 0 & x < a \\ \frac{x-a}{b-a} & x \in [a,b] \\ 1 & x > b \end{cases}\quad \square$.

Corollary (11.ii) : The expected value of a real-valued continuous uniformly distributed random variable, $X \sim U(a,b)$, is therefore $\displaystyle \frac{a+b}{2}$.

Proof of Corollary (11.ii)

$\displaystyle \mathbb{E}(X) = \int_{-\infty}^{\infty} x f_X(x) \, dx = \int_{a}^{b} x \frac{1}{b-a} \, dx = \frac{x^2}{2(b-a)} \bigg|_{a}^{b} = \frac{ b^2 -a^2 }{ 2(b-a) } = \frac{b+a}{2} \quad \square$

Theorem (4) : The expected value of the maximum of real-valued continuous uniformly distributed random variables $X \sim U(a,b)$, $Y \sim U(c,d)$ is $\displaystyle \begin{cases} \frac{c+d}{2} & a < b \le c < d \\ \frac{4 (b^3 - c^3) - 3 (c + a) (b^2 - c^2) }{6(d-c)(b-a)} + \frac{d^2 - b^2}{2(d-c)} & a \le c < b \le d \\ \frac{4(b^3-c^3) - 3(c+a)(b^2-c^2)}{6(b-a)(d-c)} + \frac{b^2-d^2}{2(b-a)} & a \le c < d \le b \\ \frac{4 (b^3-a^3) - 3 (c + a) (b^2 - a^2) }{6(d-c)(b-a)} + \frac{d^2-b^2}{2(d-c)} & c \le a < b \le d\\ \frac{4 (d^3 - a^3) -3 (c+a) (d^2-a^2) }{6(b-a)(d-c)} + \frac{b^2-d^2}{2(b-a)} & c \le a < d \le b \\ \frac{a+b}{2} & c < d \le a < b \end{cases}$.

Proof of Theorem (4) :

$\displaystyle \mathbb{E} \left ( \max{ \left ( X, Y \right )} \right ) = \int_{-\infty}^{\infty} x f_X(x) F_Y(x) \, dx + \int_{-\infty}^{\infty} y f_Y(y) F_X(y) \, dy$ (Lemma (1))

$\displaystyle = \int_{a}^{b} x \frac{1}{b-a} \begin{cases} 0 & x < c \\ \frac{x - c}{d-c} & x \in [c,d] \\ 1 & \text{otherwise} \end{cases} \, dx + \int_{c}^{d} y \frac{1}{d-c} \begin{cases} 0 & y < a \\ \frac{y - a}{b-a} & y \in [a,b] \\ 1 & \text{otherwise} \end{cases} \, dy$

Case (1) : $a < b \le c < d$

$\displaystyle = \left ( \int_{a}^{b} x \frac{1}{b-a} 0 \, dx \right ) + \left ( \int_{c}^{d} y \frac{1}{d-c} 1 \, dy \right )$

$\displaystyle = \frac{c+d}{2} \quad \square$

Case (2) : $a \le c < b \le d$

$\displaystyle = \left ( \int_{a}^{c} x \frac{1}{b-a} 0 \, dx + \int_{c}^{b} x \frac{1}{b-a} \frac{x-c}{d-c} \, dx \right ) + \left ( \int_{c}^{b} y \frac{1}{d-c} \frac{y-a}{b-a} \, dy + \int_{b}^{d} y \frac{1}{d-c} 1 \, dy \right )$

$\displaystyle = \frac{2 x^3 - 3 c x^2}{6(b-a)(d-c)} \bigg|_{c}^{b} + \frac{2 y^3 - 3ay^2 }{6(d-c)(b-a)} \bigg|_{c}^{b} + \frac{y^2}{2(d-c)} \bigg|_{b}^{d}$

$\displaystyle = \frac{2 (b^3 - c^3) - 3 c (b^2 - c^2) }{6(b-a)(d-c)} + \frac{2 (b^3 - c^3) - 3 a (b^2 - c^2) }{6(d-c)(b-a)} + \frac{d^2 - b^2}{2(d-c)}$

$\displaystyle = \frac{4 (b^3 - c^3) - 3 (c + a) (b^2 - c^2) }{6(d-c)(b-a)} + \frac{d^2 - b^2}{2(d-c)} \quad \square$

Case (3) : $a \le c < d \le b$

$\displaystyle = \left ( \int_{a}^{c} x \frac{1}{b-a} 0 \, dx + \int_{c}^{b} x \frac{1}{b-a} \frac{x-c}{d-c} \, dx + \int_{d}^{b} x \frac{1}{b-a} 1 \, dx \right ) + \left ( \int_{c}^{d} y \frac{1}{d-c} \frac{y-a}{b-a} \, dy \right)$

$\displaystyle = \frac{2x^3 - 3cx^2}{6(b-a)(d-c)} \bigg|_{c}^{b} + \frac{x^2}{2(b-a)} \bigg|_{d}^{b} + \frac{2y^3 - 3ay^2}{6(b-a)(d-c)} \bigg|_{c}^{d}$

$\displaystyle = \frac{2(b^3-c^3) - 3c(b^2-c^2)}{6(b-a)(d-c)} + \frac{b^2-d^2}{2(b-a)} + \frac{2(d^3-c^3) - 3a(d^2-c^2)}{6(b-a)(d-c)}$

$\displaystyle = \frac{4(b^3-c^3) - 3(c+a)(b^2-c^2)}{6(b-a)(d-c)} + \frac{b^2-d^2}{2(b-a)} \quad \square$

Case (4) : $c \le a < b \le d$

$\displaystyle = \left( \int_{a}^{b} x \frac{1}{b-a} \frac{x-c}{d-c} \, dx \right ) + \left( \int_{c}^{a} y \frac{1}{d-c} 0 \, dy + \int_{a}^{b} y \frac{1}{d-c} \frac{y-a}{b-a} \, dy + \int_{b}^{d} y \frac{1}{d-c} 1 \, dy \right )$

$\displaystyle = \frac{2 x^3 - 3 c x^2 }{6(d-c)(b-a)} \bigg|_{a}^{b} + \frac{2 y^3 - 3 a y^2 }{6(d-c)(b-a)} \bigg|_{a}^{b} + \frac{y^2}{2(d-c)} \bigg|_{b}^{d}$

$\displaystyle = \frac{2 (b^3-a^3) - 3 c (b^2 - a^2) }{6(d-c)(b-a)} + \frac{2 (b^3-a^3) - 3 a (b^2 -a^2) }{6(d-c)(b-a)} + \frac{d^2-b^2}{2(d-c)}$

$\displaystyle = \frac{4 (b^3-a^3) - 3 (c + a) (b^2 - a^2) }{6(d-c)(b-a)} + \frac{d^2-b^2}{2(d-c)} \quad \square$

Case (5) : $c \le a < d \le b$

$\displaystyle = \left ( \int_{a}^{d} x \frac{1}{b-a} \frac{x-c}{d-c} \, dx + \int_{d}^{b} x \frac{1}{b-a} 1 \, dx \right ) + \left ( \int_{c}^{a} y \frac{1}{d-c} 0 \, dy + \int_{a}^{d} y \frac{1}{d-c} \frac{y-a}{b-a} \, dy \right )$

$\displaystyle = \frac{2 x^3 -3 c x^2}{6(b-a)(d-c)} \bigg|_{a}^{d} + \frac{x^2}{2(b-a)} \bigg|_{d}^{b} + \frac{2 y^3 -3 a y^2}{6(b-a)(d-c)} \bigg|_{a}^{d}$

$\displaystyle = \frac{2 (d^3 - a^3) -3 c (d^2-a^2) }{6(b-a)(d-c)} + \frac{b^2-d^2}{2(b-a)} + \frac{2 (d^3-a^3) -3 a (d^2-a^2)}{6(b-a)(d-c)}$

$\displaystyle = \frac{4 (d^3 - a^3) -3 (c+a) (d^2-a^2) }{6(b-a)(d-c)} + \frac{b^2-d^2}{2(b-a)} \quad \square$

Case (6) : $c < d \le a < b$

$\displaystyle = \left ( \int_{a}^{b} x \frac{1}{b-a} 1 \, dx \right ) + \left ( \int_{c}^{d} y \frac{1}{d-c} 0 \, dy \right )$

$\displaystyle = \frac{a+b}{2}$

$\displaystyle \therefore \mathbb{E} \left ( \max{\left ( X, Y \right )} \right ) = \begin{cases} \frac{c+d}{2} & a < b \le c < d \\ \frac{4 (b^3 - c^3) - 3 (c + a) (b^2 - c^2) }{6(d-c)(b-a)} + \frac{d^2 - b^2}{2(d-c)} & a \le c < b \le d \\ \frac{4(b^3-c^3) - 3(c+a)(b^2-c^2)}{6(b-a)(d-c)} + \frac{b^2-d^2}{2(b-a)} & a \le c < d \le b \\ \frac{4 (b^3-a^3) - 3 (c + a) (b^2 - a^2) }{6(d-c)(b-a)} + \frac{d^2-b^2}{2(d-c)} & c \le a < b \le d\\ \frac{4 (d^3 - a^3) -3 (c+a) (d^2-a^2) }{6(b-a)(d-c)} + \frac{b^2-d^2}{2(b-a)} & c \le a < d \le b \\ \frac{a+b}{2} & c < d \le a < b \end{cases} \quad \blacksquare$

### Summary Table

The following summary table lists the expected value of the maximum of real-valued continuous random variables for the exponential distribution, normal distribution and continuous uniform distribution. The corresponding minimum can be obtained by Theorem (1).

Random Variables Maximum
$X \sim$ $Y \sim$
$\text{Exp}(\alpha)$ $\text{Exp}(\beta)$ $\displaystyle \frac{1}{\alpha} + \frac{1}{\beta} - \frac{1}{\alpha + \beta}$
$\mathcal{N}(\mu, \sigma)$ $\mathcal{N}(\nu, \tau)$ $\displaystyle \sqrt{ \sigma^2 + \tau^2 } \phi \left( \frac{\nu - \mu}{\sqrt{\sigma^2 + \tau^2 }} \right ) + (\nu - \mu) \Phi \left ( \frac{\nu - \mu}{\sqrt{\sigma^2 + \tau^2}} \right ) + \mu$
$\text{U}(a, b)$ $\text{U}(c, d)$ $\displaystyle \begin{cases} \frac{c+d}{2} & a < b \le c < d \\ \frac{4 (b^3 - c^3) - 3 (c + a) (b^2 - c^2) }{6(d-c)(b-a)} + \frac{d^2 - b^2}{2(d-c)} & a \le c < b \le d \\ \frac{4(b^3-c^3) - 3(c+a)(b^2-c^2)}{6(b-a)(d-c)} + \frac{b^2-d^2}{2(b-a)} & a \le c < d \le b \\ \frac{4 (b^3-a^3) - 3 (c + a) (b^2 - a^2) }{6(d-c)(b-a)} + \frac{d^2-b^2}{2(d-c)} & c \le a < b \le d\\ \frac{4 (d^3 - a^3) -3 (c+a) (d^2-a^2) }{6(b-a)(d-c)} + \frac{b^2-d^2}{2(b-a)} & c \le a < d \le b \\ \frac{a+b}{2} & c < d \le a < b \end{cases}$

### References

[GrSt01] Grimmett, Geoffrey, and David Stirzaker. Probability and Random Processes. Oxford: Oxford UP, 2001. Print.

[PaRe96] Patel, Jagdish K., and Campbell B. Read. Handbook of the Normal Distribution. 2nd ed. New York: Marcel Dekker, 1996. Print.

Written by lewellen

2013-01-01 at 8:00 am

## Parallel Merge Sort in Java

### Introduction

This past November I was a pretty busy getting settled into a new job and trying to balance life’s other priorities. With a new job also came a new technology stack and while I’ll continue to do C# development in my free time, I’m going to be going back to doing Java development after a seven year hiatus. Before starting the new job, I decided to refresh my memory of the language’s finer details when it comes to generics and threading. So, I decided to implement something simple and settled on a parallel implementation of merge sort. This article is going to focus on making use of Java’s various features and evaluating the theoretical and empirical run time performance of the sequential and parallel versions of the algorithm.

### Sequential Approach

#### Specification

Given a list of values, the list is sorted by employing a divide and conquer method that partitions the list into two (roughly) equal sized partitions, followed by recursively sorting each partition and then merging the two resulting sorted partitions into the final sorted list.

#### Pseudocode

 $\displaystyle \textbf{MERGE}(X, Y) \newline \indent L_X \leftarrow \textbf{LENGTH}(X) \newline \indent L_Y \leftarrow \textbf{LENGTH}(Y) \newline \indent L_Z \leftarrow L_X + L_Y \newline \indent Z \leftarrow [L_Z] \newline \indent i, j, k \leftarrow 0, 0, 0 \newline \newline \indent \textbf{while} \quad k < L_Y \newline \indent \indent \textbf{if} \quad i < L_X \land j \ge L_Y \newline \indent \indent \indent \indent Z[k] \leftarrow X[i] \newline \indent \indent \indent \indent i \leftarrow i + 1 \newline \indent \indent \textbf{else-if} \quad i \ge L_X \land j < L_Y \newline \indent \indent \indent \indent Z[k] \leftarrow Y[j] \newline \indent \indent \indent \indent j \leftarrow j + 1 \newline \indent \indent \textbf{else-if} \quad i < L_X \land j < L_Y \newline \indent \indent \indent \textbf{if} \quad X[i] \le Y[j] \newline \indent \indent \indent \indent Z[k] \leftarrow X[i] \newline \indent \indent \indent \indent i \leftarrow i + 1 \newline \indent \indent \indent \textbf{else} \newline \indent \indent \indent \indent Z[k] \leftarrow Y[j] \newline \indent \indent \indent \indent j \leftarrow j + 1 \newline \indent \indent k \leftarrow k + 1 \newline \newline \indent \textbf{return} \quad Z$ $\displaystyle \textbf{MERGE-SORT}(X) \newline \indent L \leftarrow \textbf{LENGTH}(X) \newline \indent \textbf{if} \quad L \le 1 \newline \indent \indent \textbf{return} \quad X \newline \newline \indent \textbf{return} \quad \textbf{MERGE} ( \newline \indent \indent \textbf{MERGE-SORT} ( \newline \indent \indent \indent \textbf{PARTITION}(X, 0, \lfloor\ L / 2 \rfloor + L \mod 2) \newline \indent \indent ), \newline \indent \indent \textbf{MERGE-SORT}( \newline \indent \indent \indent \textbf{PARTITION}(X, \lfloor\ L / 2 \rfloor + L \mod 2, \lfloor\ L / 2 \rfloor) \newline \indent \indent ) \newline \indent )$ $\displaystyle \textbf{PARTITION}(X, s, L) \newline \indent Y \leftarrow [L] \newline \indent k \leftarrow 0 \newline \newline \indent \textbf{while} \quad k < L \newline \indent \indent Y[k] \leftarrow X[s + k] \newline \indent \indent k \leftarrow k + 1 \newline \newline \indent \textbf{return} \quad Y$

#### Time Complexity

In terms of time complexity, the algorithm is on the order of $\mathcal{O}(n \log_2(n))$. To show this, observe that the input size, $n$, is divided into to two equal parts, $2 T(n/2)$, followed by a merge operation, $f(n)$. This leads to the recurrence relation given by $\displaystyle T(n) = \begin{cases} 1 & n \le 1 \\ 2 T(n/2) + f(n) & n > 1 \end{cases}$. By induction, the recurrence relation is reduced to $\displaystyle T(n) = 2^k T(n/2^k) + \sum_{m = 0}^{k-1} 2^n f \left ( \frac{n}{2^m} \right )$. Observing that the merge function is on the order $\mathcal{O}(n)$, i.e., $f(n) = c n$, then the expression reduces further to $\displaystyle T(n) = 2^k T \left ( \frac{n}{2^k} \right ) + \sum_{m = 0}^{k-1} c n$ and $\displaystyle T(n) = 2^k T \left ( \frac{n}{2^k} \right ) + c n k$. As the number of subdivisions increases, eventually $n$ will be reduced to $1$. As such, let $1 = n/2^k$ which implies $2^k = n$ which implies $k = \log_2(n)$, and thus $T(n) = n T(1) + c n \log_2(n)$. Therefore, $T(n) \subset \mathcal{O}(n \log_2 n) \quad \square$

#### Implementation

In attempting to implement a generic version of merge sort there were a few matters that needed to be addressed. First, the type being sorted required an order relation to be specified so that the merge operation could take place. This is facilitated by restricting the type parameter T to Comparable<T>. Secondly, I had forgotten that you can’t initialize arrays of generics in Java like you can in C# [1]. To workaround this limitation, I settled on specifying the desired operations over implementations of the List<T> interface. Finally, since the List<T> interface makes no guarantees that its implementations provide (near) constant time reading or writing of elements from the list, an additional generic parameter, L, was added so that only those implementations of the List<T> and RandomAccess [2] interfaces could use this implementation of merge sort. The rest of the implementation is a near facsimile of the pseudocode.

package com.wordpress.antimatroid;

import java.util.List;
import java.util.RandomAccess;

public interface IListOperation
<T, L extends List<T> & RandomAccess> {

L execute();
}

package com.wordpress.antimatroid;

import java.util.ArrayList;
import java.util.List;
import java.util.RandomAccess;

public class CopyListOperation
<T, L extends List<T> & RandomAccess>
implements IListOperation<T, L> {

private final L source;
private final int length, initialIndex;

public CopyListOperation(L source, int length, int initialIndex) {
if(source == null)
throw new IllegalArgumentException("source must be non-null.");

if(length < 0)
throw new IllegalArgumentException(String.format(
"length, %d, must be greater than or equal to zero.", length
));

if(initialIndex < 0)
throw new IllegalArgumentException(String.format(
"initialIndex, %d, must be greater than or equal to zero.", initialIndex
));

if(initialIndex + length > source.size())
throw new IllegalArgumentException(String.format(
"initialIndex, %d, + length, %d, must be less than or equal to source.size(), %d.",
initialIndex, length, source.size()
));

this.source = source;
this.length = length;
this.initialIndex = initialIndex;
}

@Override
public L execute() {
L destination = (L) new ArrayList<T>(length);
for(int i = 0; i < length; i++)
return destination;
}
}

package com.wordpress.antimatroid;

import java.util.ArrayList;
import java.util.List;
import java.util.RandomAccess;

public class MergeListOperation
<T extends Comparable<T>, L extends List<T> & RandomAccess>
implements IListOperation<T, L> {

private final L a, b;

public MergeListOperation(L a, L b) {
if(a == null)
throw new IllegalArgumentException("a must not be null.");

if(b == null)
throw new IllegalArgumentException("b must not be null.");

this.a = a;
this.b = b;
}

@Override
public L execute() {
int length = a.size() + b.size();
L c = (L) new ArrayList<T>(length);

int i = 0, j = 0;
for(int k = 0; k < length; k++) {
if(i < a.size() && j < b.size()) {
if(a.get(i).compareTo(b.get(j)) <= 0) {
} else {
}
} else if (i < a.size() && j >= b.size()) {
} else if (i >= a.size() && j < b.size()) {
} else {
break;
}
}

return c;
}
}

package com.wordpress.antimatroid;

import java.util.List;
import java.util.RandomAccess;

public class MergeSortListOperation <
T extends Comparable<T>,
L extends List<T> & RandomAccess
> implements IListOperation<T, L> {

private final L a;

public MergeSortListOperation(L a) {
if(a == null)
throw new IllegalArgumentException("a must not be null.");

this.a = a;
}

@Override
public L execute() {
if(a.size() <= 1)
return a;

CopyListOperation<T, L> leftPartition
= new CopyListOperation<T, L>(a, (a.size() / 2) +  a.size() % 2, 0);
CopyListOperation<T, L> rightPartition
= new CopyListOperation<T, L>(a, (a.size() / 2), (a.size() / 2) +  a.size() % 2);

MergeSortListOperation<T, L> leftSort
= new MergeSortListOperation<T, L>(leftPartition.execute());
MergeSortListOperation<T, L> rightSort
= new MergeSortListOperation<T, L>(rightPartition.execute());

MergeListOperation<T, L> merge
= new MergeListOperation<T, L>(leftSort.execute(), rightSort.execute());

return merge.execute();
}
}


#### Run Time Analysis

Noting that the theoretical time complexity is $\mathcal{O}(n \log_2 n)$, inputs of the form $2^k$ will yield a $k 2^k$ curve. Taking the logarithm of which will give $\log(k) + k$. Observing that as $k$ increases the linear term will dominate the expression. As a result, the curve should look near linear in logarithmic space with the exception of small values of $k$. Which means that conducting a linear least squares regression of the empirical run times in logarithmic space will yield a satisfactory approximation to the theoretical time complexity.

To verify that the implementation follows the theoretical time complexity, increasing values of $k$ were used to generate lists containing $2^k$ random values. These lists were then sorted and the System.nanoTime() before and after values were used to determine the elapsed time. These values were collected and a total of 50 identical trails were conducted on an Intel Core i7-2630QM CPU @ 2.00 GHz based machine with 6.00 GB RAM.

As presented in the plot, the regressed linear model in logarithmic space yields a satisfactory theoretical curve whose relative error to the empirical curve diminishes to zero as the input size increases.

### Parallel Approach

#### Specification

The parallel implementation operates under the premise that the divide portion of merge sort can be easily parallelized by sorting one partition on the present thread and sorting the other partition on a secondary thread. Once the secondary thread has completed, then the two threads join, and consequently, the two sorted lists are merged. To avoid copious thread creation, whenever the input size is less than a threshold, $\tau$, the sequential version of the algorithm is used.

This process can be easily visualized below where each left-hand branch is the originating thread processing the first partition, each right-hand branch is the secondary thread processing the second partition and the junction of those edges represents the consequent merge operation after the secondary thread as joined back in with the originating thread.

#### Time Complexity

The introduction of parallelism changes the original recurrence relation to the following:

$T(N) = \begin{cases} 1 & n \le 1 \\ 2T(n/2) + f(n) & n \le \tau \\ \max{\left (T(n/2),T(n/2)\right )} + f(n) & n > \tau \end{cases}$

Assuming, $\tau = 1$, and that there is no asymptotic difference in sorting the first and second partition, then the time complexity is on the order of $\mathcal{O}(n)$. To see this, observe that the recurrence relation becomes $T(N) = \begin{cases} 1 & n \le 1 \\ T(n/2) + f(n) & n > 1 \end{cases}$ under the presented assumtions. Following the same process of induction as in the sequential case, the recurrence relation reduces to $\displaystyle T(n) = T \left ( \frac{n}{2^k} \right ) + \sum_{m=0}^{k-1} f \left ( \frac{n}{2^m} \right )$ and is simplified further under the assumption $f(n) = c n$ to $\displaystyle T(n) = T \left ( \frac{n}{2^k} \right ) + c n \sum_{m=0}^{k-1} \frac{1}{2^m}$. Observing that the sum is a finite geometric series leads to $\displaystyle T(n) = T \left ( \frac{n}{2^k} \right ) + c n 2 (1 - \frac{1}{2^{k-1}})$ and under the same reduction argument as before to $T(n) = T(1) + c n 2 (1 - 2/n)$. Thus, the time complexity of the parallel merge sort specified is $T(n) \subset \mathcal{O}(n) \quad \square$

Assuming $\tau = \infty$, then the time complexity of the algorithm is still on the order $\mathcal{O}(n \log_2 n)$. Thus, for various values of $\tau \in [0, \infty)$ and $n \ge 2$, the time complexity is between $\mathcal{O}(n \log_2 n) \le T(n) \le \mathcal{O}(n)$.

#### Implementation

In terms of parallelizing the sequential implementation, an addition interface, IThreadedListOperation was added to provide a BeginOperation, EndOperation asynchronous programming model found in the .net world. After looking around the Java world, I didn’t encounter a preferred idiom, so I went with what I knew.

As I mentioned in the sequential approach, the original data structures were going to be arrays which have a guarantee of providing thread safe reads, but not necessarily thread safe writes. To avoid the issue all together, I decided that the IListOperations should always return a new List<T> instance so that only one thread at a time would be reading or manipulating that memory. Since I knew my implementation would not be sharing IListOperations between threads, I decided not to gold plate the implementation with synchronization constructs. If in the future such ability were required, I would go back and modify the code accordingly.

For the parallel implementation I took advantage of the fact that method arguments are evaluated left-to-right [3] to save one some space, but if the specification ever changed, then it would be more appropriate to move the out the leftSort.execute() and rightSort.executeEnd() methods up a line to form a more explicit operation.

package com.wordpress.antimatroid;

import java.util.List;
import java.util.RandomAccess;

<T, L extends List<T> & RandomAccess>
implements Runnable, IListOperation<T, L> {

public void executeBegin() {
throw new IllegalStateException();

}

public L executeEnd() {
throw new IllegalStateException();

try {
} catch (InterruptedException e) {

}

return getResult();
}

public L execute() {
throw new IllegalStateException();

run();
return getResult();
}

abstract protected L getResult();
}

package com.wordpress.antimatroid;

import java.util.List;
import java.util.RandomAccess;

<T extends Comparable<T>, L extends List<T> & RandomAccess>

private final L a;
private L b;

private final int threshold;

this(a, 1024);
}

public MergeSortThreadedListOperation(L a, int threshold) {
if(a == null)
throw new IllegalArgumentException("a must be non-null.");

if(threshold <= 0)
throw new IllegalArgumentException("threshold must be greater than zero.");

this.a = a;
this.threshold = threshold;
}

@Override
public void run() {
if(a.size() <= 1) {
b = a;
return;
}

if(a.size() <= threshold) {
MergeSortListOperation<T, L> mergeSort = new MergeSortListOperation<T, L>(a);
b = mergeSort.execute();
return;
}

CopyListOperation<T, L> leftPartition
= new CopyListOperation<T, L>(a, (a.size() / 2) +  a.size() % 2, 0);

CopyListOperation<T, L> rightPartition
= new CopyListOperation<T, L>(a, (a.size() / 2), (a.size() / 2) +  a.size() % 2);

rightSort.executeBegin();

MergeListOperation<T, L> merge
= new MergeListOperation<T, L>(leftSort.execute(), rightSort.executeEnd());

b = merge.execute();
}

@Override
protected L getResult() {
return b;
}
}


#### Run Time Analysis

Noting that the time complexity for the parallel approach is $\mathcal{O}(n)$, a simple linear least squares regression of the empirical run times in normal space will yield a satisfactory approximation to the theoretical time complexity.

The trial methodology used in the sequential run time analysis is used once again to produce the following plot. Note that it begins at 2048 instead of 1. This was done so that only the parallel implementation was considered and not the sequential implementation when the input size is $\le 1024$.

As presented in the plot, the regressed linear model in logarithmic space yields a satisfactory theoretical curve whose relative error to the empirical curve diminishes to zero as the input size increases.

#### Threshold Selection

As a thought experiment, it makes sense that as the threshold approaches infinity, that there is no difference between the sequential implementation and parallel one. Likewise, as the threshold approaches one, then the number of threads being created becomes exceedingly large and as a result, places a higher cost on parallelizing the operation. Someplace in the middle ought to be an optimal threshold that yields better run time performance compared to the sequential implementation and a pure parallel implementation. So a fixed input size should produce a convex curve as a function of the threshold and hence have a global minimum.

Conducting a similar set of trials as the ones conducted under the analysis of the sequential run time give a fully parallel and sequential curve which to evaluate where the optimal threshold resides. As the plot depicts, as the threshold approaches one, there is an increase in the processing taking the form of a convex curve. As the threshold exceeds the input size, then the sequential approach dominates. By conducting a Paired T-Test against the means of the two curves at each input size, 1024 was determined to be the optimal threshold based on the hardware used to conduct the trials. As the input size grows, it is evident that for thresholds less than 1024, the sequential approach requires less time and afterwards, the parallel approach is favorable.

### Conclusion

In comparing the sequential and parallel implementations it was observed that the specified parallel implementation produced as much as a 2.65 factor improvement over the specified sequential implementation for megabyte sized lists.

Larger sized lists exhibited a declining improvement factor. It is presumed that as the input size grows that the amount of memory being created is causing excessive paging and as a result increasing the total run time and consequently reducing the improvement factor. To get around this limitation, the algorithm would need to utilize an in-place approach and appropriate synchronization constructs put into place to guarantee thread safety.

From a theoretical point of view, the improvement factor is the ratio of the run time of the sequential implementation to the parallel implementation. Using the time complexities presented, $\displaystyle S = \frac{n \log_2 n}{n}$. Taking the limit as the input size grows to infinity gives $\displaystyle \lim_{n \to \infty} \log_2 n = \infty$. So if there is any upper bound to the improvement factor it should be purely technical.

### Footnotes

[1] This design decision is discussed in §4.7 of the Java Language Specification (3rd Edition) on reifiable types.

[2] The only two java.util classes providing this guarantee are ArrayList and Vector. Both of which implement the interface RandomAccess which is intended indicate that the class provides the (near) constant reading and writing of elements.

[3] The left-to-right order of operations is specified by §15.7.4 of the Java Language Specification (3rd Edition). Also worth noting the specification recommends against the practice of relying on this convention however in §15.7:

… It is recommended that code not rely crucially on this specification. Code is usually clearer when each expression contains at most one side effect, as its outermost operation, and when code does not depend on exactly which exception arises as a consequence of the left-to-right evaluation of expressions.

Written by lewellen

2012-12-01 at 8:00 am

## Viderefit: A Fitness Tracking App for Android Tablets

### Introduction

Earlier this year I talked a bit about how I wanted to do some Android development to broaden my skill set. A little after that post I finally joined the 21st century and got an Android smartphone. Over the course of the summer I recorded all of my hikes and bike rides using Google’s My Tracks app. With my season coming to a close, I began to think about what I could do with all this data that I’d collected and what kind of insights I could discover. As a result, I came up with Viderefit, a simple Android tablet app, that allows me to review my changes in my performance over time. In this post I’m going to go over the product design cycle that went into making this first phase of the app- from brain storming, requirements building, user interface design, development, and post-production. I’ll be finishing up with some thoughts on the next set of features I’ll be contemplating to make the app commercially viable on Google Play.

### Goals

For the first phase of the project, I set out with a few simple goals that I wanted to focus on:

• Since I’d been focusing research projects lately, I wanted to return to my roots and focus a bit on user experience and interface design. As a result, I decided It was important to me that I create an intuitive to use and visually appealing user interface that utilized a number of appropriate and meaningful information visualization techniques.
• I’ve done a lot of C# and Haskell development lately, and I wanted to do something relatively new, so I decided that I would develop on Android and dust off my Java skills from my college days.
• I wanted a “quick win”, something that I knew that I could complete over the course of a couple weeks. As a result, I decided that I would spend two weeks planning the project starting in mid-September, followed by another two weeks of development wrapping up mid-October, and the remaining two weeks of October to put together this post for a November publication.

### Brain Storming

In thinking about what exactly it was I was going to build I began to ask myself, what questions should I be asking myself. So, I opened up Word and began typing out a bullet point list of questions to understand where I was going with this project. First thing I began to think about was what exactly is physical fitness? What exactly is being measured over time to show improvement? What exactly is improvement? I had some ideas from my experience, but nothing formal, so like anyone else, I jumped Wikipedia and came across the following quotation on the topic’s page:

Physical fitness has been defined as a set of attributes or characteristics that people have or achieve that relates to the ability to perform physical activity. – Physical Activity and Health: A Report of the Surgeon General

Not being completely satisfied with this, I looked around a bit more and found several pages outlining five areas that constitute physical fitness: aerobic or cardiovascular endurance, body composition, muscular strength, muscular endurance and finally, flexibility. Having felt like some progress was made, I moved on to the next question pertaining to measurements. Researching each of the five areas yielded some insights in the types of tests and measurements being used to assess these abilities such as VO2 max, BMI, ROM, S&R and a whole slew of alphabet soup measurements that I unfortunately did not have access to nor were they obtainable from the available set of data.

Thinking a bit more about the data that was available to me, it was clear the only area of physical fitness I could focus on was aerobic endurance. Despite the fact I lacked sufficient data to derive some of the formal measures of physical fitness, I could derive some common sense measures to relate my performance over time. Am I going longer, going further, going faster as I got deeper into my season? Is my improvement uniform over time or did I hit any plateaus? And so on. To explore these ideas, I exported all of the My Tracks data from my smartphone to a SD Card and combined the results using a throwaway C# application and loaded the combined CSV file into Excel.

Based on my explorations in Excel, I decided that I had the data I needed to answer the types of common sense question I was thinking about and decided what I was after was three different views of my data: a summary dashboard, performance reporting and a raw view of the data.

### Requirements

In deciding on my requirements, I thought a bit about what I had read in Ben Fry‘s Visualizing Data: Exploring and Explaining Data, that exploring most data sets consists of acquiring, parsing, filtering, mining, representing, refining and interacting with the data set. Keeping that in mind, I decided that I would likely have a series of tabs providing different views of the underlying data and sets of tools on each tab to assist in that data’s interpretation.

The summary dashboard is about capturing the “big picture” of the underlying data. In thinking about what was important to me, I wanted to capture how far I’d gone, how long I’d spent and how many times I went out. Each of these three sections would be broken down into a total, a percentage of some reference figures (e.g., the distance between cities), a chart showing the total broken out by activity type and month, box plot showing the underling distribution, a stacked bar chart showing the underlying activity distribution and finally the longest, furthest, or most common track was to be presented.

Performance reporting is about enabling the end user to explore the underlying data. In essence, enabling the end user to chart different features plotted against one another and summarized according to some scheme. The user would then be able to filter by activity type and break the data set out into pre-season, mid-season and post-season components to better see trends taking place over time.

Finally, the raw view of the data provides a listing of all the tracks that were captured. Selecting a track displays a speed and altitude plot along with box plots for speed and altitude for the track in addition to box plots for the season so that the user can compare how a particular track compares to seasonal statistics.

### Design

With an idea of the type of requirements that would be needed, it is time to flush out what the user interface will look like and how the user will interact with it. There are a lot of great products out there for mocking up user interfaces, I’ve used Balsamiq and really enjoyed it, but decided for this project, I would keep things simple and just mock things up in Photoshop since it’s easy to work with and low fidelity designs are all that’s needed at this point.

The summary dashboard incorporates the requirements into a vertical three panel design capturing distance, time and frequency of the underlying data. The dashboard is meant to be looked at and not interacted with, as a result the only thing the end user can do at this point is click on other tabs.

Bulk of the features in the application will appear on the performance reporting tab. The user will be able to select x-Axis, y-Axis features and y-Axis feature aggregation strategy in order to plot out the results in the right-hand chart area. Beneath the selection criteria are checkboxes for splitting the data out in to full season, pre-season, mid-season and post-season components. Finally, the user can filter out different activities by checking each activity for exclusion or inclusion.

The view of the raw data is to provide a view outlining all of the user’s tracks. Each track listing includes the name of the track, date, length, duration and a altitude sparkline. Clicking on a track reveals a speed and altitude plot along with the box plots described in the previous section.

### Development

Based on the planning done earlier in the project, development was a matter spending some time in IntelliJ, translating the design into the appropriate Android conventions and implementing the necessary logic to calculate various statistics and data calculations.

Much of what was needed to implement the application was available in the JDK and Android SDK, however there were a few areas I felt I could leverage existing open source libraries without having to roll my own solution (especially given the timeline I had decided upon):

• For charting I decided to use achartengine (1.0.0) since it looked to be the most stable and used charting library for Android.
• To parse the CSV file containing all of the track information, I went with opencsv (2.3) since it seems to most widely used. Although it does look like an Apache Commons CSV package is in the works but not yet final.
• Since the time and date handing in Java is embarrassingly lacking in JDK 1.6, I ended up using joda-time (2.1) for presenting end user friendly date and time representations.

In terms of code organization and events that take place, the application is structured around Android’s fragment approach to deal with having to provide different views based on the device being used. Since the focus of the application was to develop a tablet application, no additional layouts were developed to support smaller screen sizes. The main activity consists of loading data from an SD card and creating handlers for tab events for each of the three tabs. Each individual tab consists of a replicated paradigm of master-detail fragments and additional settings that are passed between fragments as bundles whenever an end user event takes place.

The referenced packages: common, controls, reporting and serialization contain classes for binding views to data, data aggregation (essentially a watered-down version of .NET’s LINQ and Haskell’s higher order functions), and classes for loading track data into memory.

### Post-production

With development complete, I set out to do some polishing and make the application visually consistent. To start things off, I wanted to settle on a color palette for the application. This was done by sampling the HSB space on 60 degrees increments of hue offset by 0, 15, and 30 degrees of hue, with fixed 100% saturation and 80% brightness giving a vibrant 24 color palette to work with.

Once the color scheme was established, it was a matter of going through the application and making sure each user interface element was using the correct color resource. Once the colors were applied, I focused on applying a consistent spacing between all of the UI elements- specifically 16 dp and 8 dp depending on the context and bordering between components. From there, each chart was examined to make sure that the axes units and labels were presented correctly. One gripe about achartengine is that I was unable to find a way to set the axis title’s padding, so there is some overlap between the axis value labels and the axis title.

With the application spruced up, it was on to icon design and selection. For the application icon I decided to to do a simple vector-based tri-folded map with an overlaid panel and chart.

For the icons to be used in the application, I used those found in Google’s My Tracks app since those icons are available under a Creative Commons 3.0 BY-SA license and represent the vast majority of data that would be imported. Worth noting that most of those icons are based on the National Park Service’s Map Symbols Collection. Future versions of Viderefit will likely switch over to the NPS set since they represent a more considerable collection of activities and the collection is under a public domain license.

Last thing to settle on was the name of the application. During development the name was simply “My Track Visualizer”, but I wanted the app to be more than that, so I decided on “SeeFit” or “cFit”, whichever happened to be available. After a Google search, neither were available so, I decided to use the Latin word for “to see”, Videre, and luckily “Viderefit” didn’t show up in any Google search results, so presumably nobody else has taken it.

### End result

After finishing post-production, the application was looking much more consistent and polished. Below are screenshots of each tab taken from my 10.1″ Acer Iconia Tab A500 development device.

### Future Work

In thinking about what it will take to get this product market worthy, there are a few things that stand out in my mind:

• Data importing – for this project the data resided on an external SD card and was loaded into memory from a parsed CSV file. Ideally, the end user should be able to import data from multiple applications and in varying formats such that the data is stored more efficiently using the built-in SQLite capabilities of the platform.
• Data exporting – one open question is how this application fits into the broader fitness application ecosystem and what exportable data is of use to other applications.
• Territory – the project omits any presentation of actual GPS data recorded during each session. For those individuals who are more interested in where they’ve been over their measurements, it seems a territory view would be a selling point. In addition, some form of integration done with the Google Maps API would also help visualize territory and speeds on the map over time.
• Additional screen support – right now the application was designed specifically for tablets, but other users may wish to use this application on their smartphone.
• Goal support – having reviewed other applications on the market, the idea of goal tracking is a recurring theme. To make the product more marketable, it would be useful to add in this kind of functionality.

### Conclusion

Reflecting back on the goals I set out at the beginning of the project, I feel I made an intuitive and easy to use product consisting of appropriate information visualization techniques; worked on a new platform other than my typical C# and Haskell environments; finally, managed to complete the project according to the timeline I established by finishing all work a full two weeks sooner than originally scheduled. Working on a new platform and hardware was a fun experience. With the work that I’ve done, I feel that I can proceed onto the next phase of the project and determine how to get the application ready for the marketplace and hopefully bring in some revenue in the process.

Written by lewellen

2012-11-01 at 8:00 am

Posted in Projects

Tagged with , , ,

## Category Recognition of Golden and Silver Age Comic Book Covers

### Introduction

#### Motivation

For a while now, I’ve been wanting to work on a computer vision project and decided for my next research focused project that I would learn some image processing and machine learning algorithms in order to build a system that would classify the visual contents of images, a category recognizer. Over the course of the summer I researched several techniques and built the system I had envisioned. The end result is by no means state of the art, but satisfactory for four months of on-and-off development and research. The following post includes my notes on the techniques and algorithms that were used in the project followed by a summary of the system and its performance against a comic book data set that was produced during development.

#### Subject Matter

The original subject matter of this project were paintings from the 1890s done in the Cloisonnism art style. Artists of the style are exemplified by Emile Bernard, Paul Gaugin and Paul Serusier. The style is characterized by large regions of flat colors outlined by dark lines; characteristics that would be easy to work with using established image processing techniques. During development, it became evident that no one approach would work well with these images. As an alternative, I decided to work with Golden and Silver Age comic book covers from the 1940s to 1960s which also typified this art style. Many of the comic books were drawn by the same individuals such as Jack Kirby, Joe Shuster and Bob Kane. As an added benefit, there are thousands of comic book covers available online compared to the dozens of Cloisonnism paintings.

### Image Processing

#### Representation

An image is a function, $I : \mathbb{Z}^2 \to \mathbb{Z}^3$, where each input vector, $\vec{x}$, represents an image coordinate and each output vector, $\vec{y}$, represents the red, blue and green (RGB) channels, $\vec{y}_c$, of an image. Individual input values are bound between zero and the width, $w$, or height, $h$, of the image and output values are between zero and $255$. Each output vector represents a unique color in RGB space giving rise to $2^{24}$ possible colors. Below is a basic sample image broken down into to its individual channels.

Like any other vector field, transformations can be applied to the image to yield a new image, $\hat{I}$. In image processing, these transformations are referred to as image filters and come in three varieties of point-based, neighbor-based and image-based filters. As the names suggest, point-based filters map single output vectors to a single output vector, neighbor-based filters map neighboring output vectors to a single output vector, and image-based filters map the whole image and a single or neighboring set of output vectors to a single output vector.

There are many different instances of these types of filters, but only those used in this project are discussed below. Computational complexity and efficient algorithms for each type of filter are also discussed where appropriate.

##### Point-based Filters

Point-based filters, $f : \mathbb{Z}^3 \to \mathbb{Z}^3$, map an output vector to a new output vector in the form $\hat{I}(\vec{x}) = f(I(\vec{x}))$. Application of a point-based filter is done in quadratic time with respect to the dimensions of the image $\mathcal{O}(N^2)$.

###### Grayscale Projection

It is helpful to collapse the RGB channels of an image down to a single channel for the purpose of simplifying filter results. This can be done by using a filter of the form $f(\vec{y})_c = \frac{ \lVert \vec{y} \rVert_2 }{ \sqrt{3} }$. Alternatively one can use a filter of the form $f(\vec{y})_c = (0.2126, 0.7152, 0.0722)^T \cdot \vec{y}$ to represent the luminescence of the output vector.

###### Thresholding

A threshold filter serves as a way to accentuate all values in the image greater than or equal to a threshold, $\gamma$, or to attenuate all those values less than the threshold.

The first variety is the step threshold filter, $f(\vec{y})_c = \begin{cases} 255 & \vec{y}_{c} \ge \gamma \\ 0 & \text{otherwise} \end{cases}$, which exhibits an ideal threshold cutoff after the threshold value.

The second variety is a logistic threshold filter, $\displaystyle f(\vec{y})_c = \frac{255}{ 1.0 + \exp{\sigma(\gamma - \vec{y}_c)} }$, with an additional parameter, $\sigma$, allowing for wiggle room about the threshold yielding a tapered step function as $\sigma$ increases in size.

##### Neighbor-based Filters

All neighbor-based filters take the output vectors neighboring an input vector to calculate a new output vector value. How the neighboring output vectors should be aggregated together is given by a kernel image, $K$, and the computation is represented as a two-dimensional, discrete convolution.

$\hat{I}_c = \displaystyle (I \star K)(u,v)_c = \sum_{i=-\infty}^{\infty} \sum_{j=-\infty}^{\infty} I(i,j)_c K(u-i,v-j)_c$

Neighbor-based filters can be applied naïvely in quartic time as a function of the image and kernel dimensions, $\mathcal{O}(N^4)$. However, a more efficient implementation allows for $\mathcal{O}(N^2 \log_2 N)$ time by way of the Discrete Fourier Transform.

$\displaystyle \mathcal{F}(f)(x) = \sum_{n=0}^{N-1} f(n) \exp{ -2 \pi i \frac{n}{N} x}$

The Discrete Fourier Transform is a way of converting a signal residing in the spatial domain into a signal in the frequency domain by aggregating waveforms of varying frequencies where each waveform is amplified by its corresponding value in the input signal. The Inverse Discrete Fourier Transform maps a frequency domain signal back to the spatial domain.

$\displaystyle \mathcal{F}^{-}(f)(x) = \frac{1}{N} \sum_{n=0}^{N-1} f(n) \exp{ 2 \pi i \frac{n}{N} x}$

Applying the Discrete Fourier Transform to a convolution, $\mathcal{F}(f \star g)$, comes with the convenient property that the transformed convolution can be rewritten as the product of the transformed functions, $\mathcal{F}(f) \mathcal{F}(g)$, by way of the Convolution Theorem.

The improved time complexity is achieved by using a divide a conquer algorithm known as the Fast Fourier Transform which takes advantage of the Danielson-Lanczos Lemma which states that the Discrete Fourier Transform of a signal can be calculated by splitting the signal into two equal sized signals and computing their Discrete Fourier Transform.

$\displaystyle \mathcal{F}(f)(x) = \sum_{n=0}^{\frac{N}{2} - 1} f(2n) \exp{ -2 \pi i \frac{2n}{N} x} + \sum_{n=0}^{\frac{N}{2} - 1} f(2n+1) \exp{ -2 \pi i \frac{2n+1}{N} x }$

$\displaystyle \mathcal{F}(f)(x) = \sum_{n=0}^{\frac{N}{2} - 1} f(2n) \exp{ -2 \pi i \frac{n}{N / 2} x} + \exp{ -2 \pi i \frac{x}{N} } \sum_{n=0}^{\frac{N}{2} - 1} f(2n+1) \exp{ -2 \pi i \frac{n}{ N / 2 } x }$

$\displaystyle \mathcal{F}(f)(x) = \mathcal{F}(f_{even})(x) + \mathcal{F}(f_{odd})(x) \exp{ -2 \pi i \frac{x}{N} }$

For the purposes of image processing, we use the two-dimensional Discrete and Inverse Discrete Fourier Transform.

$\displaystyle \mathcal{F}(f)(u, v) = \frac{1}{w h} \sum_{m = 0}^{w - 1} \sum_{n = 0}^{h-1} f(m,n) \exp{-2 \pi i \left (\frac{m u}{h} + \frac{nv}{w} \right) }$

The expression can be rearranged to be the Discrete Fourier Transform of each column in the image and then computing the resulting Discrete Fourier Transform of those results to obtain the full two-dimensional Discrete Fourier Transform.

$\displaystyle \mathcal{F}(f)(u, v) = \frac{1}{w} \sum_{m = 0}^{w-1} \left ( \frac{1}{h} \sum_{n = 0}^{h-1} f(m,n) \exp{-\frac{2 \pi i n v}{h}} \right) \exp{-\frac{2 \pi i m u}{w}}$

As a result, we can extend the Fast Fourier Transform in one dimension easily into two dimensions producing a much more efficient time complexity.

###### Weighted Means: Gaussian and Inverse Distance

Weighted mean filters are used to modify the morphology of an image by averaging neighboring output vectors together according to some scheme.

A Gaussian filter is used to blur an image by using the Gaussian distribution with standard deviation, $\sigma$, as a kernel.

$\displaystyle K(u, v)_c = \frac{1}{2 \pi \sigma^2} \exp{-\frac{u^2+v^2}{2 \sigma^2} }$

The inverse distance filter calculates how far the neighboring output vectors are with respect to the new output vector being calculated. Each result is also scaled by the parameter, $\gamma$, allowing for contrast adjustments.

$\displaystyle K(u, v)_c = \begin{cases} \gamma \lVert (u-i, v-j) \rVert^{-} & (u,v) \neq (i,j) \\ 0 & \text{otherwise} \end{cases}$

###### Laplacian

A Laplacian filter detects changes in an image and can be used for sharpening and edge detection. Much like in calculus of a single variable, the slope of a surface can be calculated by the Gradient operator, $\displaystyle \nabla = \left ( \frac{\partial}{\partial x}, \frac{\partial}{\partial y} \right )$. Since it is easier to work with a scalar than a vector, the magnitude of the gradient is given by the Laplacian operator, $\displaystyle \Delta = \nabla \cdot \nabla = \frac{\partial^2}{\partial x^2} + \frac{\partial^2}{\partial y^2}$.

Since an image is a discrete function, the Laplacian operator needs to be approximated numerically using a central difference. $h$ represents the spacing between successive samples of the underlying function. Since the finest resolution that can be achieved in an image is an individual displacement, $h = 1$.

$\displaystyle \Delta I \approx \frac{I(x + h, y) + I(x, y + h) - 4 I(x,y) - I(x - h, y) + I(x, y - h)}{h^2}$

$\displaystyle \Delta I \approx I \star K = I \star \frac{1}{h^2} \begin{pmatrix} 0 & 1 & 0 \\ 1 & -4 & 1 \\ 0 & 1 & 0 \end{pmatrix}$

##### Image-based Filters

Image-based filters calculate some information about the contents of the image and then use that information to generate the appropriate point-based and neighbor based filters.

###### Normalization

The normalization process computes the minimum, $\vec{y}_{c}^{\text{min}}$ and maximum, $\vec{y}_{c}^{\text{max}}$ values of each channel and linearly maps all values between those extrema to new values between the possible channel extrema of $c^{\text{min}} = 0$ and $c^{\text{max}} = 255$.

$\displaystyle \hat{I}(\vec{x})_c = \frac{I(\vec{x})_c - \vec{y}^{\text{min}}_c}{\vec{y}^{\text{max}}_c - \vec{y}^{\text{max}}_c} (c^\text{max} - c^\text{min}) + c^\text{min}$

This particular image-based filter can be applied in quadratic time, $\mathcal{O}(N^2)$, to calculate the extrema of the image and apply the linear map.

#### Edge Detection

Edge detection is the process of identifying changes (e.g., texture, color, luminance and so on) in an image. As alluded to in the image processing section, the Laplacian filter is central to detecting edges within an image. As a result A sequence of filters is used before and after a Laplacian filter to produce a detector that consistently segments comic book covers. The following sequence of filters was used.

1. Grayscale Projection – Since all logical components of a comic book cover are separated by inked lines, it is permissible to ignore the full set of RGB channel information and collapse the image down to a grayscale image.
2. Normalization – It is conceivable that the input image has poor contrast and brightness. To ensure that the full range of luminescence values are presented, the image is normalized.
3. Gaussian ($\sigma = 1.5$) – An image may have some degree of noise superimposed on the image. To reduce the noise, the image is blurred using a Gaussian filter with a standard deviation of $1.5$. This is enough to smooth out the image without distorting finer image detail.
4. Laplacian – Once the image has been prepared, its edges are calculated using the Laplacian filter.
5. Normalization – Most of the changes in the image may be subtle and need to make sure that all edge information is accentuated as much as possible by applying a normalization filter.
6. Step Threshold ($\gamma = 0.05$) – Since a partial edge isn’t particularly useful information, any edge RGB value less than $12.75$ is attenuated to zero and all other values accentuated to $255$.
7. Inverse Distance ($\gamma = 1.5$) – It is possible that during the threshold process that discontinuities were introduced into some of the edges. To mitigate this impact, an inverse distance filter is used to inflate existing edges and intensify the result with a gain of $1.5$.

The complete edge detection process takes computational complexity of $\mathcal{O}(N^2 \log_2 N)$ due to the neighbor-based filters used to eliminate noise and smooth edge discontinuities.

#### Segmentation

With the edge image it is possible to segment the image into its visual components. This is achieved by doing a flood fill on the image and using the edge image as the boundaries for the fill. Once the fill runs out of points to flood, the segment is complete and the next remaining point in the image is considered. To reduce the number of minuscule segments, only those segments representing $0.1 \%$ of the image are included.

### Machine Learning

#### Classifiers

The task of classification is to identify decision boundaries separating all of the classification within the data set. Such data sets can be linearly or non-linearly separable and as a result, classifiers were developed to solve the linear case and then adapted to deal with the more complicated non-linear case. While there are a number of classifiers, only the K-Nearest Neighbor and Support Vector Machine classifiers were researched and implemented in this project.

##### K-Nearest Neighbor

The K-Nearest Neighbor classifier is an online classifier which operates under the assumption that a yet to be classified vector is most likely to be the same classification as those $k$ training vectors which are closest to the vector based on a distance measure, $d : \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}$.

Distance can be measured in a variety of ways for arbitrary vectors, $\vec{x}, \vec{y} \in \mathbb{R}^n$, residing in some real space. The most common of which are specialized cases of the Minkowski distance.

$\displaystyle d_{p}(\vec{x},\vec{y}) = \left ( \sum_{i = 0}^{n} \lvert \vec{x}_{i} - \vec{y}_{i} \rvert^{p} \right )^{ \frac{1}{p} }$

The Manhattan distance, $d_{1}(\vec{x},\vec{y})$, yields the distance traveled along a grid between two vectors (hence a name in reference to the New York City borough). The Euclidean distance, $d_{2}(\vec{x}, \vec{y})$, gives the distance between the vectors in the usual familiar sense. The last specialized cased considered is the Chebyshev distance, $d_{\infty}(\vec{x},\vec{y})$, which gives the maximum distance between any one dimension of the two vectors.

Two factors affect the efficacy of the algorithm. The first is the dimension of the data, $n$, and the size of the train data set, $N$. As the training data set increases with size, there are more vectors which a test vector must be compared against. As a result, an efficient means of searching the training set must be used to yield satisfactory performance. This can be achieved by using kd-Trees which give $\mathcal{O}(\log N)$ search performance or branch and bound methods giving similar performance. As the dimensionality of the dataset increases, the efficacy of kd-Trees diminishes to a near linear search of the training data set due to the “curse of dimensionality.”

##### Support Vector Machine
###### Formulation

The Support Vector Machine classifier is an offline linear, binary classifier which operates under the assumption that a training set, $(\vec{x}, y)^{(i)} \in \mathcal{D}$, consists of linearly separable classifications, $y \in \lbrace -1, +1 \rbrace$, of data, $\vec{x} \in \mathbb{R}^n$, by some optimal hyperplane of the form $\langle \vec{w}, \vec{x} \rangle + b = 0$. Where $\langle \cdot, \cdot \rangle$ is the inner product, $\vec{w} \in \mathbb{R}^n$ and $b \in \mathbb{R}$. When $\langle \vec{w}, \vec{x} \rangle + b \ge 1$, then the classification $+1$ is presented and when $\langle \vec{w}, \vec{x} \rangle + b \le -1$, the classification $-1$ is presented.

The hyperplane is padded by two hyperplanes separated by an equal distance to the nearest training examples of each classification. The span between the supporting hyper planes is the margin. The goal then is to pick a hyperplane which provides the largest margin between the two separable classifications. The margin between the supporting hyperplanes is given by $\displaystyle \frac{2}{\lVert \vec{w} \rVert}$. Given the demarcation criteria, the maximum margin will also be subject to the constraint that all training examples satisfy $y^{(i)} (\langle \vec{w}, \vec{x}^{(i)} \rangle + b) - 1 \ge 0$. As a result of the objective function and accompanying linear constraint, the problem is stated in terms of its native primal Quadratic Programming form.

$\min \mathcal{L}_P(\vec{w}) = \frac{1}{2} \langle \vec{w}, \vec{w} \rangle$ subject to $y^{(i)} (\langle \vec{w}, \vec{x}^{(i)} \rangle + b) - 1 \ge 0$ $\forall (\vec{x}, y)^{(i)} \in \mathcal{D}$

To find the optimal parameters, it is easier to translate the problem into a dual form by applying the technique of Lagrange Multipliers. The technique takes an objective function, $f$, and constraint functions, $g^{(i)}$, and yields a new function $\mathcal{L}(\vec{x}, \vec{\alpha}) = f(\vec{x}) + \sum \vec{\alpha}_i g(\vec{x})^{(i)}$ to be optimized subject to the added constraint $\vec{\alpha}_i \ge 0$ $\forall i$.

$\displaystyle \max_{\vec{\alpha}} \mathcal{L}(\vec{w}, b, \vec{\alpha}) = \frac{1}{2} \langle \vec{w}, \vec{w} \rangle - \sum_{i=0}^{\lvert \mathcal{D} \rvert-1} \vec{\alpha}_{i} (y^{(i)} (\langle \vec{w}, \vec{x}^{(i)} \rangle + b) - 1)$ $\vec{\alpha} \in \mathbb{R}^{\lvert \mathcal{D} \rvert }$ subject to $\vec{\alpha}_{i} \ge 0$ $\forall i$

The next step is to differentiate the objective function with respect to the parameters to determine the optimal solution. Since the function is concave, the results will yield the desired maximum constraints.

$\displaystyle \frac{\partial \mathcal{L}}{\partial b} = 0 \implies \sum_{i=0}^{\lvert \mathcal{D} \rvert-1} \vec{\alpha}_i y^{(i)} = 0$ $\displaystyle \frac{\partial \mathcal{L}}{\partial \vec{w}} = 0 \implies \vec{w} = \sum_{i=0}^{\lvert \mathcal{D} \rvert-1} \vec{\alpha}_i y^{(i)} \vec{x}^{(i)}$

As a result the dual problem can be written as the following:

$\displaystyle \max \mathcal{L}_D(\vec{\alpha}) = \frac{1}{2} \sum_{i=0}^{\lvert \mathcal{D} \rvert-1} \sum_{j=0}^{\lvert \mathcal{D} \rvert-1} \vec{\alpha}_i \vec{\alpha}_j y^{(i)} y^{(j)} \langle \vec{x}^{(i)}, \vec{x}^{(j)} \rangle - \sum_{k=0}^{\lvert \mathcal{D} \rvert-1} \vec{\alpha}_k$ subject to $\vec{\alpha}_{i} \ge 0$ $\forall i$, $\displaystyle \sum_{j=0}^{\lvert \mathcal{D} \rvert-1} \vec{\alpha}_j y^{(j)} = 0$

###### Handling of non-linearly separable data

In the event that the data is not linearly separable, then an additional parameter, $C$, is added as a penalty factor for those values that reside on the wrong side of the hyperplane. The derivation for the quadratic program is identical to the one presented above with the exception that the lagrange multipliers now have an upper boundary $0 \le \vec{\alpha}_i \le C$ $\forall i$.

###### Non-linear classification

By way of Mercer’s Theorem, the linear Support Vector Machine can be modified to allow for non-linear classification through the introduction of symmetric, positive semidefinite kernel functions, $\Phi : \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}$. The idea being that if the data is not linearly separable in its present dimensional space that by mapping it to a higher dimensional space that the data may become linearly separable by some higher dimensional hyperplane. The benefit of a kernel function is that the higher dimensional vector need not be computed explicitly. This “kernel trick” allows for all inner products in the dual representation to be substituted with a kernel.

$\displaystyle \max \mathcal{L}_D(\vec{\alpha}) = \frac{1}{2} \sum_{i=0}^{\lvert \mathcal{D} \rvert-1} \sum_{j=0}^{\lvert \mathcal{D} \rvert-1} \vec{\alpha}_i \vec{\alpha}_j y^{(i)} y^{(j)} \Phi(\vec{x}^{(i)}, \vec{x}^{(j)}) - \sum_{k=0}^{\lvert \mathcal{D} \rvert-1} \vec{\alpha}_k$ subject to $0 \le \vec{\alpha}_{i} \le C$ $\forall i$, $\displaystyle \sum_{j=0}^{\lvert \mathcal{D} \rvert-1} \vec{\alpha}_j y^{(j)} = 0$

And the decision hyperplane function then becomes:

$\displaystyle f(\vec{x}) = \sum_{i=0}^{\lvert \mathcal{D} \rvert-1} \alpha_i y^{(i)} \Phi (\vec{x}^{(i)}, \vec{x}) + b$

The following are some typical kernels:

• Linear – $\Phi(\vec{x}, \vec{y}) = \langle \vec{x}, \vec{y} \rangle$
• Polynomial – $\Phi(\vec{x}, \vec{y}) = (\gamma \langle \vec{x}, \vec{y} \rangle + r)^d$ $\gamma > 0$
• Radial basis function – $\Phi(\vec{x}, \vec{y}) = \exp{-\gamma \langle \vec{x} - \vec{y}, \vec{x} - \vec{y} \rangle}$ $\gamma > 0$
• Sigmoid – $\Phi(\vec{x}, \vec{y}) = \tanh (\gamma \langle \vec{x}, \vec{y} \rangle + r)$

From a practical point of view, only the linear and radial basis function kernels from this list should be considered since the polynomial kernel has too many parameters to optimize and the sigmoid kernel does not satisfy the positive semidefinite kernel matrix requirement of Mercer’s Theorem.

###### Algorithmic details

The Support Vector Machine classifier can be implemented using a quadratic programming solver or by incremental descent algorithms. Both methods work, but are difficult to implement and expensive to procure. An alternative is the Sequential Minimal Optimization algorithm developed by John Platt at Microsoft Research. The algorithm works by analytically solving the dual problem for the case of two training examples then iterating over all of the lagrange multipliers verifying that the constraints are satisfied. For those that are not, the algorithm computes new lagrange multiplier values. The full details of the algorithm can be found in Platt’s paper.

The time complexity of the algorithm is quadratic with respect to the number of training samples and support vectors $\mathcal{O}(N^2)$.

The time complexity of evaluating the decision function is linear with respect to the number of support vectors $\mathcal{O}(N)$.

#### Multiclass Classification

The classification methods presented in the previous section are utilized as binary classifiers. These classifiers can be used to classify multiple classifications by employing a one-vs-all or all-vs-all approach. In the former a single classification is separated from the remaining classifications to produce $N$ classifiers for the $N$ classifications. Each classifier is then used to evaluate a vector and the classifier with the highest confidence is then used to declare the classification.

In the latter, a single classification is compared individually to each other classification resulting in $\frac{N(N - 1)}{2}$ classifiers. All of the classifiers are then evaluated against the test vector and the classification with the greatest consensus from the classifiers is declared the classification of the test vector.

Both methods have their place. The benefit of a one-vs-all approach is that there are fewer classifiers to maintain. However, training a single classifier on a complete data set is time consuming and can give deceptive performance measures. All-vs-all does result in more classifiers, but it also provides for faster training which can be easily parallelized on a single machine and distributed to machines on a network.

#### Classifier Evaluation

Individual classifiers are evaluated by training the classifier against a data set and then determining how many correct and incorrect classifications were produced. This evaluation produces a confusion matrix.

Predicted Classification
Positive Negatives Total
Actual Classification Positive (TP) True Positive (FN) False Negative (AP) Actual Positives
Negatives (FP) False Positive (TN) True Negative (AN) Actual Negatives
Total (PP) Predicted Positives (PN) Predicted Negatives (N) Examples
Confusion matrix defintion and associated terms.

The confusion matrix is used to calculate a number of values which are used to evaluate the performance of the classifier. The first of which is the accuracy and error of the classifier. Accuracy measures the number of instances where the actual and predicted classifications matched up and the error for when they do not.

$\displaystyle \text{Accuracy} = \frac{TP + TN}{N}$ $\displaystyle \text{Error} = \frac{FP + FN}{N}$

Since we should expect to get different results each time we evaluate a classifier, the values that we obtain above are sample estimates of the true values that are expected. Given enough trails and measurements, it is possible to determine empirically what the true values actually are. However, this is time consuming and it is instead easier to use confidence intervals to determine what interval of values a measurement is mostly likely to fall into.

#### Training and Testing

Each of the classifiers presented have some number of parameters that must be determined. The parameters can be selected by having some prior knowledge or by exploring the parameter space and determining which parameters yield optimal performance. This is done by performing a simple grid search over the parameter space and evaluating and attempting to minimize the error.

K-folds cross-validation is used at each grid location to produce a reliable measure of the error. The idea is that a data set is split into $k$ disjoint sets. The first set is used as a validation set and the remaining $k - 1$ sets are used in unison as the training data set for the classifier. This process is done on the next set and so on until all $k$ sets have been used as a validation set.

### System

#### Implementation

The system was implemented in C# 4.0 on top of the Microsoft .NET Framework. The user interface was written by hand using the WinForms library. No other third-party libraries or frameworks were used. When possible, all algorithms were parallelized to take advantage of multi-core capabilities to improve processing times.

#### Summary

The system consists of two modes of operation: training and production. In training, a human classifier labels image segments with an appropriate classification. New image segments are then taken into consideration during the training of machine learning algorithms. Those algorithms producing the lowest error for a given classification are then used in production mode. During production, a user submits an image and each image segment is then evaluated against the available classifiers. Those image segments are then presented to the user with the most likely classification. These two modes along with their workflows and components are illustrated in the following diagram.

#### Training Mode

##### Data Set Construction

The user interface of the system allows users to add an image segment to a local data set of images. Once added, the image is then processed to yield image segments. The user can then label an image segment by editing the segment and moving on to the next image segment. This allows for easy and efficient human classification of data. If the user does not wish to keep the image, he or she may remove the image from the data set as well.

##### Data Set Cleaning

During the construction phase, errors may be introduced into the data set typically in the case of typos or forgetting which segment was currently being edited. The data set is cleaned by listing out all available classifications and presenting the user with all available segments associated with that classification. The user can then review the image segment as it was identified in the source image. If the user does not wish to keep the classification, he or she may remove the image from the data set as well.

##### Data Set Statistics

The data set consists of 496 comic book covers pulled from the Cover Browser database of comic book covers. The first 62 consecutive published comic book covers where used from Action Comics, Amazing Spider-man, Batman, Captain America, Daredevil, Detective Comics, Superman, and Wonder Woman and then processed by the image processing subsystem yielding 24,369 image segments. 11,463 of these segments represented classifiable segments which were then labeled by hand over the course of two weeks; the remaining segments were then discarded.

In total, there were 239 classifications identified in the data set among 18 categories. Text, clothing, geography, and transportation categories accounting for 90% of the data set. Since the majority of classification were incidental, only those classifications having 50 or more image segments were considered by the application leaving a total of 38 classifications.

##### Classifier Evaluation

For the 38 classifications meeting the minimum criteria for classification, the K-Nearest Neighbor approach worked well in distinguishing between text classifications from other classifications and between intra-text classifications for both all-vs-all and one-vs-all schemes.

 All-vs-All K-Nearest Neighbor Performance. One-vs-All K-Nearest Neighbor Performance.

The Support Vector Machine approach presented unremarkable results for both all-vs-all and one-vs-all methods. In the former, only a few pairings resulted in acceptable error rates whereas the later presented only a couple acceptable error rates.

 All-vs-All Support Vector Machine Performance. One-vs-All Support Vector Machine Performance.

For both classification methods presented, the all-vs-all method yielded superior results to the one-vs-all method. In comparing the two classifier methods, the K-Nearest Neighbor seems to have done better than the Support Vector Machine approach, contrary to what was expected from literature. Both classifier methods are used in production mode.

#### Production Mode

Production mode allows the end user to add an image to the data set and then review the most likely classifications produced by evaluating each image segment against the available set of classifiers. The end user is then expected to review each segment and accept or reject the suggested classification. Aside from this additional functionality, production mode is nearly identical in functionality to training mode.

### Conclusions

The time spent on this project was well spent. I met the objectives that I laid out at the beginning of the project and now have a better understanding of the image processing algorithms and machine learning concepts from a theoretical and practical point of view.

### Future Work

#### Segmentation

One issue with the existing implementation is that it over segments the image. Ideally, fewer segments would be produced that are more closely aligned with their conceptual classification. There are a number of popular alternatives to the approach taken, such as level set methods, which should be further investigated.

#### Classification

The approach taken to map scaled versions of the image segments to a $2^{10}$ space is simple to implement, but it did not assist well in the classification process. Alternative mappings such as histogram models should be evaluated in the future to decrease classification times and to determine if classification error rates can be reduced.

#### System User Interface

While it was disappointing to have spent so much time building a data set only to have to limit what was considered, it assisted me in building a user interface that had to be easy and fast to use. The application can certainly be developed further and adapted to allow for other data sets to be constructed, image segmentation methods to be added and additional classifications to be evaluated.

#### System Scalability

The system is limited now to a single machine, but to grow and handle more classifications, it would need to be modified to run on multiple machines, have a web-based user interface developed and a capable database to handle the massive amounts of data that would be required to support a data set on the scale of the complete Cover Browser’s or similar sites’ databases (e.g., 450,000 comic book covers scaled linearly would require 546 GiB of storage.) Not to mention data center considerations for overall system availability and scalability.

### References

Aly, Mohamed. Survey on Multiclass Classification Methods. [pdf] Rep. Oct. 2011. Caltech. 24 Aug. 2012.

Asmar, Nakhle H. Partial Differential Equations: With Fourier Series and Boundary Value Problems. 2nd ed. Upper Saddle River, NJ: Pearson Prentice Hall, 2004. Print.

Bousquet, Olivier, Stephane Boucheron, and Gabor Lugosi. “Introduction to Statistical Learning Theory.” [pdf] Advanced Lectures on Machine Learning 2003,Advanced Lectures on Machine Learning: ML Summer Schools 2003, Canberra, Australia, February 2-14, 2003, Tübingen, Germany, August 4-16, 2003 (2004): 169-207. 7 July 2012.

Boyd, Stephen, and Lieven Vandenberghe. Convex Optimization [pdf]. N.p.: Cambridge UP, 2004. Web. 28 June 2012.

Burden, Richard L., and J. Douglas. Faires. Numerical Analysis. 8th ed. Belmont, CA: Thomson Brooks/Cole, 2005. Print.

Caruana, Rich, Nikos Karampatziakis, and Ainur Yessenalina. “An Empirical Evaluation of Supervised Learning in High Dimensions.” [pdf] ICML ’08 Proceedings of the 25th international conference on Machine learning (2008): 96-103. 2 May 2008. 6 June 2012.

Fukunaga, Keinosuke, and Patrenahalli M. Narendra. “A Branch and Bound Algorithm for Computing k-Nearest Neighbors.” [pdf] IEEE Transactions on Computers (1975): 750-53. 9 Jan. 2004. 27 Aug. 2012.

Gerlach, U. H. Linear Mathematics in Infinite Dimensions: Signals, Boundary Value Problems and Special Functions. Beta ed. 09 Dec. 2010. Web. 29 June 2012.

Glynn, Earl F. “Fourier Analysis and Image Processing.” [pdf] Lecture. Bioinformatics Weekly Seminar. 14 Feb. 2007. Web. 29 May 2012.

Gunn, Steve R. “Support Vector Machines for Classification and Regression” [pdf]. Working paper. 10 May 1998. University of Southampton. 6 June 2012.

Hlavac, Vaclav. “Fourier Transform, in 1D and in 2D.” [pdf] Lecture. Czech Technical University in Prague, 6 Mar. 2012. Web. 30 May 2012.

Hsu, Chih-Wei, Chih-Chung Chang, and Chih-Jen Lin. A Practical Guide to Support Vector Classification. [pdf] Tech. 18 May 2010. National Taiwan University. 6 June 2012.

Kibriya, Ashraf M. and Eibe Frank. “An empirical comparison of exact nearest neighbour algorithms.” [pdf] Proc 11th European Conference on Principles and Practice of Knowledge Discovery in Databases. (2007): 140-51. 27 Aug. 2012.

Marshall, A. D. “Vision Systems.” Vision Systems. Web. 29 May 2012.

Panigraphy, Rina. Nearest Neighbor Search using Kd-trees. [pdf] Tech. 4 Dec. 2006. Stanford University. 27 Aug. 2012.

Pantic, Maja. “Lecture 11-12: Evaluating Hypotheses.” [pdf] Imperial College London. 27 Aug. 2012.

Platt, John C. “Fast Training of Support Vector Machines Using Sequential Minimal Optimization.” [pdf] Advances in Kernel Methods – Support Vector Learning (1999): 185-208. Microsoft Research. Web. 29 June 2012.

Sonka, Milan, Vaclav Hlavac, and Roger Boyle. Image Processing, Analysis, and Machine Vision. 2nd ed. CL-Engineering, 1998. 21 Aug. 2000. Web. 29 May 2012.

Szeliski, Richard. Computer vision: Algorithms and applications. London: Springer, 2011. Print.

Tam, Pang-Ning, Michael Steinbach, and Vipin Kumar. “Classification: Basic Concepts, Decision Trees, and Model Evaluation.” [pdf] Introduction to Data Mining. Addison-Wesley, 2005. 145-205. 24 Aug. 2012.

Vajda, Steven. Mathematical programming. Mineola, NY: Dover Publications, 2009. Print.

Welling, Max. “Support Vector Machines“. [pdf] 27 Jan. 2005. University of Toronto. 28 June 2012

Weston, Jason. “Support Vector Machine (and Statistical Learning Theory) Tutorial.” [pdf] Columbia University, New York City. 7 Nov. 2007. 28 June 2012.

Zhang, Hui, Jason E. Fritts, and Sally A. Goldman. “Image Segmentation Evaluation: A Survey of Unsupervised Methods.” [pdf] Computer Vision and Image Understanding 110 (2008): 260-80. 24 Aug. 2012.

Images in this post are used under §107(2) Limitations on exclusive rights: Fair use of Chapter 1: Subject Matter and Scope of Copyright of the of the Copyright Act of 1976 of Title 17 of the United States Code.

Written by lewellen

2012-10-01 at 8:00 am

## Summer

It’s time for my annual “It’s too beautiful outside to be inside” post. As next Fall begins to creep back onto my calendar, I’ll be wrapping up some projects that I’m currently working on and starting some new ones. Hoping to finish an Android tablet platformer, a computer vision project and a machine translation system by the end of the year. Smaller posts falling into the realm of algorithms, business and finance are also in the queue.

If you enjoy the content on this site, I recommend adding it to your RSS reader and keeping an eye out for future posts. In the mean time, take a look through the archives and find something interesting to read.

Written by lewellen

2012-07-01 at 8:00 am

Posted in Announcements

## Tropical Representation of the All-Pairs Shortest Path Problem

### Motivation

While I was doing my Abstract Algebra research the other month, I came across an interesting way of simplifying the representation of the all-pairs shortest path problem using Tropical Geometry. I thought it was pretty clever, so I thought I’d do a quick write-up.

### Problem Statement

The all-pairs shortest path problem is to identify the minimum path cost, $\Omega(\pi) = \sum_{e \in \pi} \omega(e)$, out of the possible paths $\pi_{i,j} \in \Pi_{i,j}$ between vertices $v_{i}$ and $v_{j}$.

### Proposition

Consider a weighted directed graph (digraph), $G = (V, E, \omega)$, consisting of vertices, $V$, and directed edges (arcs), $E \subseteq V \times V$, and a function, $\omega : E \to \overline{\mathbb{R}}_{+}$, yielding the weight of an edge. Only those weights from the positive affinely extended real numbers, $\overline{\mathbb{R}}_{+} = \mathbb{R}_{+} \cup \lbrace \infty \rbrace$, are allowed per the problem statement. The adjacency matrix representation, $D \in \overline{\mathbb{R}}_{+}^{\lvert V \rvert \times \lvert V \rvert}$, of $G$ is given by the following matrix:

$D_{i, j} = \begin{cases} 0 & i = j \\ \omega((v_{i}, v_{j})) & (v_{i}, v_{j}) \in E \\ \infty & \text{otherwise} \end{cases}$

Now, consider a semi-ring over $x, y \in \overline{\mathbb{R}}_{+}$ whose additive operator, $\oplus \in \overline{\mathbb{R}}_{+} \to \overline{\mathbb{R}}_{+}$, is given by the minimum function, $x \oplus y = \min(x,y)$, and whose multiplicative operator, $\odot \in \overline{\mathbb{R}}_{+} \to \overline{\mathbb{R}}_{+}$, is given by addition, $x \odot y = x + y$. The additive unit is given by infinity, $x \oplus \infty = x$, and the multiplicative unit by zero, $x \odot 0 = x$. This semi-ring is the Tropical Semi-ring $\mathbb{T} = \left ( \overline{\mathbb{R}}_{+}, \oplus, \odot \right )$. (The namesake of tropical is in honor of Brazilian Imre Simon who developed this branch of mathematics.)

Linear algebra constructs can be tropicalized to yield the familiar definitions for matrix addition and multiplication for matricies $A, B \in \overline{\mathbb{R}}_{+}^{n \times m}$ and $C \in \overline{\mathbb{R}}_{+}^{m \times n}$.

$\displaystyle \left (A \oplus B \right )_{i, j} = A_{i,j} \oplus B_{i, j}$

$\displaystyle (A \odot C)_{i,j} = \bigoplus_{k}^{m} A_{i, k} \odot C_{k, j}$

Given the two prior statements, the elegant solution to the all-pairs shortest path problem is given by taking powers of the adjacency matrix: $D_{G}^{\odot \lvert V \rvert - 1}$.

### Proof

To see how this works out, start with $D_{G}$. The matrix represents the minimum cost between any two adjacent vertices. In other words, the minimum cost for all paths containing a single edge. The next inductive step is to consider paths containing at most two adjacent edges. Squaring the adjacency matrix yields all such paths. When the matrix is squared, each edge is concatenated to all other adjacent edges and the minimum weight of the paths is selected. This thought process can iterated as follows:

$\displaystyle D_{G}^{\odot r} = D_{G}^{\odot r - 1} \odot D_{G}$
$\displaystyle D_{i, j}^{\odot r} = \bigoplus_{k}^{m} D_{i, k}^{\odot r - 1} \odot D_{k, j}$
$\displaystyle D_{i, j}^{\odot r} = \min { \lbrace D_{i, k}^{\odot r - 1} + D_{k, j} \rbrace }$ $\forall k \in [1, m]$

The result is a typical Bellman equation. A graph can have at most $\lvert V \rvert - 1$ edges between any two vertices, thus, the solution to the all-pairs shortest path problem is given by $\displaystyle D_{G}^{\odot \lvert V \rvert - 1}$.

### Example

As a worked example, consider the following graph whose set of vertices is given by the set $V = \lbrace a, b, c, d \rbrace$, set of arcs by $E = \lbrace (a,b), (a,c), (a,d), (b,c), (b, d), (c,d) \rbrace$ and weight function, $\omega$, as labeled on the graph.

The all-pairs shortest paths are given by the following calculations where the row and column coordinates correspond to the vertices of $V$. Values in bold denote a change in the shortest path between two vertices.

$D_{G} = \begin{pmatrix}0 & 1 & 8 & 12\\\infty & 0 & 2 & 10\\\infty & \infty & 0 & 3 \\ \infty & \infty & \infty & 0 \end{pmatrix}$ $D_{G}^{\odot 2} = \begin{pmatrix}0 & 1 & \textbf{3} & \textbf{11}\\\infty & 0 & 2 & \textbf{5}\\\infty & \infty & 0 & 3 \\ \infty & \infty & \infty & 0 \end{pmatrix}$ $D_{G}^{\odot 3} = \begin{pmatrix}0 & 1 & 3 & \textbf{6}\\\infty & 0 & 2 & 5\\\infty & \infty & 0 & 3 \\ \infty & \infty & \infty & 0 \end{pmatrix}$

### Computational Complexity

From asymptotic standpoint, tropical matrix multiplication is still on the order of traditional matrix multiplication of $\mathcal{O}(\lvert V \rvert^{3} )$. Computing the all-pairs shortest path problem using this approach is on the order of $\mathcal{O}(\lvert V \rvert^{4})$ since we must perform the tropical matrix multiplication $\lvert V \rvert - 1$ times. Now, This can be improved slightly since tropical matrix multiplication is associative, so we can leverage the repeated squaring approach and reduce the time complexity down to $\mathcal{O}(\lvert V \rvert^{3} \log{\lvert V \rvert})$.

The time complexity can be further reduced to $\mathcal{O}(\lvert V \rvert^{3} )$ using the Floyd-Warshall Algorithm, which is another dynamic programming approach that is similar in form to the tropical representation of the problem. In essence, it follows the same base case, but it’s recurrence statement only considers a range of vertices with respect to the two vertices being considered. A more in depth review of the algorithm can be found in the references.

### References

Floyd-Warshall’s Algorithm.” Algorithmist. Web. 12 Apr. 2012.

Cormen, Thomas H., Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. “25.2 The Floyd-Warshall Algorithm.” Introduction to Algorithms. 2nd ed. Cambridge, MA: MIT, 2001. 629-35. Print.

Diestel, Reinhard. Graph theory. Heidelberg New York: Springer, 2010.

Laface, Antonio. Introduction to Tropical Geometry [pdf]. 29 Nov. 2006. Web. 11 Apr. 2012.

Maclagan, Diane, and Bernd Sturmfels. Introduction to Tropical Geometry [pdf]. 4 Nov. 2009. Web. 9 Apr. 2012.

Mohri, Mehryar. “Semiring Frameworks and Algorithms for Shortest-Distance Problems” [pdf]. Journal of Automata, Languages and Combinatorics 7 (2002) 3: 321-50. 8 Aug. 2002. Web. 31 Mar. 2012.

Written by lewellen

2012-06-01 at 8:00 am