Technical Skills Summary

Professional Societies

Eta Kappa Nu

Tau Beta Pi



Nearly Optimal N-ary Search Tree

Master's Thesis

C++, MySQL, C#, Javascript

The advent of a data centered scientific community as well as vast improvements in hardware necessary to store said data has caused a veritable explosion of available data in the world today. Terabyte upon terabyte of information is compressed and stored in a variety of structures. From the simple list to complex maps, the goal of data storage has always been fast and reliable retrieval while being compact and consistent. This model is a great platform. It is simple to teach, trivial to understand, and standard across the world given a set of consistent data.

However real data is different, especially when considering application or aggregated data. Consider a word counting application. It can be assumed that the words found in a particular text belong to some subset of the English (for the purposes of this example) language. By nature, some words in the dictionary will occur more often than others. So, would it not be intuitive to store these entries or nodes, as will be referred to, such that there is less overhead to retrieve the node from the data structure?

Indeed caching structures work in the same way. They retrieve blocks of data for repeated use. However when the data grows to several orders of magnitude, caches become less effective. Furthermore, caches can only retrieve in cache lines, when in reality, data sets with distributed frequency may not have nodes with a high relative frequency in the same line.

There becomes a necessity to not only accept that data contains an inherent frequency distribution, but also to exploit and utilize it in order to reduce average performance overheads when accessing the data structure. This paper introduces the Nearly Optimal N-ary Search Tree (NONST), a data structure that is designed to reduce overhead costs when accessing the data contained within it. By using the frequencies given by the nodes, a NONST is able to structure itself using the principles of a B+ tree while structuring nodes in a nearly optimal fashion. Meaning, the configuration of the nodes in the tree will lead to an average search cost that is near the minimum possible average search cost for that set of nodes.

In modern day applications where data access is a recurring and predictable, a structure such that the average access time is reduced would be highly beneficial to performance.

Feature Driven Geographical Districting

Project for Machine Learning

Python, MySQL

Politics affect everything: where we live, where we vote, and who becomes our leaders. Through a process called gerrymandering, politicians can effectively decide elections before the first ballot is cast by manipulating the borders of voting districts. In order to remove politics from the process of drawing borders, we present a method for feature driven geographical districting. Using K-means as an execution model, a variety of seed methods and evaluation algorithms were used to create districts that reduced the deviation of the population between districts while maintain a high ratio of continuity within the district. Using this evaluation technique, we present a map that achieves these goals. While the solution presented is not the optimal solution, this paper presents a platform for further study and investigation.

Parallel Stastics Analyzer (PaStA)

Project for High Performance Computing

C++, MySQL, OpenMP, MPI, CUDA, OpenACC

In order to clearly understand data in a dataset, it is necessary to see the distribution of the data. By creating a histogram of the data allows users to see the distribution of the dataset while also highlighting any anomalies that may exist. The Parallel Statistic Analyzer (PaStA) scans over a given dataset and produces such a histogram based on the frequency of entries that exist within a certain standard deviation range while also displaying the mean and standard deviation of the data. Such a histogram can help a data administrator note trends while also identifying possible outliers agnostic of the dataset provided. The program is designed to work with any set of numbers. However for the purposes of testing, a random number generator library (rngs and rvgs) with a set seed will be used to generate a geometric distribution of high variance. This generator was chosen because of its high variance, resulting in larger, more meaningful histograms. It was designed by Steve Park and Dave Geyer in conjunction with a paper on random number generation. It will also provide similar histogram results regardless of set size.

Intuitive Teleprescence for the Disabled

Capstone Project


Our Capstone project takes the concept of Telepresence and applies it to assisted living applications for handicapped individuals. Telepresence is the feeling of being elsewhere as emulated with virtual reality technology. For this project, we plan on designing a remote-controlled robot that is equipped with a robotic manipulator arm, a lift system, and a camera gimbal system to get a better view of the environment. The camera system will stream video live to an Oculus Rift VR headset which, when worn, will create the feeling of telepresence as if the user was travelling along with the robot. The head tracking capability provided by the Rift will correspond to the camera gimbal's movement, allowing the user to freely look around the surroundings of the robot to provide an unparalleled field of view. Finally, by using the Nintendo Wii Remote to provide users with an intuitive motion control device to manipulate the robotic arm, the final product will be very simple to use for accessibility purposes. Our hope is that the final product will be modular and extensible, allowing our concept to be used for a variety of other applications as well.

Sensitivity Analysis of the Nearly Optimal Binary Search Tree

Project for Simulation and Performance Analysis

C++, MySQL

Traditional Binary Search Trees (BST or BT) offer powerful tools to effectively organize and search for data given a known search parameter. Our approach, the Probability Binary Search Tree (PBST or PBT), considers the known probability of each node in order to balance the probability of the left and right sub trees at any given node while reducing the number of subsequent tree levels. Our experiments test the effectiveness of our structure against a known dataset in order to show the superiority of our algorithm. Further experiments show the effectiveness of our algorithm based on differing data set sizes and distributions. Our conclusions show that our PBST will not perform worse than the BST and will in fact see clear performance advantages.

© 2017 - Dominic J. Catalano