A.J. Goldsman

About

I am a Data Scientist with a background in Chemical & Biomolecular Engineering from UCLA, specializing in the intersection of deep technical implementation and high-level business strategy.

Currently a Data Science Manager at Capital One, I lead the development of high-performance Python frameworks that empower enterprise-wide teams to aggregate metrics and derive actionable insights from massive datasets. My career has spanned critical roles for the U.S. Census Bureau's Economy-Wide Statistics Division and Deloitte’s Risk & Financial Advisory practice, consistently focusing on Natural Language Processing, Machine Learning, and Process Automation.

My engineering roots at UCLA instilled a "first-principles" approach to problem-solving, which I now apply to the world of Big Data. Since my 2018 pivot at NovellusDX, I've been obsessed with turning raw information into strategic assets. I currently live in the Washington, D.C. area and am always looking to collaborate with fellow tech enthusiasts in the DMV region; please reach out and let's connect!

Core Tech Stack

Programming Languages

Python, SQL, JavaScript, MATLAB, C++

Python Libraries

NumPy, Pandas, Matplotlib, Seaborn, scikit-learn, spaCy, NLTK, Bokeh, Flask, Plotly, Transformers, Haystack, SQLGlot, SQLFluff

Software Tools

QGIS, Jenkins, Vim, Elasticsearch, Kibana, NiFi, Git, Snowflake

Case Studies

Factors Affecting Global Life Expectancy: Identifying key public health drivers in a complex global dataset
Performed exploratory data analysis (EDA) using Winsorization to handle outliers, Box-Cox transformations for normalization, and heatmaps for correlation analysis. Isolated the most statistically significant features impacting life expectancy to inform predictive modeling.
IMDB Sentiment Classification: Automating the detection of viewer sentiment in thousands of unstructured movie reviews
Built a Natural Language Processing (NLP) pipeline utilizing a Bernoulli Naive Bayes classification algorithm. Successfully categorized reviews as having positive or negative sentiment with high precision and recall.
U.S. Domestic Flight Analytics: Determining how U.S. airlines can optimize operations to improve revenue
Developed a comprehensive research proposal, rollout plan, and A/B test evaluation strategy based on domestic flight data. Provided a data-driven framework for airlines to validate revenue-generating business decisions.
Predictive Modeling House Prices: Accurately predicting residential sale prices modeled on over 70 data features
Compared four regression models: Ordinary Least Squares (OLS), Lasso, Ridge, and ElasticNet. Optimized the models to minimize error, identifying Lasso and Ridge as the most effective for preventing overfitting in high-dimensional data.
Weather Classification via Decision Trees: Predicting weather patterns using historical atmospheric data
Implemented Decision Tree and Random Forest classifiers, utilizing NLP for feature engineering. Demonstrated the superior accuracy of Random Forest ensembles in handling non-linear weather relationships.
Recipe Rating Prediction (NLP & SVM): Predicting user ratings from key terms and descriptive tags
Engineered features from text using NLP and trained a Support Vector Machine (SVM) classifier. Established a model capable of predicting whether a recipe would be highly rated based solely on its textual composition.
Precipitation Classification via Neural Networks: Classifying types of precipitation using complex meteorological sensor data
Compared the performance of a Multi-Layer Perceptron (MLP) neural network against Gradient Boosting machines. Determined that Gradient Boosting provided more robust results than the deep learning approach for this tabular dataset.
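
As a rough illustration of the Winsorization step mentioned in the life expectancy case study, the sketch below clips values that fall outside chosen percentiles. It uses only the standard library. The function name `winsorize`, the nearest-rank percentile rule, and the sample values are assumptions made for illustration, not the code used in the project.

```python
def winsorize(values, lower_pct=0.05, upper_pct=0.95):
    """Clip values outside the given percentiles (nearest-rank rule).

    Hypothetical helper for illustration only.
    """
    if not values:
        return []
    ordered = sorted(values)
    n = len(ordered)
    # Nearest-rank positions of the lower and upper cut-offs.
    lo = ordered[round(lower_pct * (n - 1))]
    hi = ordered[round(upper_pct * (n - 1))]
    # Values beyond the cut-offs are pulled in to the cut-off itself,
    # so outliers are tamed rather than dropped.
    return [min(max(v, lo), hi) for v in values]


sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1000]
print(winsorize(sample, 0.1, 0.9))  # [2, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10]
```

Unlike trimming, this keeps every observation in the dataset, which matters when rows are scarce.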

Side Projects

Overview

In my spare time, I enjoy building side projects that bridge the gap between deep-backend data engineering and accessible, user-facing utility to empower non-technical stakeholders. To date, I have released three Flask web services hosted on Render; keep an eye on this page for other projects I currently have in flight.

Dr. Playlist

Several years ago, I teamed up with another UCLA alumnus to use data science and machine learning to improve music classification beyond what current algorithms achieve, focusing on tracks' genres and styles through analysis of instrument type, chord progressions, meter, and several other features. Although the end goal of Dr. Playlist is to classify any given song based on those features, we first had to explore the limitations of current models to determine the best way to proceed.

Browser-based, Render-hosted Flask app that provides Disney song recommendations based on a song entered by the user. Dataset features were obtained by querying the Spotipy and LyricsGenius APIs:
Dr. Playlist - Disney Edition

Analysis and visualization of a subset of the Million Song Dataset, which was originally compiled by researchers at Columbia University:
Million Songs Dataset
Analysis and visualization of data extracted via web scraping with Scrapy from a website of Disney song lyrics:
Web Scraping: Disney Song Lyrics Analysis
Predictive genre classification for a set of 100,000+ tracks of music, with a comparison of the accuracies of several supervised machine learning models:
Dr. Playlist - Initial Analysis and Genre Classification via Supervised Machine Learning

Named Entity Recognition (NER) Annotator

Built a custom spaCy-integrated web application to reduce the high cost of manual data labeling. The tool pre-identifies entities via a trained classifier, allowing users to rapidly validate or correct annotations via a Flask-based UI.

Reduces manual labeling time by ~60% through pre-annotation.
Exports directly to spaCy's .spacy and .json formats.

Over the course of my career, I have often needed to manually annotate training data for machine learning applications, especially those related to Natural Language Processing. Although several annotation tools are available online, they all felt either too limited for my purposes or too expensive, so I developed my own. It is demoed here using recipe data, for which I have trained a spaCy Named Entity Recognition (NER) classifier to pre-identify and annotate relevant entities. The user can then add their own annotations by highlighting a word or phrase with the mouse.

Annotations can be changed or removed by clicking on the colored box surrounding the selection and choosing the appropriate option.

Final annotations, saved when the user clicks the "Record Annotations" button, are uploaded to this Google Sheet.
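
The annotate, correct, and record workflow described above can be sketched with character-offset spans. This is a minimal, standard-library illustration and not the tool's actual code: the `Span` class, the `add_span`, `remove_span`, and `to_json` helpers, the label names, and the JSON layout are all hypothetical.

```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class Span:
    """One annotated entity, stored as character offsets into the text."""
    start: int
    end: int
    label: str


def add_span(spans, text, start, end, label):
    """Add a highlight, replacing any existing spans it overlaps."""
    if not (0 <= start < end <= len(text)):
        raise ValueError("span out of range")
    # Keep only spans that end before or start after the new highlight.
    kept = [s for s in spans if s.end <= start or s.start >= end]
    return sorted(kept + [Span(start, end, label)], key=lambda s: s.start)


def remove_span(spans, start, end):
    """Drop the span with exactly these offsets, if present."""
    return [s for s in spans if (s.start, s.end) != (start, end)]


def to_json(text, spans):
    """Serialize the text and its spans (assumed layout, for illustration)."""
    return json.dumps({"text": text, "entities": [asdict(s) for s in spans]})


text = "Add 2 cups flour and 1 tsp salt"
flour = text.index("flour")
# Pretend a trained model pre-annotated "flour"; the user then adds "salt".
spans = [Span(flour, flour + len("flour"), "INGREDIENT")]
spans = add_span(spans, text, text.index("salt"), len(text), "INGREDIENT")
print(to_json(text, spans))
```

Storing offsets rather than the highlighted strings keeps repeated words unambiguous and maps naturally onto span-based training formats.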

Kinship Linkage Tree

A Render-hosted Flask app for visualizing complex hierarchical data. The engine uses Graphviz to generate the linkages automatically from dynamic datasets.

Handles dynamic depth scaling for massive relationship trees.
Optimized SVG rendering for browser performance.

In 2022, I was tasked with keeping an electronic record of my family's genealogy, to serve as a replacement for the paper copy being curated manually by my great-uncle.
Instead of simply using text boxes and arrows, I opted for an automated Pythonic solution that has longevity, one that will allow us to continue to add new data for generations to come. This solution also optimizes the layout of the branches, illustrating the relationships by generation in a compact but easily readable way. For privacy reasons, I have recreated the functionality here using the British monarchy's family tree as a sample dataset.
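
To give a sense of how a generation-aligned layout might be produced, the sketch below turns parent-child pairs into Graphviz DOT text, placing each generation on its own rank. It is a simplified assumption of the approach, not the app's actual code: the `build_dot` function and the tiny sample dataset are illustrative, and names are assumed to contain no double quotes.

```python
from collections import defaultdict, deque


def build_dot(parent_child_pairs):
    """Return DOT source with one `rank=same` group per generation."""
    pairs = list(parent_child_pairs)
    children = defaultdict(list)
    has_parent = set()
    for parent, child in pairs:
        children[parent].append(child)
        has_parent.add(child)
    # Roots are people who never appear as anyone's child.
    roots = sorted((set(children) | has_parent) - has_parent)

    # Breadth-first walk assigns each person a generation depth.
    generation = {}
    queue = deque((root, 0) for root in roots)
    while queue:
        person, depth = queue.popleft()
        if person in generation:
            continue
        generation[person] = depth
        for child in children.get(person, []):
            queue.append((child, depth + 1))

    by_generation = defaultdict(list)
    for person, depth in generation.items():
        by_generation[depth].append(person)

    lines = ["digraph family {", "  rankdir=TB;", "  node [shape=box];"]
    for depth in sorted(by_generation):
        names = " ".join(f'"{name}";' for name in sorted(by_generation[depth]))
        lines.append(f"  {{ rank=same; {names} }}")
    for parent, child in pairs:
        lines.append(f'  "{parent}" -> "{child}";')
    lines.append("}")
    return "\n".join(lines)


sample = [
    ("Elizabeth II", "Charles III"),
    ("Charles III", "William"),
    ("Charles III", "Harry"),
]
print(build_dot(sample))
```

The resulting text can be passed to Graphviz (for example, `dot -Tsvg`) to render the tree, and adding a new family member only requires appending another pair to the data.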
