Czarina is inspired by the vast expanse of data to be discovered and explored every day.
I’m a data scientist currently working on data projects to stay innovative. I recently completed a data science immersive program focused on machine learning and data analysis. Before that, I was a practice manager, where I managed and analysed company data to drive the growth of the practice. (résumé)
My purpose is to serve people using data science tools. I’m skilled in data wrangling, data analysis and visualization, machine learning, and results communication. To keep pace with the field, I follow developments and try to understand new tools and techniques every day.
Data Analysis and Regression Modeling
Multimedia Data Processing and Analysis
Classification Predictive Modeling
Text Mining with Clustering Analysis
Recommendation Systems Development
Data Analysis and Regression Modeling
I analysed over 21,000 real estate transactions in King County, Washington, and built regression models to improve home valuation and real estate advisory.
The most important features for the bestimate model are square footage of living space, distance to Seattle, total distance to both Seattle and Redmond, distance to Redmond, and square footage of living space of the nearest 15 neighbors.
The bestimate model, a tuned Random Forest regressor, performs best at predicting house prices from over 20 features. It explains 88% of the variance in the data, and its predictions are off from the actual prices by USD 107,000 on average.
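For readers less familiar with the two metrics quoted for this model, here is a minimal sketch of how explained variance (R²) and mean absolute error are computed. It is not the project code: the prices are made-up numbers, and the function names `r_squared` and `mean_absolute_error` are illustrative.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute gap between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def r_squared(actual, predicted):
    """Share of the variance in `actual` that `predicted` accounts for."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Hypothetical sale prices in USD, for illustration only.
actual_prices = [300_000, 450_000, 500_000, 750_000]
predicted_prices = [320_000, 430_000, 540_000, 700_000]

mae = mean_absolute_error(actual_prices, predicted_prices)
r2 = r_squared(actual_prices, predicted_prices)
print(f"MAE: USD {mae:,.0f}")
print(f"R^2: {r2:.3f}")
```

An R² close to 1 means the predictions track most of the variation in prices, while the MAE states the typical miss in the same unit as the target.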
GitHub: Repository
Multimedia Data Processing and Analysis
Video-sharing applications today do not let users search videos by their content. As a solution, I developed a searchable video library that processes videos with machine learning techniques, including speech recognition, optical character recognition, and object detection, and returns exact matches to queries.
GitHub: Repository
Classification Predictive Modeling
I analysed over 25,000 survey responses and built classification models to guide public health efforts on vaccination outreach.
The Extra Trees model predicting H1N1 vaccination status attains the highest accuracy and precision, at 86% and 75% respectively.
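Accuracy and precision answer different questions, which is why both are reported. As a rough illustration (the confusion-matrix counts below are hypothetical and are not the project's results), they can be computed like this:

```python
def accuracy(tp, fp, tn, fn):
    """Share of all predictions that are correct."""
    return (tp + tn) / (tp + fp + tn + fn)


def precision(tp, fp):
    """Share of predicted positives that are truly positive."""
    return tp / (tp + fp)


# Hypothetical counts: positive = "received the vaccine".
tp, fp, tn, fn = 40, 10, 130, 20

acc = accuracy(tp, fp, tn, fn)
prec = precision(tp, fp)
print(f"accuracy:  {acc:.2f}")
print(f"precision: {prec:.2f}")
```

Here accuracy counts every correct prediction, while precision only looks at the respondents the model flagged as vaccinated.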
According to the permutation importances of the model, the top features that affect H1N1 vaccination status are seasonal flu vaccination status, direct recommendation from a doctor, and health insurance.
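Permutation importance measures how much a model's score drops when one feature column is shuffled. The sketch below shows the idea in plain Python under loud assumptions: the data, the two feature names, and `toy_model` are all hypothetical stand-ins, not the Extra Trees model or the survey data from the project.

```python
import random


def accuracy(model, rows, labels):
    """Fraction of rows the model labels correctly."""
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(rows)


def permutation_importance(model, rows, labels, n_features, n_repeats=20, seed=0):
    """Mean drop in accuracy after shuffling each feature column in turn."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    importances = []
    for j in range(n_features):
        drops = []
        for _ in range(n_repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)
            # Rebuild each row with only feature j replaced by a shuffled value.
            shuffled = [r[:j] + (v,) + r[j + 1:] for r, v in zip(rows, column)]
            drops.append(baseline - accuracy(model, shuffled, labels))
        importances.append(sum(drops) / n_repeats)
    return importances


# Hypothetical toy data: (had_seasonal_vaccine, unrelated_flag) per respondent.
rows = [(1, 0), (1, 1), (0, 0), (0, 1), (1, 0), (0, 1), (1, 1), (0, 0)]
labels = [r[0] for r in rows]  # in this toy set the label copies feature 0


def toy_model(row):
    """Stand-in classifier that only looks at feature 0."""
    return row[0]


scores = permutation_importance(toy_model, rows, labels, n_features=2)
print(scores)
```

Because the stand-in model ignores the second feature, shuffling it leaves accuracy unchanged and its importance is zero, while shuffling the first feature hurts accuracy.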
GitHub: Repository
Text Mining with Clustering Analysis
I analysed thousands of tweets with natural language processing to predict sentiment during SXSW and provide insights for brands and products at the conference.
Though the Extra Trees classifier has the highest test accuracy, the Multinomial Naive Bayes model performs best at classifying negative and positive sentiments.
Clustering analysis is performed to identify themes and topics that emerged, and recommendations are made accordingly.
GitHub: Repository
Recommendation Systems Development
I developed a recommender engine using 200,000 Rent the Runway reviews to show users relevant products tailored to their preferences.
Content-based recommenders and collaborative filtering systems are implemented, including K-Nearest Neighbors and Matrix Factorization algorithms.
Recommendations are generated from predicted user ratings. In evaluation, the Singular Value Decomposition model achieved the lowest mean absolute error, 0.5 on the five-point rating scale.
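To make the idea of predicting a user's rating concrete, here is a minimal sketch of user-based K-Nearest Neighbors collaborative filtering in plain Python. It is an illustration only, not the project implementation and not the SVD model: the users, items, and ratings are invented, and `cosine_similarity` and `predict_rating` are hypothetical helper names.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity over the items both users have rated."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[i] * b[i] for i in shared)
    norm_a = math.sqrt(sum(a[i] ** 2 for i in shared))
    norm_b = math.sqrt(sum(b[i] ** 2 for i in shared))
    return dot / (norm_a * norm_b)


def predict_rating(ratings, user, item, k=2):
    """Similarity-weighted average of the k nearest neighbors' ratings for `item`."""
    neighbors = [
        (cosine_similarity(ratings[user], other_ratings), other_ratings[item])
        for other, other_ratings in ratings.items()
        if other != user and item in other_ratings
    ]
    top_k = sorted(neighbors, reverse=True)[:k]
    total_weight = sum(sim for sim, _ in top_k)
    if total_weight == 0:
        return None  # no usable neighbors for this item
    return sum(sim * rating for sim, rating in top_k) / total_weight


# Hypothetical ratings on a five-point scale (user -> {item: rating}).
ratings = {
    "ana": {"dress_a": 5, "dress_b": 4},
    "bea": {"dress_a": 5, "dress_b": 4, "dress_c": 5},
    "cat": {"dress_a": 1, "dress_b": 5, "dress_c": 2},
}

prediction = predict_rating(ratings, "ana", "dress_c")
print(f"predicted rating: {prediction:.2f}")
```

The prediction leans toward the rating given by the neighbor whose tastes match most closely, and items with the highest predicted ratings would be the ones recommended.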
GitHub: Repository
Other interests include creative design and web development, such as building this business website for a client.