To achieve a full understanding of the use and application of ML algorithms, our participants work on a real-life industry project, translating theoretical knowledge into practice and overcoming realistic challenges.
~6 months
Real data provided by company
~400 work hours total
Experienced mentors provided by Y-DATA
2-3 students
Weekly meetings with company data-owner
App Domain matching
Develop a matching algorithm that recommends matches between applications and domains
Predicting formation of dry areas in DSW evaporation ponds
Predict the formation of dry areas in both the Salt and Carnallite Ponds
Evaluation of DL keypoint-matching approaches for structure from motion
Train deep learning models to detect, describe and match keypoints, and fine-tune these models to the specific domain of chronic-wound images
Cloaking Score for Taboola campaigns
Create an ML model that assigns a cloaking score to each new live campaign in Taboola
Diagnosis prediction
Build a high-performance classifier that predicts the physician's diagnosis from the information collected during the visit.
Prediction of chip failures at early stages of production flow
Build a model which predicts failures of manufactured chips based on indicators coming from the different stages of production.
Automatic representation of molecules as complicated features
Develop a property-prediction algorithm based on advanced molecular features.
Port mapping using behavioral vessel data
Use behavioral vessel data to map port-related areas of interest.
Speeding up Transformer-based NLP models
Train NLP models based on BERT, ELECTRA, RoBERTa and other architectures using a GPU, then experiment with various methods to reduce their complexity and run times on a CPU.
Pulmonary Embolism Identification
Build an algorithm aimed at detection and classification of PE cases, based on a freely available Kaggle dataset of chest CTPA images.
Full project cycle
The project work follows popular industry standards and methodologies, applying the growing set of tools the students acquire to methodically understand and solve a real-world problem. Upon graduation, our students have a full-cycle data science project in their portfolio, covering all industry-standard stages: Business Understanding, Data Understanding, Data Preparation, Modeling, and Evaluation.
Example Project
Automatic detection of low-value queries in a technical Q&A forum
A customer operates a forum where programmers ask each other questions, provide answers and rate questions, giving them "ups" and "downs". The forum has a core expert community that provides good answers and valuable insights. However, they often waste their time handling questions of little to no value: marking questions as duplicates and redirecting them, closing topics with incoherent or irrelevant questions, etc. Because of this, the overall efficiency of the system suffers.
The customer wants to improve the system's efficiency as measured by the mean time between a question being posted and the first accepted (upvoted) answer being given. How do we translate this request into ML terms? Should this be treated as a classification (good question or not) or regression (predicted number of up/down votes) problem? Which metric should be used? Accuracy, precision, recall, AUC etc. – which is most relevant to the situation? Is the problem symmetrical? We're more concerned with losing a valuable question than with missing several low-value ones. What are our resource constraints? We probably need a solution that runs immediately once a question is posted, so we can't use resource-heavy algorithms.
Understanding the project objectives and requirements and converting them into an ML problem definition
Business Understanding
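The asymmetry discussed above can be made concrete with a minimal sketch. The labels and predictions below are made up for illustration (1 = valuable question, 0 = low-value); the point is that accuracy alone can look acceptable while recall on the valuable class reveals how many good questions would be lost.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Precision and recall for one class, computed from scratch."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical labels: 8 valuable questions, 2 low-value ones.
y_true = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
# A classifier that mistakenly flags two valuable questions as low-value.
y_pred = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
prec, rec = precision_recall(y_true, y_pred, positive=1)
print(accuracy)  # 0.8  -- looks acceptable
print(rec)       # 0.75 -- a quarter of the valuable questions would be lost
```

Because losing valuable questions is the costly error here, recall on the valuable class (or a cost-weighted metric) is a more honest yardstick than raw accuracy.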
Downloading the questions and additional information, followed by a decision on whether to use date-time stamps, user IDs and other technical information. Cleaning the data by removing intrusive elements such as empty or corrupted questions, questions in Chinese asked in an English forum, or questions embedded as text in a picture. Are more low-value questions asked by first-time users? Maybe we want to construct separate models for first-time and veteran users' questions. Are most questions plain text, or do they use embedded code which should be parsed and analyzed separately? Do we have enough questions with up/down votes to construct a metric, or are there many old questions predating the up/down vote system which can't be used and should be removed?
Data collection and exploration, detection of data quality issues. Gaining initial insights, recognizing potential hidden information
Data Understanding
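The cleaning step described above might be sketched as follows. The record fields (`body`, `user_id`) are hypothetical stand-ins for whatever schema the real forum dump uses, and the non-ASCII ratio is a deliberately crude proxy for "question not in English".

```python
def non_ascii_ratio(text):
    """Fraction of characters outside the ASCII range."""
    return sum(ord(c) > 127 for c in text) / len(text) if text else 1.0

def clean(questions, max_non_ascii=0.5):
    """Drop empty/corrupted bodies and questions that are mostly non-English."""
    kept = []
    for q in questions:
        body = (q.get("body") or "").strip()
        if not body:                                 # empty or corrupted entry
            continue
        if non_ascii_ratio(body) > max_non_ascii:    # e.g. Chinese text in an English forum
            continue
        kept.append(q)
    return kept

raw = [
    {"body": "How do I reverse a list in Python?", "user_id": 1},
    {"body": "", "user_id": 2},                    # empty -> dropped
    {"body": "如何在Python中反转列表", "user_id": 3},  # mostly non-English -> dropped
    {"body": None, "user_id": 4},                  # corrupted -> dropped
]
print(len(clean(raw)))  # 1
```

In practice each of these filters would be tuned against the exploratory questions raised above (e.g. what fraction of questions the language filter actually removes).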
Recognizing which features provide a statistically significant contribution to the problem at hand, and removing those that don't: Is our users' geolocation relevant to question quality? Does the vast majority come from the same region? Extracting useful features for learning and constructing new ones when needed: parsing date-time stamps, text vectorization, one-hot encoding keyword tags, etc. If we're splitting the data into training and test sets, how is the split made? By time? By user ID?
Construction of the final dataset, data cleaning, feature selection and feature engineering
Data Preparation
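Two of the preparation steps mentioned above can be sketched in a few lines, assuming each question carries keyword tags and a POSIX timestamp (both hypothetical field names): one-hot encoding the tags, and splitting train/test by time rather than randomly so the model is always evaluated on later data than it was trained on.

```python
def one_hot_tags(questions):
    """Build a sorted tag vocabulary and encode each question as a 0/1 vector."""
    vocab = sorted({t for q in questions for t in q["tags"]})
    index = {t: i for i, t in enumerate(vocab)}
    vectors = []
    for q in questions:
        vec = [0] * len(vocab)
        for t in q["tags"]:
            vec[index[t]] = 1
        vectors.append(vec)
    return vocab, vectors

def time_split(questions, cutoff):
    """Everything posted before `cutoff` is training data, the rest is test."""
    train = [q for q in questions if q["posted"] < cutoff]
    test = [q for q in questions if q["posted"] >= cutoff]
    return train, test

qs = [
    {"tags": ["python", "list"], "posted": 100},
    {"tags": ["java"], "posted": 200},
    {"tags": ["python"], "posted": 300},
]
vocab, vecs = one_hot_tags(qs)
print(vocab)    # ['java', 'list', 'python']
print(vecs[0])  # [0, 1, 1]
train, test = time_split(qs, cutoff=250)
print(len(train), len(test))  # 2 1
```

A time-based split matters here because a random split would leak future voting behavior into the training set and overstate the model's performance.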
The heart of the learning process: selecting and building the model. We need to choose an algorithm: be it simple logistic regression, XGBoost, or an ensemble of neural networks, it must be chosen based on the resources available and the peculiarities of the problem. Then we train it, making sure we avoid overfitting, and tune hyperparameters to maximize its performance.
Selecting and applying a model, calibrating its performance, testing and perfecting it
Modeling
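As a toy stand-in for whichever model the team actually picks, here is logistic regression trained from scratch by gradient descent on one hypothetical feature (say, rescaled question length). In practice one would use a library implementation and a proper validation split for hyperparameter tuning and overfitting checks; this sketch only shows the mechanics.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logreg(xs, ys, lr=1.0, epochs=5000):
    """Fit weight w and bias b by gradient descent on the log-loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        gw = gb = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(w * x + b) - y  # derivative of log-loss wrt the logit
            gw += err * x
            gb += err
        w -= lr * gw / n
        b -= lr * gb / n
    return w, b

# Hypothetical data: very short questions (small x) tend to be low-value (y = 0).
xs = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]
ys = [0, 0, 0, 1, 1, 1]
w, b = train_logreg(xs, ys)
preds = [1 if sigmoid(w * x + b) >= 0.5 else 0 for x in xs]
print(preds)
```

The learning rate and epoch count here are themselves hyperparameters; on real data they would be tuned against a held-out set rather than fixed by hand.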
Finally, our model is ready: it marks all low-value questions with astounding precision and leaves all relevant questions intact. But will it actually boost the customer's metric? To check this, we should probably design and conduct an A/B test and determine the statistical significance of the results. If A/B testing is impossible or undesired, we may resort to some form of causal impact inference.
In-depth review and analysis of the model, the solution it offers, and its suitability to the business issue at hand
Evaluation
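The significance check for the A/B test above might be sketched as a two-proportion z-test. The success metric ("question answered quickly") and the counts below are invented for illustration; `a` is the control group and `b` the group where the model filters low-value questions.

```python
import math

def two_proportion_z(success_a, n_a, success_b, n_b):
    """z statistic and two-sided p-value for H0: the two rates are equal."""
    p_a, p_b = success_a / n_a, success_b / n_b
    p_pool = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value via the normal CDF, Phi(x) = 0.5 * (1 + erf(x / sqrt(2))).
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Made-up counts: control answers 400 of 1000 quickly, treatment 460 of 1000.
z, p = two_proportion_z(400, 1000, 460, 1000)
print(round(z, 2), round(p, 4))
```

If A/B testing is off the table, causal impact methods compare the observed post-launch metric against a counterfactual forecast instead, but the significance question being asked is the same.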