A customer operates a forum where programmers ask each other questions, provide answers, and rate questions by giving them "ups" and "downs". The forum has a core expert community that provides good answers and valuable insights. However, these experts often waste their time handling questions of little to no value: marking questions as duplicates and redirecting them, closing topics with incoherent or irrelevant questions, etc. Because of this, the overall efficiency of the system suffers.
Automatic detection of low-value questions in a technical Q&A forum
The customer wants to improve system efficiency as measured by the mean time between a question being posted and the first accepted (upvoted) answer. How do we translate this request into ML terms? Should this be treated as a classification problem (good question or not) or a regression problem (predicted number of up/down votes)? Which metric should be used? Accuracy, precision, recall, AUC, etc.: which is most relevant to the situation? Is the problem symmetric? We're more concerned with losing a valuable question than with missing several low-value ones. What are our resource constraints? We probably need a solution that runs the moment a question is posted, so we can't use resource-heavy algorithms.
Understanding the project objectives and requirements and converting them into an ML problem definition
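The asymmetry noted above can be encoded directly into the evaluation metric. A minimal sketch, with illustrative labels and weights (1 = low-value question, 0 = valuable question; a false positive means wrongly flagging a valuable question, which we penalize more heavily):

```python
def asymmetric_cost(y_true, y_pred, fp_weight=5.0, fn_weight=1.0):
    """Lower is better; losing a valuable question costs fp_weight per miss.

    The 5:1 ratio is a placeholder, to be agreed with the customer.
    """
    # False positive: a valuable question (0) flagged as low-value (1).
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    # False negative: a low-value question (1) that slipped through (0).
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return fp_weight * fp + fn_weight * fn

y_true = [0, 0, 1, 1, 0, 1]
y_pred = [0, 1, 1, 0, 0, 1]  # one valuable question wrongly flagged
print(asymmetric_cost(y_true, y_pred))  # 1 FP + 1 FN -> 5.0 + 1.0 = 6.0
```

Comparing models on such a cost, rather than plain accuracy, keeps the customer's "don't lose valuable questions" priority in view from the start.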
Downloading the questions and additional information, then deciding whether to use date-time stamps, user IDs and other technical information. Cleaning the data by removing intrusive elements such as empty or corrupted questions, questions asked in Chinese on an English-language forum, or questions embedded as text in a picture. Are more low-value questions asked by first-time users? Maybe we want to construct separate models for first-time and veteran users' questions. Are most questions plain text, or do they use embedded code that should be parsed and analyzed separately? Do we have enough questions with up/down votes to construct a metric, or are there many old questions predating the up/down vote system that can't be used and should be removed?
Data collection and exploration, detection of data quality issues. Gaining initial insights, recognizing potential hidden information
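These cleaning and exploration steps can be sketched with pandas. The table, the column names (`body`, `user_id`, `votes`) and the filters are illustrative assumptions, not the customer's real schema; in particular, the ASCII check stands in for a proper language detector:

```python
import pandas as pd

# Toy version of the raw question table; column names are assumptions.
raw = pd.DataFrame({
    "user_id": [1, 1, 2, 3, 4],
    "body": [
        "How do I sort a dict?",
        "",                       # empty question: to be removed
        "什么是闭包？",            # Chinese question on an English forum
        "Why is my loop slow?",
        "What does yield do?",
    ],
    "votes": [4, 0, None, None, 3],  # None: predates the voting system
})

cleaned = raw[raw["body"].str.strip() != ""]         # drop empty questions
cleaned = cleaned[cleaned["body"].map(str.isascii)]  # naive non-English filter
cleaned = cleaned.dropna(subset=["votes"]).copy()    # drop pre-voting-era rows

# Flag first-time users, to explore whether they ask more low-value questions.
counts = raw.groupby("user_id").size()
cleaned["first_time"] = cleaned["user_id"].map(counts).eq(1)
print(cleaned)
```

Even this toy pass surfaces the questions raised above: how much data survives cleaning, and how large the first-time-user slice is.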
Recognizing which features make a statistically significant contribution to the problem at hand, and removing those that don't: is our users' geolocation relevant to question quality? Is the vast majority from the same region? Extracting useful features for learning and constructing new ones when needed: parsing date-time stamps, text vectorization, one-hot encoding of keyword tags, etc. If we're splitting the data into training and test sets, how is the split made? By time? By user ID?
Construction of the final dataset, data cleaning, feature selection and feature engineering
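A minimal sketch of the feature-engineering and splitting steps, assuming a toy schema with a posting timestamp, question text and a single keyword tag (all names are illustrative). Here the split is made by time, so the test set mimics "future" questions:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

df = pd.DataFrame({
    "posted_at": pd.to_datetime(
        ["2023-01-05", "2023-02-10", "2023-03-01", "2023-04-20"]),
    "body": ["how to sort a list", "sort a dict by value",
             "undefined behaviour in c", "segfault in c loop"],
    "tag": ["python", "python", "c", "c"],
})

tags = pd.get_dummies(df["tag"], prefix="tag")  # one-hot keyword tags
vec = TfidfVectorizer()
text = vec.fit_transform(df["body"])            # text vectorization

cutoff = pd.Timestamp("2023-03-01")             # split by time, not at random
train = df[df["posted_at"] < cutoff]
test = df[df["posted_at"] >= cutoff]
print(text.shape, list(tags.columns), len(train), len(test))
```

A user-ID split would instead keep each user's questions entirely in one side, which matters if per-user style leaks information.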
The heart of the learning process: selecting and building the model. We need to choose an algorithm: be it simple logistic regression, XGBoost or an ensemble of neural networks, it must be chosen based on the available resources and the peculiarities of the problem. Then we train it, making sure to avoid overfitting, and tune hyperparameters to maximize its performance.
Selecting and applying a model, calibrating its performance, testing and perfecting it
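A minimal modeling sketch along these lines: a lightweight logistic regression (cheap enough to score a question the moment it is posted), with class weights reflecting the asymmetric cost and a small grid search over the regularization strength. The data is synthetic and the 5:1 weighting is a placeholder assumption:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the feature matrix: 1 = low-value question.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

grid = GridSearchCV(
    LogisticRegression(class_weight={0: 5, 1: 1}),  # protect valuable questions
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},       # regularization strength
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```

In practice the cross-validation score here would be the asymmetric cost agreed with the customer, not plain accuracy.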
Finally, our model is ready: it marks all low-value questions with astounding precision and leaves all relevant questions intact. But will it boost the customer's metric? To check, we should design and conduct an A/B test and determine the statistical significance of the results. If A/B testing is impossible or undesired, we can resort to some form of causal impact inference.
In-depth review and analysis of the model, the solution it offers, and its suitability to the business issue at hand
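The A/B-test step can be sketched as a two-sample Welch's t-test on the target metric, mean time to the first accepted answer. The group sizes and timings below are synthetic placeholders, not real forum data:

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(42)
# Hours until the first accepted answer, per question.
control = rng.exponential(scale=10.0, size=2000)    # forum as-is
treatment = rng.exponential(scale=8.0, size=2000)   # model filters low-value questions

# Welch's t-test: no equal-variance assumption between the groups.
stat, p_value = ttest_ind(treatment, control, equal_var=False)
print(f"control mean {control.mean():.1f} h, "
      f"treatment mean {treatment.mean():.1f} h, p = {p_value:.2g}")
```

A small p-value here supports the claim that filtering shortened the time to a first accepted answer; with heavily skewed waiting times, a nonparametric test such as Mann-Whitney U is a common alternative.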