Understanding customer behavior and preferences is one of the most important challenges for any firm, and involves problems like customer segmentation, modeling and predicting consumer choices as well as determining the impact of assortments and prices on consumer demand. Many of these problems have become increasingly important in the current age of Big Data, where firms have the ability to collect granular individual-level data at various points of interaction with the customer, both online and offline. Availability of such granular data allows a firm to greatly benefit from micro-segmentation of customers, leading to micro-targeting in the form of personalized product offerings, promotions, and recommendations. The existing approaches in the marketing and operations literature are incapable of scaling to such large amounts of data. Further, the most popular approach is to use the Expectation Maximization (EM) algorithm to infer the latent characteristics of the customer population, and even though it performs really well in practice, it suffers from some fundamental issues like slow convergence, sensitivity to initialization as well as lacking any theoretical guarantees on the obtained solution.

Our goal is to propose alternative techniques for solving these problems that address some of the existing limitations: (1) provable guarantees on the quality of the obtained solution (2) robust to modeling assumptions and (3) fast and scalable to large amounts of data. Some of our techniques are inspired by recent work in the machine learning literature on large-scale and sparse optimization.

Past work includes the design of a novel reputation system for crowd workers that is robust to manipulation and can help boost the accuracy of crowdsourced labeling tasks by filtering out unreliable and/or adversarial workers.

Banner Image

Drafts & Publications

A Data-Driven Approach to Customer Segmentation

Segmenting customers based on their preferences is a fundamental marketing problem. The current practice for segmentation relies on estimating the parameters of a mixture model using Expectation-Maximization (EM) based methodologies. The EM-based methodologies are not consistent, are sensitive to initialization, and exhibit slow numerical convergence. To address these issues, we propose a data-driven segmentation methodology that directly segments customers, without the need for estimating latent model parameters. Our approach relies on a 'model-based' projection technique, inspired by the recent projection-based techniques developed in the machine learning community for learning mixture models. Our approach is versatile (broadly applicable) and flexible (allows for incorporation of problem-specific structure). Theoretically, we derive separation conditions under standard assumptions that are required to guarantee asymptotic recovery of the underlying segments. Empirically, we demonstrate using synthetic data that our algorithm is fast, robust to model misspecification, and consistently outperforms popular benchmarks. Using real-world data, we characterize the segments recovered by our algorithm and demonstrate an automatic feature selection mechanism, which results in sharper separation of the segments than existing benchmarks.

Reputation-based Worker Filtering in Crowdsourcing

In this paper, we study the problem of aggregating noisy labels from crowd workers to infer the underlying true labels of binary tasks. Unlike most prior work which has examined this problem under the random worker paradigm, we consider a much broader class of adversarial workers with no specific assumptions on their labeling strategy. Our key contribution is the design of a computationally efficient reputation algorithm to identify and filter out these adversarial workers in crowdsourcing systems. Our algorithm uses the concept of optimal semi-matchings in conjunction with worker penalties based on label disagreements, to assign a reputation score for every worker. We provide strong theoretical guarantees for deterministic adversarial strategies as well as the extreme case of sophisticated adversaries where we analyze the worst-case behavior of our algorithm. Finally, we show that our reputation algorithm can significantly improve the accuracy of existing label aggregation algorithms in real-world crowdsourcing datasets.