August 28, 2019 - Yongxin Xi

Quantifying Cluster Quality with Unsupervised Machine Learning

A one-time use of labels with unsupervised machine learning can expedite cluster assessments by providing context to reduce false positives.

Unsupervised machine learning (UML) is one of the essential components of DataVisor's approach to proactive fraud discovery. Because UML can start producing results quickly and accurately—as it doesn't require historical labels, lengthy training times, or frequent re-tuning—our solutions can surface patterns that indicate coordinated activity before attacks actually launch, thereby preventing downstream damage. However, because so much of the conversation around applications of UML focuses on this ability to operate without labels, what often goes unspoken is that UML can, in fact, benefit from the use of labels in certain circumstances and for specific purposes. The conditional use of labels in conjunction with UML applications is the primary subject of this article.

To begin, we'll explore the rationale for why DataVisor uses unsupervised machine learning.

Unsupervised Machine Learning: Uncovering Patterns Without Labels

Normal User Behavior
Cluster analysis, one of the main methods used in UML, is an ideal approach for identifying the kind of patterned behavior that often indicates fraudulent activity. Patterns are one of the key signals that differentiate legitimate and fraudulent behavior. So-called "normal" users are by nature unpredictable—we each have our habits, lifestyles, and behaviors, and we can be erratic, impulsive, and even random when it comes to our choices online. Additionally, normal user behavior is often subject to influences that can't necessarily be seen in the available data. For example, a user's shopping history online may be fairly consistent over the course of a year, but then one day, they might suddenly start buying hundreds of dollars' worth of goods a week from a retailer they've never shopped from before. Fraud? Possibly. However, it could be that they got a new puppy and are now regularly ordering items from PetSmart.

Fraudulent Behavior
Fraudster behavior is different. Fraudsters, of necessity, engage in patterned activity. They are continually driven by one goal—make as much money with as little effort as possible—and faced with one ongoing reality: if they don't move fast, they'll get caught. The combination of these two factors means that fraudsters don't have the luxury of impersonating legitimate users to the full measure of our idiosyncrasies—they have to make concessions in order to scale.

Data-Driven Distinctions Between Real and Fake Users
As a way to understand the distinction between the individualized behavior of real users and the patterned behavior of fake users operated by fraudsters, think of the challenges of filming enormous battle scenes for an epic movie like Troy, with scenes that required 150,000 soldiers! To create those scenes, real actors were used for the main characters, but the bulk of the soldiers were produced using AI-powered CGI technology. While the scenes worked as experienced in the final film, close inspection would reveal which actors were real and which were not, because the CGI characters were created "in bulk," and accordingly share discernible commonalities.

In the case of fraud, those commonalities—such as user profiles, digital fingerprints, scripted behaviors, and more—are similarly discernible, provided you have the means to execute "close inspection." This is what UML makes possible, and because the process relies entirely on the available data itself, no pre-existing labels are required. Put another way, historical information is not needed; the process is entirely data-driven, and actionable insights are derived directly from what is observable in the data itself. Compared to supervised machine learning (SML), this is a significant advantage, because SML's reliance on labels necessitates ongoing refreshes and lengthy retraining periods to maintain model quality. Another disadvantage of SML is that new fraud labels generally arrive late, and reactive models accordingly miss new and evolving fraud patterns.
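To make this concrete, here is a minimal sketch of label-free cluster discovery using an off-the-shelf algorithm (DBSCAN from scikit-learn). The feature set, parameters, and function name are illustrative assumptions for this article, not DataVisor's production system.

```python
# A minimal sketch: discover dense clusters of near-identical accounts
# without any fraud labels. DBSCAN stands in for a real detection engine.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

def find_suspicious_clusters(X: np.ndarray, eps: float = 0.5, min_size: int = 10):
    """Group accounts by behavioral similarity; no labels required.

    X has one row per account and one column per engineered feature
    (e.g., encoded IP subnet, device-fingerprint bucket, signup hour).
    """
    assignments = DBSCAN(eps=eps, min_samples=min_size).fit_predict(
        StandardScaler().fit_transform(X)
    )
    # DBSCAN marks isolated points as -1 ("noise"), which matches the
    # idiosyncratic behavior typical of normal users. Dense groups of
    # near-identical accounts are the clusters worth a closer look.
    return {int(c): np.flatnonzero(assignments == c)
            for c in set(assignments) if c != -1}
```

Accounts that land in no cluster behave like the unpredictable "normal" users described above; it is the dense groups that merit scrutiny.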

Use of Labels with Unsupervised Machine Learning

Despite the powerful results UML can deliver without labels, there are certain circumstances where using labels can, in fact, enhance performance by providing additional information that allows us to better contextualize discovered patterns in data.

Deceptive Patterns and the Importance of Contextual Detection
More often than not, clusters with suspicious links will confirm coordinated fraud patterns. However, sometimes in a client's data, a particular group that displays coordinated behavior will, in fact, be a good user group. For example, consider a bank application center that processes account openings by telephone. As a customer, you dial in, you provide answers and information to a bank agent, and they enter that information into a company computer. Over the course of a week, that agent may process 80 to 100 different applications. From a data standpoint, all these applications will have very similar digital fingerprints, and as a group, they may look abnormal despite originating from a legitimate source. To prevent false positives, clusters like these can be more readily deciphered and removed if some labels are provided. Doing so pushes the UML model to a higher level of precision.

Fraudulent behavior will often appear very similar. Attacks may originate from the same IP, or from a group of devices all using the same software, or the applications might all come in during a very specific time frame, or the email addresses associated with the accounts could all follow similar naming conventions. Without a deeper dive into the data, and without additional contextualization, it's challenging to know which group indicates a coordinated attack and which is legitimate. This is where labels can help us. In certain cases, good users will exhibit patterned behavior because an underlying mechanism promotes that patterning. These behaviors get clustered, raising the possibility of a false positive. However, by using a label that establishes a clarifying context, it is easy to quantify the quality of the cluster and determine that it is, in fact, benign.
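To illustrate how a small, one-time batch of labels could quantify cluster quality, consider the following sketch. The label format, function names, and threshold are hypothetical, chosen only to show the idea.

```python
def cluster_fraud_rate(members, labels):
    """labels: a partial mapping of account_id -> 'fraud' or 'good'."""
    known = [labels[a] for a in members if a in labels]
    if not known:
        return None  # no label coverage; cluster quality stays unknown
    return sum(1 for v in known if v == "fraud") / len(known)

def triage_clusters(clusters, labels, benign_threshold=0.05):
    """Use a one-time label sample to separate benign clusters (e.g., a
    call center whose applications share one fingerprint) from fraud."""
    flagged, suppressed = {}, {}
    for cid, members in clusters.items():
        rate = cluster_fraud_rate(members, labels)
        if rate is not None and rate < benign_threshold:
            suppressed[cid] = members  # coordinated but legitimate
        else:
            flagged[cid] = members     # retained for detection/review
    return flagged, suppressed
```

Under this scheme, a cluster like the bank application center above would score near zero and be suppressed, while a genuinely coordinated attack cluster would remain flagged.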

Data Errors
Another case where labels can be helpful is when there are errors in the data. Typically, a client's data team will use a script to pull information from their databases and then provide DataVisor with a blind data set. In this process, mistakes are inevitable, and there are always bugs to contend with. These issues may create artificial clusters that would not have existed if the data were error-free. In other cases, clients occasionally fail to collect certain features. For example, their SDK might fail to collect the device ID and pad the field with all zeros—a pattern that looks like fraud. In cases like these, labels can help us quickly unearth these issues and prevent our systems from producing false positives that are actually the result of data errors.
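A quick sanity check along these lines could catch such padding artifacts before they surface as false positives; the field name, record format, and sentinel values below are hypothetical examples.

```python
# Hypothetical padding sentinels an SDK bug might emit in place of a value.
PADDING_SENTINELS = {"0" * 15, "0" * 16, "", None}

def looks_like_data_error(members, records, field="device_id", dominance=0.9):
    """Flag a cluster held together by a padded or default feature value.

    A cluster linked by an all-zeros device ID reflects a collection bug
    (e.g., an SDK failing to read the ID), not coordinated fraud.
    records: mapping of account_id -> dict of raw fields.
    """
    values = [records[a].get(field) for a in members]
    return any(
        sum(v == sentinel for v in values) >= dominance * len(values)
        for sentinel in PADDING_SENTINELS
    )
```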

Labels: Supervised vs. Unsupervised Machine Learning

When it comes to the matter of labels, there is one fundamental way we can understand the difference between supervised and unsupervised machine learning: supervised machine learning cannot function without labels—unsupervised machine learning can.

Supervised machine learning (SML) uses labels to do feature selection—in other words, SML uses labels to "handpick" things it thinks might be useful for fraud detection. For example, consider a traditional account takeover (ATO) scenario. A credit card user never shops for toys. Suddenly, they start making extensive purchases in the toy category. This represents a category shift, and "category shift" is a signal; a feature that SML can use.
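As a toy illustration of such a feature, assume purchase records arrive as (category, amount) pairs; the function name and the top-k cutoff are made up for this example.

```python
from collections import Counter

def category_shift_score(history, recent, top_k=5):
    """Fraction of recent spend falling outside the user's usual top-k
    categories -- a feature a supervised model could weight using labels."""
    usual = {c for c, _ in Counter(c for c, _ in history).most_common(top_k)}
    recent_total = sum(amount for _, amount in recent)
    if recent_total == 0:
        return 0.0
    shifted = sum(amount for c, amount in recent if c not in usual)
    return shifted / recent_total

# A cardholder who never buys toys suddenly spends heavily on them:
# category_shift_score([("grocery", 40)] * 50, [("toys", 300)]) returns 1.0,
# a strong possible-ATO signal for an SML model to learn from.
```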

With unsupervised machine learning, the approach is entirely data-driven. To reveal the unknown, we cannot rely on labels, so we ask the data itself. We mine for suspicious clusters, and we do clustering in multi-dimensional subspaces where the underlying algorithm can automatically pick the features it finds most useful for clustering. With this approach, we don't use labels for detection; feature engineering and selection are automatic. If we do use a label, it is merely to expedite the process of quantifying cluster quality. It should be noted that, per client requests, we can also use labels to flag clusters and increase confidence regarding a detected cluster—this can, in turn, enhance our ability to capture related fraud.
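The subspace idea can be sketched crudely as follows: brute-force low-dimensional feature subspaces with a generic clustering algorithm and note which features each dense cluster forms on. This is only a teaching-scale illustration under those assumptions, not DataVisor's production approach, which operates at far higher dimensionality and scale.

```python
# Illustrative subspace mining: a cluster's subspace reveals which
# features the algorithm effectively "selected" -- no labels consulted.
from itertools import combinations
import numpy as np
from sklearn.cluster import DBSCAN

def mine_subspaces(X, feature_names, dim=2, eps=0.5, min_size=20):
    """Cluster in every dim-dimensional feature subspace and report the
    dense groups found, along with the features that defined them."""
    found = []
    for cols in combinations(range(X.shape[1]), dim):
        labels = DBSCAN(eps=eps, min_samples=min_size).fit_predict(X[:, list(cols)])
        for c in set(labels) - {-1}:
            found.append({
                "features": [feature_names[i] for i in cols],
                "members": np.flatnonzero(labels == c),
            })
    return found
```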

Conclusion

At DataVisor, we rely on unsupervised machine learning for its speed, its agility, and its adaptability. Modern fraudsters change tactics with high frequency, and supervised machine learning simply can't keep up. SML models decay quickly, and new labels are constantly required to maintain performance levels. With UML, by contrast, we can mine data for clusters, resolve data issues, and expedite cluster assessment with just a one-time use of labels. After that, the model will remain robust for years.

Most importantly, we have a much wider array of options. We can work with no labels at all. We can use labels one time for cluster assessment. We can use labels to find fraud related to known fraud users. We are flexible and can accommodate client requests to match the complexity, speed, and scale of modern fraud.

About Yongxin Xi
Dr. Yongxin Xi is Director of Engineering, Analytics, at DataVisor. She obtained her Ph.D. in machine learning from Princeton University in 2011 and has been an anti-spam and anti-fraud specialist ever since. Passionate about the mission to make the online world a safer place, she currently leads DataVisor’s U.S. trial team in building cutting-edge models to catch fraud quickly and accurately.