While the threat landscape is extremely sophisticated and diverse, almost all threats involve communication with the internet at some stage of their attack. This communication could include attackers transmitting malicious payloads for initial access, ransomware communicating with command and control to exchange encryption keys, or espionage tools exfiltrating sensitive information to sharing sites.
These communications start either with a hostname that must be resolved to an IP address or with an IP address directly. In all cases, an IP address is ultimately the identifier of the communication. In simple terms, an IP address is a numerical label assigned to each device connected to a network that uses the internet. Like the address of your home, an IP address provides information that helps routing protocols on the internet find destinations.
Of course, not all communication with the internet is malicious; browsing your favorite social media site is one example. A social media site is hosted on an IP address where you can browse different individuals’ profiles and interact with them. News articles you click on from a friend’s post resolve and redirect to an IP serving content from a different site. Some IP addresses run multiple concurrent services or serve content for multiple sites from entirely different content owners. Identical content can also be served from multiple IP addresses. These dynamics make distinguishing IP addresses that are up to no good from those that are harmless a daunting task that all security vendors take on.
The Anatomy of an IP
The beginning of an IP address is the location of the destination network, called the Network Prefix, and the end is the location of the device on that network, called the Host ID. Network prefixes are written in a format that uses a slash (“/”) character followed by a decimal number to indicate how many bits belong to the network prefix. For example, 198.51.100.0/24 has 24 bits allocated to the network prefix and the remaining 8 bits reserved for host addressing, so the addresses 198.51.100.0 through 198.51.100.255 belong to this network. A subnet is a division of a larger network into smaller ones. To find the destination within a smaller network, a subnet mask divides the IP address into a network address and a host address. In the example in Figure 1, the subnet mask indicates that a packet should first route to the network address 172.16.2.0 and then to the host address 0.0.0.15.
Figure 1. IP Address breakdown.
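The prefix arithmetic described above can be reproduced with Python’s standard ipaddress module. This is only a minimal sketch: the helper name split_address is hypothetical, and the address 172.16.2.15 with a /24 mask is an assumption chosen to match the network and host parts named for Figure 1, since the mask itself is not stated in the text.

```python
import ipaddress

# The /24 example from the text: 24 network bits, 8 host bits.
network = ipaddress.ip_network("198.51.100.0/24")
print(network.network_address)    # first address in the range
print(network.broadcast_address)  # last address in the range
print(network.num_addresses)      # 2 ** 8 addresses in total


def split_address(address: str, prefix_len: int):
    """Split an IPv4 address into its network part and its host part.

    Hypothetical helper for illustration; prefix_len is the number of
    bits that belong to the network prefix.
    """
    iface = ipaddress.ip_interface(f"{address}/{prefix_len}")
    # Keep only the bits that the subnet mask leaves for the host.
    host_bits = int(iface.ip) & int(iface.network.hostmask)
    return iface.network.network_address, ipaddress.ip_address(host_bits)


# Assumed address and mask (not given in the text) for the Figure 1 example.
network_part, host_part = split_address("172.16.2.15", 24)
print(network_part, host_part)
```

Masking with the host mask rather than slicing the dotted string keeps the helper correct for prefix lengths that do not fall on an octet boundary, such as /20.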
IP addresses are often tied to the physical infrastructure of their host or provider, most commonly an Internet Service Provider (ISP). ISPs are the organizations that manage a given range of network prefixes. Some ISPs might spend a lot of effort maintaining a high reputation for security, while others, like “bulletproof hosting” providers, might openly welcome customers that want to host malicious content. The variation in threat levels between ISPs is often overlooked, but SophosAI has incorporated it into their IP detection framework.
Challenges with IP Detection
Despite the age and simplicity of the technique, block-lists at the IP/Domain level are still a very popular method of defense against malicious internet traffic. However, they are difficult to maintain and easy to evade because the IP space is very large and very active. One form of block-list evasion occurs when an IP that routes to a malicious location temporarily or conditionally routes to benign content, while a different IP from the same infrastructure routes to the original malicious content. This temporary change is called a “cool-down” period. It often results in security vendors reversing the IP’s reputation from malicious to benign, allowing the IP to be used again for malicious activity in the future.
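The evasion described above can be illustrated with a toy block-list. This is a sketch under stated assumptions, not SophosAI’s method: the addresses come from a reserved documentation range, the function names are hypothetical, and the fixed /24 neighborhood check is a deliberately crude stand-in for reasoning about shared infrastructure.

```python
import ipaddress

# Hypothetical block-list with a single entry from a documentation range.
BLOCKED_IPS = {ipaddress.ip_address("203.0.113.10")}


def is_blocked_exact(address: str) -> bool:
    """Classic block-list check: only an exact IP match counts."""
    return ipaddress.ip_address(address) in BLOCKED_IPS


def shares_prefix_with_blocked(address: str, prefix_len: int = 24) -> bool:
    """Flag an IP that shares a network prefix with any block-listed IP.

    The default /24 is an arbitrary assumption for illustration.
    """
    candidate = ipaddress.ip_address(address)
    for blocked in BLOCKED_IPS:
        # strict=False lets us build the network from a host address.
        net = ipaddress.ip_network(f"{blocked}/{prefix_len}", strict=False)
        if candidate in net:
            return True
    return False


# A neighboring IP from the same range slips past the exact-match list.
print(is_blocked_exact("203.0.113.77"))
print(shares_prefix_with_blocked("203.0.113.77"))
```

A fixed-prefix rule like this would over-block benign neighbors in practice, which is part of why a learned model over the IP space is attractive.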
SophosAI wants to deviate from the traditional reputation-based systems to an approach that is easier to maintain, learns quickly, and classifies an IP regardless of any “cool-down” periods. To do this, they trained a model on a novel representation of IP addresses with additional information that can be found either in public registration information or from additional AI methods that use an IP’s neighborhood to fill in the gaps.
The New Approach
SophosAI collected over 450,000 IP addresses hosting web content, labeled either malicious or benign using internal telemetry tools. They also collected over 400,000 IPs related to e-mail campaigns from existing anti-spam lists. Tamás Vörös, a data scientist in SophosAI, used existing knowledge of how IP address space is allocated to Internet Service Providers (ISPs) to generate a heat map visualizing web and spam activity, where blue points represent benign activity and red points represent malicious activity. For more details on how these plots were generated, please read the more in-depth SophosAI blog post.
Figure 2. Web-based IP addresses organized by ISP-defined blocks. Blue dots indicate an IP address that engaged in benign activity while red dots indicate malicious activity.
Vörös observed that there were distinct clusters of malicious activity and that some ISPs contribute to the malicious landscape out of proportion to the size of the IP space they own.
The next challenge is using this information in a way that an AI model can understand. With this challenge in mind, SophosAI developed a new representation for IPs that allows the model to efficiently evaluate both the IP address itself and all the subnets that could potentially be generated from that IP address. These subnets can map to the high-level structure of the internet, indicating which collections of related IP addresses are controlled by a single organization.
To make further use of the physical infrastructure associated with an IP address, information about Internet Service Providers (ISPs) was added to the feature set fed to a detection model. By observing a dense cluster of malicious activity in one segment of the IP space, a model can learn to determine whether different segments owned by the same ISP are also malicious. A challenge with using the ISP for this task, however, is that this information is not always available at detection time. To address this issue, a model was trained to map IPs to ISPs, and its output was used as an additional feature to train an IP detection model, improving results compared to using the IP address alone. Figure 4 shows the model’s understanding of the IP space, where each IP address is colored based on its respective ISP. Prior to training, the model does not differentiate well between IPs from one ISP versus another, but after training, you see a more distinct separation.
Figure 4. A visualization of IPs and their respective ISPs after model training. Each dot represents an IP address and the dot’s color indicates which ISP the IP belongs to. Before training, there is no real differentiation between the model’s understanding of the ISP space in terms of IP. However, after training, there are distinct clusters of IP addresses based on the ISP they belong to.
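The idea of evaluating an IP together with every subnet that could contain it can be sketched as follows. The text does not describe the actual representation SophosAI uses, so this is only an illustration of the underlying structure: the function names are hypothetical, and the bit encoding is one assumed way to hand an address to a model.

```python
import ipaddress


def enclosing_prefixes(address: str):
    """Return every IPv4 network containing the address, from /32 up to /1."""
    host = ipaddress.ip_network(f"{address}/32")
    # supernet(new_prefix=n) widens the network to an n-bit prefix.
    return [host] + [host.supernet(new_prefix=n) for n in range(31, 0, -1)]


def ip_to_bits(address: str):
    """Encode an IPv4 address as 32 bits, most significant bit first.

    Assumed encoding for illustration: the first n bits identify the
    enclosing /n network, so every subnet is a prefix of this list.
    """
    value = int(ipaddress.ip_address(address))
    return [(value >> shift) & 1 for shift in range(31, -1, -1)]


prefixes = enclosing_prefixes("198.51.100.7")
print(prefixes[8])                     # the enclosing /24 network
print(ip_to_bits("198.51.100.7")[:8])  # bits of the first octet
```

Because each enclosing network corresponds to a leading slice of the bit list, a model that reads the bits in order has access to every level of the hierarchy at once.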
Using these models, they visualized the IP space more clearly and found groups of malicious IPs, regardless of any “cool-down” periods over time. Using a data-driven AI model to assign these IPs to a cluster of either benign or malicious activity allows better detection of never-before-seen IPs.
Figure 5. Model detections on IP addresses. The scores are based on how likely an IP is to be malicious.
Bringing It All Together
With Vörös’ research, SophosAI can better understand malicious activity distributed across the IP space, both in web and e-mail traffic. Not only did Vörös find clusters of malicious activity based on physical infrastructure of network traffic, but he also mapped IPs to their physical infrastructure when unknown, allowing detection of a never-before-seen IP. Integrating these findings with other tools enhances protection for organizations from potentially devastating ransomware attacks, targeted financial cybercrime, and cyber-espionage attacks stemming from malicious web-based or email-based campaigns.