Phishing Terminology

Phishing Report

We gather information about phishing activity detected by multiple phishing feeds. Phishing reports are records that we collect from a threat intelligence feed (a blocklist) that identify the URL or domain name in the report as a phish. Some of the feeds that we employ report on multiple threats, e.g., spam, malware, botnet, phish, etc.

The same phishing activity can be identified in more than one feed.

When we prefix “report” with a particular threat such as phishing, we are indicating that we are using only those reports identified as phish for our analyses.

Phishing Attack

We define a phishing attack as a phishing site that targets a specific brand or entity. We determine if multiple phishing reports refer to the same phishing activity and collapse duplicates to yield phishing attacks.

Phishers commonly point many URLs to one phishing site and use wildcarding and redirection techniques to hide the location of the phishing site from investigators. They may use a single domain name to host several discrete phishing attacks against different companies or may use multiple URLs for any given phishing site to host multiple pages.

To identify unique attacks from this diverse environment of domains, hostnames, and URLs, we examine URLs and metadata associated with URLs. We apply a set of rules to compare URLs for similarities; for example, if the hostname in two or more URLs is the same, and if the report dates for those URLs fall within 7 days of each other, and if the target across those URL reports was the same, then we treat this set of URLs as involved in one phishing attack.
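
As an illustration only, the example rule above (same hostname, same target, report dates within 7 days) could be sketched in Python as follows. The `Report` type, the `group_into_attacks` function, and the choice to measure the 7-day window from the first report in each attack are assumptions made for this sketch, not a description of the production rule set.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from itertools import groupby
from urllib.parse import urlsplit

WINDOW = timedelta(days=7)  # report-date window from the rule above


@dataclass(frozen=True)
class Report:
    """Hypothetical minimal phishing report record."""
    url: str
    target: str
    reported: date


def group_into_attacks(reports):
    """Collapse reports into attacks; returns a list of lists of Report."""
    def key(r):
        # Same hostname and same target are both required by the rule.
        return (urlsplit(r.url).hostname or "", r.target.lower())

    attacks = []
    ordered = sorted(reports, key=lambda r: (key(r), r.reported))
    for _, same_host_and_target in groupby(ordered, key=key):
        current = []
        for r in same_host_and_target:
            # Assumption: the window is measured from the first report
            # of the current attack, so all members are within 7 days.
            if current and r.reported - current[0].reported > WINDOW:
                attacks.append(current)
                current = []
            current.append(r)
        attacks.append(current)
    return attacks
```

Under this sketch, two reports for the same hostname and target dated four days apart collapse into one attack, while a third report a month later starts a new one.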

Phishers use a wide variety of URL construction methods, so we formulated additional rules to group URLs into attacks based on observed cases. When we prepare our reports, we perform a final round of manual examination to find additional batches of related URLs. For example, some phishers generate multiple subdomains as part of one attack. In some cases, phishers register large numbers of pseudo-randomly generated domain names (see Automating Detection of “Random-looking” Algorithmic Domain Names). In such cases, if the date of the abuse report and the target (brand) are the same, and the reporting feed is the same, then we group all those URLs as part of one attack.

Our methodology may result in underreporting the number of attacks. Others who apply a similar methodology may independently arrive at slightly different (higher) numbers; for example, if one were to use the report date window of 30 days from the research paper, COMAR: Classification of Compromised versus Maliciously Registered Domains, but in all other respects apply our rules, the results might identify more attacks.

Phishing Domain Scores

To allow comparison of large and small Top-level Domains, we use a scoring metric, TLD phishing domain score, which is calculated by dividing the number of domain names reported for phishing in a TLD by the number of domains delegated from that TLD.

TLD Phishing Domain Score =
(number of unique domains reported for phishing in a TLD/domains delegated from TLD) * 10,000
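
A minimal Python sketch of this calculation is shown here for concreteness. The function name `phishing_domain_score` and the case and trailing-dot normalization used to de-duplicate domains are assumptions for illustration.

```python
def phishing_domain_score(reported_domains, domains_delegated):
    """Unique phishing domains per 10,000 domains delegated from a TLD."""
    if domains_delegated <= 0:
        raise ValueError("domains_delegated must be positive")
    # Assumption: domains are de-duplicated case-insensitively and
    # without a trailing root dot.
    unique = {d.lower().rstrip(".") for d in reported_domains}
    return len(unique) * 10_000 / domains_delegated
```

For example, 2 unique reported domains in a TLD with 40,000 delegated domains would give a score of 0.5.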

This score can highlight where high-volume phishers place multiple phishing URLs on one domain. In such cases, we want to accurately and separately count unique phishing domains from phishing attacks.

We use a similar scoring metric to compare small and large gTLD registrars. The gTLD registrar phishing domain score is a ratio of the number of domain names used for phishing to the number of registered domain names under management (DUM) at that gTLD registrar.

gTLD Registrar Phishing Domain Score =
(number of unique domains reported for phishing in a gTLD registrar DUM / DUM at gTLD Registrar) * 10,000

Lastly, we use a metric to compare small and large hosting networks (ASNs). The Phishing Attack score is a ratio of the number of IPv4 addresses associated with phishing attacks to the number of routed IPv4 addresses allocated to an autonomous system.

Phishing Attack score =
(number of unique IPv4 addresses associated with phishing attacks in ASN / routed IPv4 addresses allocated to ASN) * 10,000

In our annual landscape studies, we use a similar metric to measure the prevalence of phishing in TLDs and gTLD registrars. Here, we sum the unique phishing domains reported in each of the four quarters and divide by the average number of domains under management per gTLD registrar across those four quarters. We call this metric a Yearly Phishing Domain Score.

Yearly TLD Phishing Domain Score =
(number of unique phishing domains reported in a TLD at end of period / number of domains delegated from a TLD) * 10,000
Yearly gTLD Registrar Phishing Score =
(number of unique phishing domains reported in a gTLD registrar at end of period / number of domains under management at gTLD Registrar) * 10,000
Yearly Phishing Attack score =
(number of unique IPv4 addresses associated with phishing attacks in ASN at end of period / routed IPv4 addresses allocated to ASN) * 10,000
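
The yearly calculation described in the prose above (sum of the four quarterly counts divided by the average of the four quarterly denominators) might be sketched as follows. The function name and argument names are hypothetical, and the sketch follows the prose description rather than any particular reading of "at end of period".

```python
from statistics import mean


def yearly_phishing_domain_score(quarterly_phishing_domains, quarterly_dum):
    """Sum of four quarterly counts over the mean of four quarterly DUM values."""
    if len(quarterly_phishing_domains) != 4 or len(quarterly_dum) != 4:
        raise ValueError("expected four quarterly values for each input")
    average_dum = mean(quarterly_dum)
    if average_dum <= 0:
        raise ValueError("average domains under management must be positive")
    return sum(quarterly_phishing_domains) * 10_000 / average_dum
```

With quarterly counts of 10, 20, 30, and 40 and an average of 100,000 domains under management, this gives a yearly score of 10.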

Phishing Attack Scores

As we do for phishing domains, when we want to determine whether a hosting network (AS) has a higher or lower incidence of phishing relative to others, we use a scoring metric. Hosting Networks (AS) Phishing Attack Score is a ratio of the number of phishing attacks hosted in an Autonomous System to the number of IPv4 addresses routed by that hosting network (AS).

Hosting Networks (AS) Phishing Attack Score =
(number of phishing attacks reported as hosted at an ASN/IPv4 addresses routed by that ASN) * 10,000

In our annual landscape studies, we take the yearly count of unique phishing attacks reported and divide by the number of IP addresses routed by each hosting network (ASN). We call this metric Yearly Hosting Network (ASNs) Phishing Attack Score:

Yearly hosting networks (AS) Phishing Attack Score =
(number of unique phishing attacks reported as hosted at an ASN /
number of IP addresses routed by that ASN) * 10,000

Note that the calculation of these two metrics yields different results (we use different inputs for the numerators in the division); in particular, one cannot draw any conclusion by comparing the scores from a quarterly phishing score against an annual phishing score. Instead, we encourage comparisons of quarterly phishing attack scores over time, as well as annual phishing attack scores over time.

We use similar scoring metrics to compare small and large TLDs and small and large gTLD registrars.

Maliciously registered domain names

(also, malicious domain registrations).

We define a maliciously registered domain as a domain registered by a criminal to carry out a malicious or criminal act. For our studies, we distinguish maliciously registered domains from compromised domains, which we define as domain names that were registered for legitimate purposes but co-opted by criminals through some form of compromise.

For example, an attacker may hijack a legitimate user’s domain registrar account and alter the DNS to resolve a name or URL to a host that the attacker controls; here, the domain and DNS are compromised. An attacker may also exploit a vulnerability at a legitimate web hosting site, upload fake or malicious content to a web site, and create a phishing URL that points to the malicious content at the legitimate web site; in this case, the web server is compromised.

This distinction is important because it often identifies where investigators should go for assistance with mitigation of the criminal activity:

If the domain is maliciously registered, an investigator will seek assistance from a domain name registrar, a TLD operator, or the operator that provides DNS for the malicious domain to suspend the domain name registration or name resolution.
For a compromised domain, such suspensions further victimize a legitimate party already victimized by the compromise, so investigators will contact the administrator of the compromised host to have the malicious content removed.

Note that parties that discover phishing pages will do their best to blocklist URLs that identify malicious content to avoid further victimization, whereas they may block maliciously registered domain names (and thus all hostnames and URLs created using this name) to contain the pervasive malicious activity.

We use multiple criteria to determine if a domain flagged by a blocklist has been maliciously registered. The most important are:

Domain age. We look at the number of days between the domain’s registration date and the time the domain was blocklisted. (If the registration date is not available via RDAP or WHOIS, we use the time between when the domain was first observed in a passive DNS query and when it was blocklisted.) We consider domains blocklisted within 90 days of registration to be malicious. Studies indicate that such domains are usually too new to have been compromised, and that a high percentage of maliciously registered phishing domains tend to be used within days of registration. This method excludes some maliciously registered domains that are “aged” by bad actors (by not using them for more than 90 days) in an attempt to improve domain reputation. Other researchers have considered domains as malicious if they were blocklisted within a longer 150-day period after registration.
The composition of the domain name. We search listed domains for tell-tale strings that indicate malicious use. These include strings designed to mislead users (login, security, etc.), and brand names that are the targets of phishing attacks (citibank, whatsapp, etc.). We also search for close misspellings and variations of these terms, which criminals often register to evade simple matching measures. Our tell-tale string lists include strings that we have “observed in the wild,” i.e., seen used in confirmed attacks.
Common control and usage. We look for clear evidence of common control and usage, such as batches of domains registered at the same time at the same registrar and delegated to the same nameserver pair, and algorithmically generated domain names that follow patterns, such as sequences of numbers and letters. (See “bulk registrations” below.)
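
The first two criteria lend themselves to a short sketch. The Python below is illustrative only: the function names are hypothetical, the string list contains just the four examples named above, and the sketch deliberately does not say how the criteria are weighed against each other, since that is not specified here.

```python
from datetime import datetime, timedelta

MAX_AGE = timedelta(days=90)  # domain-age threshold from the text
# Only the example strings named in the text; a real list would be longer.
TELLTALE_STRINGS = ("login", "security", "citibank", "whatsapp")


def blocklisted_soon_after_registration(registered, blocklisted):
    """True if the domain was blocklisted within 90 days of registration."""
    return timedelta(0) <= blocklisted - registered <= MAX_AGE


def telltale_matches(domain):
    """Return the tell-tale strings that appear in the domain name."""
    name = domain.lower()
    return [s for s in TELLTALE_STRINGS if s in name]
```

For instance, a domain registered on 1 January and blocklisted on 1 March of the same year passes the age test, and `secure-login-citibank.example` matches both `login` and `citibank`.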

These methods share similarities with those used by other researchers, such as the COMAR and MalCom methods used by KOR Labs, the service provider to the NetBeacon Institute. Their and our calculations for malicious phishing registrations have historically been within a few percentage points of each other.

Bulk registered domain names

We define a set of domains to be bulk registered if the domains were blocklisted and at least ten domains were registered through the same registrar with no more than ten minutes between consecutive registrations. Domains within these sets usually share lexical/string characteristics, and the same nameserver set (registrar-nameserver combination), and we confirm the presence of those features for large batches. The domains in a bulk set are usually in the same gTLD, but sometimes bad actors register domains across several TLDs at the same time.

We take the complete set of cybercrime domains reported to the cybercrime feeds (whether or not they relate to the specific cybercrime we are investigating), and use Registration Data Directory Services (RDDS) such as RDAP or WHOIS (where that data is available and where access to that data is not rate limited) to determine the date and time each domain was registered. Using this data we can determine, for each registrar, the date-time order of cybercrime domain registrations. From there we can determine whether the time between consecutive registrations through the same registrar was within 10 minutes, and then whether there are sequences of at least ten such domains, to ascertain that the sequence comprises a set of bulk registered domains.
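
The procedure above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the input is a list of hypothetical `(domain, registrar, created)` tuples already obtained from RDAP or WHOIS, and the function name `find_bulk_sets` is invented for this example. It does not check lexical or nameserver similarity.

```python
from collections import defaultdict
from datetime import datetime, timedelta

MAX_GAP = timedelta(minutes=10)  # maximum gap between consecutive registrations
MIN_SET_SIZE = 10                # minimum number of domains in a bulk set


def find_bulk_sets(registrations):
    """Return (registrar, [domains]) for each run that qualifies as bulk."""
    by_registrar = defaultdict(list)
    for domain, registrar, created in registrations:
        by_registrar[registrar].append((created, domain))

    bulk_sets = []
    for registrar, rows in by_registrar.items():
        rows.sort()  # date-time order of registrations at this registrar
        run, previous = [], None
        for created, domain in rows:
            # A gap longer than ten minutes ends the current run.
            if previous is not None and created - previous > MAX_GAP:
                if len(run) >= MIN_SET_SIZE:
                    bulk_sets.append((registrar, run))
                run = []
            run.append(domain)
            previous = created
        if len(run) >= MIN_SET_SIZE:
            bulk_sets.append((registrar, run))
    return bulk_sets
```

Twelve blocklisted domains registered one minute apart at one registrar would form a single bulk set, while a further domain registered two hours later at the same registrar would not join it.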

Our method under-counts the size and occurrence of bulk registrations, because it counts only domains that were blocklisted. There may have been more registered domains in the bulk set, but the blocklist providers did not list them all. Our method also fails to capture smaller (lower volume) batch registrations, and those spaced out over longer time periods.

Note that bulk registrations and the concept of associated domain checks under policy-making consideration at ICANN are two related concepts. The domains in a bulk registration set are by definition related to each other and were presumably registered by the same party. A competently performed associated domain check by the registrar should uncover all the domains in a bulk set we identified, plus perhaps others.

In research about malicious bulk registrations and associated domain discovery, researchers in ICANN’s Office of the CTO found that “batch registrations are prevalent, significantly predict overall abuse rates, and are useful for pivoting and expanding from known malicious ‘seed’ domain sets.”

We consider bulk registered domains as one of the criteria used to determine if a domain was maliciously registered.

Hosting Network (ASN)

Autonomous system (AS) is a term used to describe a collection of networks that operate under a common administration. Its primary use is to identify peers and destinations in the global Internet routing system. Conceptually, routing at the AS level is a function of (i) identifying the autonomous systems that are adjacent to your AS, (ii) learning from your AS peers which destination autonomous systems you can reach by routing through them, and (iii) choosing which peer to use to optimally forward traffic to a given destination AS.

Autonomous systems are assigned unique numbers (ASNs) as part of the registration process that is required for operators to participate in the global Internet routing system. Autonomous system numbers are a sort of “shorthand”. Each number “represents” a list of IP address blocks (or IP prefixes) that are “reachable” in the AS. Cyber investigators are interested in destination AS numbers because this is the hosting network wherein an IP address that reportedly hosts phishing, malware, or other criminal site (or content) is located.

Whois services operated by Regional Internet Registries (ARIN, APNIC, AFRINIC, LACNIC, RIPE) provide registration data, including contact data, for autonomous systems and the IP address blocks that were allocated to autonomous system registrants. Some organizations operate several or dozens of autonomous systems. While we study the degree of “churn” in autonomous systems (adds and drops of IP address block allocations to an AS), we only report on individual autonomous systems, by number. Thus, we can call attention to individual hosting networks (ASN) where there are interesting concentrations of cybercrime activity, but we continue to explore ways to identify organizations where interesting concentrations of cybercrime activity are present across several of the autonomous systems under that organization’s administration.

Phishing Target Identification

We use URL blocklists that identify targets in the metadata included in phishing reports. Reports from the phishing feeds we consume vary slightly in their granularity and nomenclature. We compile lists of these variations and normalize spelling as part of our curation; for example, if one feed uses “PayPal” while another uses “PayPal Inc.”, we treat these as one target and normalize our data to a common form of the company name so that we can analyze brand data.
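
A rough Python sketch of this kind of normalization follows. The suffix pattern, the `ALIASES` table, and the `normalize_target` function are hypothetical stand-ins for a curated variation list, shown only to make the idea concrete.

```python
import re

# Assumption: a small set of trailing legal-entity suffixes to strip.
LEGAL_SUFFIXES = re.compile(r"[\s,]+(inc|llc|ltd|corp|co)\.?$", re.IGNORECASE)
# Hypothetical curated table mapping a folded name to its common form.
ALIASES = {"paypal": "PayPal"}


def normalize_target(raw):
    """Map a feed's target label to a common form of the brand name."""
    cleaned = LEGAL_SUFFIXES.sub("", raw.strip())
    key = re.sub(r"\s+", " ", cleaned).casefold()
    # Fall back to the cleaned label when no alias is known.
    return ALIASES.get(key, cleaned)
```

Here both `"PayPal"` and `"PayPal Inc."` normalize to `"PayPal"`.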

Some feeds pose classification differences. For example, WhatsApp is owned by Facebook. Some sources report WhatsApp as a separate brand, but other sources report the same WhatsApp phishing URLs as attacks against Facebook. We use the target reported by each feed, with the granularity (discrimination) that feed offers.

In some cases, one source may positively identify a URL as a phish against a specific target, but another source may only report the same URL as a phishing attack against “unknown” or “generic” brand. In these cases, we use the most detailed information available and attribute that attack to the specific brand. In the cases where an attack’s target is not determined by any feed, we set those attacks aside when analyzing brand data.

We do not rely exclusively on target identification that we find in URL blocklist metadata. We examine hostnames and URL paths for identical brand string matches, look-alike strings, and other deceptive strings that phishers employ. We also selectively apply fuzzy matching for brands that are common, persistent phishing targets.
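
To illustrate the idea of exact and fuzzy brand matching on hostnames and paths, here is a small Python sketch. It uses the standard library's `difflib.SequenceMatcher` purely as a stand-in similarity measure; the brand list, the 0.8 threshold, and the `brand_hits` function are assumptions for this example and do not describe the matching actually used.

```python
import re
from difflib import SequenceMatcher
from urllib.parse import urlsplit

BRANDS = ("paypal", "whatsapp")  # illustrative brand strings only


def brand_hits(url, threshold=0.8):
    """Return {brand: "exact" | "fuzzy"} for brands seen in host or path."""
    parts = urlsplit(url)
    text = f"{parts.hostname or ''}{parts.path}".lower()
    tokens = [t for t in re.split(r"[^a-z0-9]+", text) if t]
    hits = {}
    for brand in BRANDS:
        for token in tokens:
            if brand in token:
                hits[brand] = "exact"  # identical brand string match
                break
            # Look-alike strings: similarity ratio above the threshold.
            if SequenceMatcher(None, brand, token).ratio() >= threshold:
                hits.setdefault(brand, "fuzzy")
    return hits
```

In this sketch, a URL such as `http://paypa1-secure.example.com/whatsapp/login` yields a fuzzy hit for `paypal` (via the look-alike `paypa1`) and an exact hit for `whatsapp`.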

DNS Data

Some of our threat intelligence source feeds provide IP (A record) data and AS data. We retain these data, but for our studies, we want addressing information for every hostname reported, so we also query for the A record of every reported domain name that we collect and determine the AS by using Team Cymru’s IP to ASN mapping service. We use RIPE-NCC WHOIS to find AS name, organization, and IP prefix(es). We obtain the number of IPv4 addresses in an AS from BGPview by first querying for the IP prefixes allocated to an AS and then calculating the number of IPv4 addresses as the sum of addresses represented by the prefixes.
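
The final step, turning a list of prefixes into an address count, can be sketched with Python's standard `ipaddress` module. The function name is hypothetical, and collapsing overlapping prefixes before summing is an assumption added here to avoid double counting; it is not stated in the description above.

```python
import ipaddress


def count_ipv4_addresses(prefixes):
    """Sum the IPv4 addresses represented by a list of CIDR prefixes."""
    networks = [ipaddress.ip_network(p, strict=False) for p in prefixes]
    ipv4_only = [n for n in networks if n.version == 4]
    # Assumption: merge overlapping or adjacent prefixes so that an
    # address covered by two announcements is counted once.
    collapsed = ipaddress.collapse_addresses(ipv4_only)
    return sum(n.num_addresses for n in collapsed)
```

For example, `192.0.2.0/24` plus `198.51.100.0/25` gives 384 addresses, and adding the more specific `192.0.2.128/25` does not change the total.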

To identify TLDs we begin with the IANA root zone list. We also use the Public Suffix List to identify the zones in which registries offer third level registration, for example, names assigned (delegated) from co.uk. For gTLD domain names we obtain registry WHOIS to identify the sponsoring registrar, along with the registrar’s IANA ID for normalization.

In our tables and measurements, the number of domains in each gTLD and the number of gTLD domains sponsored by each registrar are obtained from the monthly ICANN reports for the latest month available when we begin writing a report or quarterly summary. We also refer to NTLDSTATS.com for domains under management (DUM). ccTLD domain counts are obtained from the web sites of the registry operators and from DomainTools.

Phishing feeds

We consume two types of phishing reporting services (feeds): URL block lists (URLBLs) and domain block lists (DBLs or DNSBLs).

Our URL source feeds for phishing (APWG, OpenPhish, and PhishTank) identify a target brand for each report. These sources determine the target by heuristics (e.g., they parse the content of the email phishing lure, match the logo images and wording on the phishing site to the legitimate brand site content, etc.) or by manual verification.

We also use the Spamhaus DBL. This feed does not provide target information but it does classify domains according to the type of threat the domain is used to perpetrate. For phishing reporting, we use only the DBL response codes 127.0.1.4 (phish domain) and 127.0.1.104 (abused legit phish). We do not include Spamhaus DBL-listed domains when we analyze brand data.
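
As a small illustration of the filtering described above, the Python sketch below keeps only results carrying the two phishing response codes. The input format, a list of hypothetical `(domain, response_code)` pairs already obtained from DBL lookups, and the function name are assumptions.

```python
# The two DBL response codes used for phishing reporting, per the text.
PHISHING_DBL_CODES = {
    "127.0.1.4": "phish domain",
    "127.0.1.104": "abused legit phish",
}


def phishing_domains(dbl_results):
    """Keep only (domain, label) pairs whose code is a phishing code."""
    return [
        (domain, PHISHING_DBL_CODES[code])
        for domain, code in dbl_results
        if code in PHISHING_DBL_CODES
    ]
```

Any domain returned with another response code is simply dropped from the phishing analysis in this sketch.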

Phishing Activity