Terminology
Phishing Report
We gather information about phishing activity detected by multiple phishing feeds. Phishing reports are records that we collect from a threat intelligence feed (a blocklist) that identify the URL or domain name in the report as a phish. Some of the feeds that we employ report on multiple threats, e.g., spam, malware, botnet, phish, etc.
The same phishing activity can be identified in more than one feed.
When we prefix “report” with a particular threat such as phishing, we are indicating that we are using only those reports identified as phish for our analyses.
Phishing Attack
We define a phishing attack as a phishing site that targets a specific brand or entity. We determine if multiple phishing reports refer to the same phishing activity and collapse duplicates to yield phishing attacks.
Phishers commonly point many URLs to one phishing site and use wildcarding and redirection techniques to hide the location of the phishing site from investigators. They may use a single domain name to host several discrete phishing attacks against different companies or may use multiple URLs for any given phishing site to host multiple pages.
To identify unique attacks from this diverse environment of domains, hostnames, and URLs, we examine URLs and metadata associated with URLs. We apply a set of rules to compare URLs for similarities; for example, if the hostname in two or more URLs is the same, and if the report dates for those URLs fall within 7 days of each other, and if the target across those URL reports was the same, then we treat this set of URLs as involved in one phishing attack.
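The example rule above (same hostname, same target, report dates within 7 days) can be sketched in Python as follows. This is a minimal illustration, not our production rule set: the function name `group_into_attacks` and the tuple layout are hypothetical, and we assume here that "within 7 days of each other" means each report falls within 7 days of the previous report in the same group.

```python
from collections import defaultdict
from datetime import date, timedelta
from urllib.parse import urlsplit

WINDOW = timedelta(days=7)  # report-date window taken from the rule in the text


def group_into_attacks(reports):
    """Group (url, target, report_date) tuples into attacks.

    Reports that share a hostname and a target are chained into one attack
    as long as each report is within WINDOW of the previous one (assumption).
    """
    by_key = defaultdict(list)
    for url, target, reported in reports:
        host = (urlsplit(url).hostname or "").lower()
        by_key[(host, target.lower())].append((reported, url))

    attacks = []
    for (host, target), items in by_key.items():
        items.sort()  # order by report date
        current = [items[0]]
        for item in items[1:]:
            if item[0] - current[-1][0] <= WINDOW:
                current.append(item)
            else:
                # Gap longer than the window: close this attack, start another.
                attacks.append((host, target, [u for _, u in current]))
                current = [item]
        attacks.append((host, target, [u for _, u in current]))
    return attacks
```

Each returned tuple holds the hostname, the target, and the URLs collapsed into that attack.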
Phishers use a wide variety of URL construction methods, so we formulated additional rules to group URLs into attacks based on observed cases. When we prepare our reports, we perform a final round of manual examination to find additional batches of related URLs. For example, some phishers generate multiple subdomains as part of one attack. In some cases, phishers register large numbers of pseudo-randomly generated domain names (see Automating Detection of “Random-looking” Algorithmic Domain Names). In such cases, if the date of the abuse report, the target (brand), and the reporting feed are all the same, then we group all those URLs as part of one attack.
Our methodology may result in underreporting the number of attacks. Others who apply a similar methodology may independently arrive at slightly different (higher) numbers; for example, if one were to use the report date window of 30 days from the research paper, COMAR: Classification of Compromised versus Maliciously Registered Domains, but in all other respects apply our rules, the results might identify more attacks.
Phishing Domain Scores
To allow comparison of large and small Top-level Domains, we use a scoring metric, TLD phishing domain score, which is calculated by dividing the number of domain names reported for phishing in a TLD by the number of domains delegated from that TLD.
TLD Phishing Domain Score =
(number of unique domains reported for phishing in a TLD/domains delegated from TLD) * 10,000
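As a worked illustration of the formula above, the following Python sketch computes the score for a large and a small TLD. The function name and the domain counts are hypothetical and chosen only to show why normalizing per 10,000 delegated domains makes TLDs of different sizes comparable.

```python
def phishing_domain_score(phishing_domains: int, domains_in_zone: int) -> float:
    """Unique phishing domains per 10,000 domains delegated from the TLD."""
    if domains_in_zone <= 0:
        raise ValueError("domains_in_zone must be positive")
    return phishing_domains / domains_in_zone * 10_000


# Hypothetical figures, for illustration only.
large_tld = phishing_domain_score(12_000, 150_000_000)  # many reports, huge zone
small_tld = phishing_domain_score(600, 400_000)         # few reports, small zone
```

With these made-up inputs the small TLD scores far higher than the large one, even though it has far fewer phishing domains in absolute terms.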
This score can highlight where high-volume phishers place multiple phishing URLs on one domain. In such cases, we want to accurately and separately count unique phishing domains from phishing attacks.
We use a similar scoring metric to compare small and large gTLD registrars. The gTLD registrar phishing domain score is a ratio of the number of domain names used for phishing to the number of registered domain names under management (DUM) at that gTLD registrar.
gTLD Registrar Phishing Domain Score =
(number of unique domains reported for phishing in a gTLD registrar DUM / DUM at gTLD Registrar) * 10,000
In our annual landscape studies, we use a similar metric to measure the prevalence of phishing in TLDs and gTLD registrars. Here, we take the sum of the unique phishing domains reported in each of the four quarters and divide by the four-quarter average of the domains delegated from the TLD or under management at the gTLD registrar. We call this metric a Yearly Phishing Domain Score.
Yearly TLD Phishing Domain Score =
(number of unique phishing domains reported in a TLD across the year / number of domains delegated from a TLD) * 10,000
Yearly gTLD Registrar Phishing Score =
(number of unique phishing domains reported in a gTLD registrar across the year / number of domains under management at gTLD Registrar) * 10,000
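The yearly calculation can be sketched as below. This is an assumption-laden illustration: the function name is hypothetical, the inputs are made-up figures, and we follow the prose description (sum of the four quarterly counts divided by the average of the four quarterly domain counts).

```python
from statistics import mean


def yearly_phishing_domain_score(quarterly_phishing_domains, quarterly_domain_counts):
    """Yearly score: summed quarterly phishing domains over the average
    quarterly domain count, scaled to per-10,000 domains."""
    if len(quarterly_phishing_domains) != 4 or len(quarterly_domain_counts) != 4:
        raise ValueError("expected exactly four quarterly values for each input")
    return sum(quarterly_phishing_domains) / mean(quarterly_domain_counts) * 10_000


# Hypothetical registrar: phishing domains per quarter and DUM per quarter.
score = yearly_phishing_domain_score(
    [100, 150, 120, 130],
    [900_000, 1_000_000, 1_000_000, 1_100_000],
)
```

Because the numerator spans four quarters while the denominator is a single average, a yearly score is not directly comparable to a quarterly score for the same TLD or registrar.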
Phishing Attack Scores
As we do for phishing domains, when we want to determine whether a hosting network (AS) has a higher or lower incidence of phishing relative to others, we use a scoring metric. Hosting Networks (AS) Phishing Attack Score is a ratio of the number of phishing attacks hosted in an Autonomous System to the number of IPv4 addresses routed by that hosting network (AS).
Hosting Networks (AS) Phishing Attack Score =
(number of phishing attacks reported as hosted at an ASN/IPv4 addresses routed by that ASN) * 10,000
In our annual landscape studies, we take the yearly count of unique phishing attacks reported and divide by the number of IP addresses routed by each hosting network (ASN). We call this metric Yearly Hosting Network (ASNs) Phishing Attack Score:
Yearly hosting networks (AS) Phishing Attack Score =
(number of unique phishing attacks reported as hosted at an ASN /
number of IP addresses routed by that ASN) * 10,000
Note that the calculation of these two metrics yields different results (we use different inputs for the numerators in the division); in particular, one cannot draw any conclusion by comparing the scores from a quarterly phishing score against an annual phishing score. Instead, we encourage comparisons of quarterly phishing attack scores over time, as well as annual phishing attack scores over time.
We use similar scoring metrics to compare small and large TLDs and small and large gTLD registrars.
Maliciously registered domain names
(also, malicious domain registrations).
We define a maliciously registered domain as a domain registered by a criminal to carry out a malicious or criminal act. For our studies, we distinguish maliciously registered domains from compromised domains, which we define as domain names that were registered for legitimate purposes but co-opted by criminals through some form of compromise.
For example, an attacker may hijack a legitimate user’s domain registrar account and alter the DNS to resolve a name or URL to a host that the attacker controls; here, the domain and DNS are compromised. An attacker may also exploit a vulnerability at a legitimate web hosting site, upload fake or malicious content to a web site, and create a phishing URL that points to the malicious content at the legitimate web site; in this case, the web server is compromised.
This distinction is important because it often identifies where investigators should go for assistance with mitigation of the criminal activity:
If the domain is maliciously registered, an investigator will seek assistance from a domain name registrar, a TLD operator, or the operator that provides DNS for the malicious domain to suspend the domain name registration or name resolution.
For a compromised domain, such suspensions further victimize a legitimate party already victimized by the compromise, so investigators will contact the administrator of the compromised host to have the malicious content removed.
Note that parties that discover phishing pages will do their best to blocklist URLs that identify malicious content to avoid further victimization, whereas they may block maliciously registered domain names (and thus all hostnames and URLs created using this name) to contain the pervasive malicious activity.
For this measurement, we consider:
1. The age of the domain name: the number of days elapsed between domain registration and the use of the domain for a malicious purpose. In general, the older the domain name, the higher the likelihood that it is legitimate. Miscreants tend to use their domains within the first year of registration, before they must pay for renewal. The shorter the time between registration and use for phishing, the more likely the domain was maliciously registered.
2. The content of the domain name. We apply rules to determine whether the composition of the name contains indicators of misuse or harmful intent, for example, the presence of a famous brand, a misspelled brand, or a string intended to resemble a brand.
When the above criteria identify domains, we then look for clear evidence of common control and usage as an indicator to flag additional domains in a batch.
Our approach is similar to a methodology described in a research paper, COMAR: Classification of Compromised versus Maliciously Registered Domains. For our studies and analyses, we flag a domain name as malicious if the name (or delegated hostname) was reported for phishing within seven days of being registered, which is more conservative than the thirty (30) days used for the COMAR classification.
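The age criterion can be sketched as follows. This is a simplified illustration that covers only the registration-to-report interval, not the name-content rules or the batch analysis; the function name is hypothetical, and the `window_days` parameter is included only to show how the 7-day and 30-day windows differ.

```python
from datetime import date


def is_maliciously_registered(registered: date, first_reported: date,
                              window_days: int = 7) -> bool:
    """Flag a domain reported for phishing within window_days of registration."""
    age_days = (first_reported - registered).days
    # A negative age would mean a report that predates registration; not flagged.
    return 0 <= age_days <= window_days


# A domain reported 20 days after registration (hypothetical dates).
registered = date(2023, 1, 1)
reported = date(2023, 1, 21)
flag_7_day = is_maliciously_registered(registered, reported)                   # our window
flag_30_day = is_maliciously_registered(registered, reported, window_days=30)  # COMAR window
```

The same domain is flagged under the 30-day window but not under the 7-day window, which is why the shorter window yields more conservative counts.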
Hosting Network (ASN)
Autonomous system (AS) is a term used to describe a collection of networks that operate under a common administration. Its primary use is to identify peers and destinations in the global Internet routing system. Conceptually, routing at the AS level is a function of (i) identifying the autonomous systems that are adjacent to your AS, (ii) learning from your AS peers which destination autonomous systems you can reach by routing through them, and (iii) choosing which peer to use to optimally forward traffic to a given destination AS.
Autonomous systems are assigned unique numbers (ASNs) as part of the registration process that is required for operators to participate in the global Internet routing system. Autonomous system numbers are a sort of “shorthand”. Each number “represents” a list of IP address blocks (or IP prefixes) that are “reachable” in the AS. Cyber investigators are interested in destination AS numbers because this is the hosting network wherein an IP address that reportedly hosts phishing, malware, or other criminal site (or content) is located.
Whois services operated by Regional Internet Registries (ARIN, APNIC, AFRINIC, LACNIC, RIPE NCC) provide registration data, including contact data, for autonomous systems and the IP address blocks that were allocated to autonomous system registrants. Some organizations operate several or dozens of autonomous systems. While we study the degree of “churn” in autonomous systems (adds and drops of IP address block allocations to an AS), we only report on individual autonomous systems, by number. Thus, we can call attention to individual hosting networks (ASN) where there are interesting concentrations of cybercrime activity, but we continue to explore ways to identify organizations where interesting concentrations of cybercrime activity are present across several of the autonomous systems under that organization’s administration.
Phishing Target Identification
We use URL blocklists that identify targets in the metadata included in phishing reports. Reports from the phishing feeds we consume vary slightly in their granularity and nomenclature. We compile lists of these variations and normalize spelling as part of our curation; for example, if one feed uses “PayPal” while another uses “PayPal Inc.”, we treat these as one target and normalize our data to a common form of the company name so that we can analyze brand data.
Some feeds pose classification differences. For example, WhatsApp is owned by Facebook. Some sources report WhatsApp as a separate brand, but other sources report the same WhatsApp phishing URLs as attacks against Facebook. We use the target reported by each feed, with the granularity (discrimination) that feed offers.
In some cases, one source may positively identify a URL as a phish against a specific target, but another source may only report the same URL as a phishing attack against “unknown” or “generic” brand. In these cases, we use the most detailed information available and attribute that attack to the specific brand. In the cases where an attack’s target is not determined by any feed, we set those attacks aside when analyzing brand data.
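The normalization and "most detailed information available" steps can be sketched as below. The alias table, the set of unspecific labels, and both function names are hypothetical; the sketch simply returns the first specific target it finds and does not attempt to reconcile feeds that name different brands for the same URL.

```python
# Hypothetical alias table mapping feed spellings to one common brand name.
ALIASES = {
    "paypal": "PayPal",
    "paypal inc.": "PayPal",
}
UNSPECIFIC = {"", "unknown", "generic"}  # labels that carry no brand information


def normalize_target(raw: str):
    """Return the common brand name, or None for an unspecific label."""
    key = raw.strip().lower()
    if key in UNSPECIFIC:
        return None
    return ALIASES.get(key, raw.strip())


def resolve_target(feed_targets):
    """Pick the first specific target among the labels reported for one URL."""
    for raw in feed_targets:
        target = normalize_target(raw)
        if target is not None:
            return target
    return None  # no feed identified a target; set aside for brand analysis
```

For example, `resolve_target(["generic", "PayPal Inc."])` attributes the attack to PayPal, while `resolve_target(["unknown"])` returns `None`.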
We do not rely exclusively on target identification that we find in URL blocklist metadata. We examine hostnames and URL paths for identical brand string matches, look-alike strings, and other deceptive strings that phishers employ. We also selectively apply fuzzy matching for brands that are common, persistent phishing targets.
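One way to illustrate exact and look-alike brand matching on hostnames is with Python's standard `difflib`. This is only a sketch under stated assumptions: the brand list, the 0.8 similarity threshold, the tokenization on dots, hyphens, and underscores, and the function name are all hypothetical, and `difflib` stands in for whatever fuzzy-matching technique is actually applied.

```python
import re
from difflib import SequenceMatcher

BRANDS = ["paypal", "netflix"]  # hypothetical watch list of frequent targets


def brand_hits(hostname, brands=BRANDS, threshold=0.8):
    """Return {brand: "exact" | "look-alike"} for brand strings in a hostname."""
    host = hostname.lower()
    tokens = [t for t in re.split(r"[.\-_]", host) if t]
    hits = {}
    for brand in brands:
        if brand in host:
            hits[brand] = "exact"  # identical brand string appears in the name
        elif any(SequenceMatcher(None, t, brand).ratio() >= threshold for t in tokens):
            hits[brand] = "look-alike"  # e.g. a character swapped for a similar one
    return hits
```

A hostname such as `paypa1.example.com` (digit one in place of the letter l) is reported as a look-alike of "paypal", whereas `paypal-secure.example.com` is an exact match.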
DNS Data
Some of our threat intelligence source feeds provide IP (A record) data and AS data. We retain these data, but for our studies, we want addressing information for every hostname reported, so we also query for the A record of every reported domain name that we collect and determine the AS by using Team Cymru’s IP to ASN mapping service. We use RIPE-NCC WHOIS to find AS name, organization, and IP prefix(es). We obtain the number of IPv4 addresses in an AS from BGPview by first querying for the IP prefixes allocated to an AS and then calculating the number of IPv4 addresses as the sum of addresses represented by the prefixes.
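The last step, turning a list of prefixes into an IPv4 address count, can be sketched with Python's standard `ipaddress` module. The function names are hypothetical, and collapsing overlapping prefixes before summing is our assumption here, made so that an address covered by two announced prefixes is not counted twice.

```python
import ipaddress


def ipv4_address_count(prefixes):
    """Sum the IPv4 addresses covered by a list of prefixes (IPv6 is ignored)."""
    networks = [ipaddress.ip_network(p, strict=False) for p in prefixes]
    v4 = [n for n in networks if n.version == 4]
    # collapse_addresses merges overlapping and adjacent networks.
    return sum(n.num_addresses for n in ipaddress.collapse_addresses(v4))


def as_phishing_attack_score(attacks: int, prefixes) -> float:
    """Phishing attacks per 10,000 IPv4 addresses routed by the AS."""
    addresses = ipv4_address_count(prefixes)
    if addresses == 0:
        raise ValueError("AS routes no IPv4 addresses")
    return attacks / addresses * 10_000
```

For instance, the documentation prefixes `192.0.2.0/24` and `198.51.100.0/25` together cover 384 addresses.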
To identify TLDs we begin with the IANA root zone list. We also use the Public Suffix List to identify the zones in which registries offer third level registration, for example, names assigned (delegated) from co.uk. For gTLD domain names we obtain registry WHOIS to identify the sponsoring registrar, along with the registrar’s IANA ID for normalization.
In our tables and measurements, the number of domains in each gTLD and the number of gTLD domains sponsored by each registrar are obtained from the monthly ICANN reports for the latest month available when we begin writing a report or quarterly summary. We also refer to NTLDSTATS.com for domains under management (DUM). ccTLD domain counts are obtained from the web sites of the registry operators and from DomainTools.
Phishing feeds
We consume two types of phishing reporting services (feeds): URL block lists (URLBLs) and domain block lists (DBLs or DNSBLs).
Our URL source feeds for phishing (APWG, OpenPhish, and PhishTank) identify a target brand for each report. These sources determine the target by heuristics (e.g., they parse the content of the email phishing lure or match the logo images and wording on the phishing site to the legitimate brand site content) or by manual verification.
We also use the Spamhaus DBL. This feed does not provide target information but it does classify domains according to the type of threat the domain is used to perpetrate. For phishing reporting, we use only the DBL response codes 127.0.1.4 (phish domain) and 127.0.1.104 (abused legit phish). We do not include Spamhaus DBL-listed domains when we analyze brand data.
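Filtering DBL answers down to the two phishing response codes named above can be sketched as follows. The function name is hypothetical, and the sketch assumes the A-record answers have already been retrieved from the DBL; it performs no DNS lookups itself.

```python
import ipaddress

# The two DBL response codes used for phishing reporting, as named in the text.
DBL_PHISH_CODES = {
    ipaddress.ip_address("127.0.1.4"): "phish domain",
    ipaddress.ip_address("127.0.1.104"): "abused legit phish",
}


def dbl_phish_labels(answers):
    """Map DBL A-record answers to phishing labels, ignoring all other codes."""
    labels = []
    for answer in answers:
        label = DBL_PHISH_CODES.get(ipaddress.ip_address(answer))
        if label is not None:
            labels.append(label)
    return labels
```

A domain whose answers yield an empty list is listed for some other threat type, or not listed at all, and is left out of the phishing data.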