Stripping out subdomains using regex in Hadoop Hive
I’ve been working on another query – the analysis of data usage across the network. As a result of this, I want to understand how time of day and the day of the month affect web usage. For example, what day is social media at its most popular? By understanding this, we can ascertain the best day and time to post new competitions, updates and images to the social platforms. Further to this, we can understand how usage is impacted regionally and with a little more information, we could even understand the differences in usage patterns between user age groups.
The query I put together is included below. This includes 3 regex (regular expression) statements:
- The first, strips the TLD from the domain (e.g. .com, .net, .co.uk and so on…).
- The second extracts the main bit of the domain (e.g. www.facebook.com becomes simply Facebook)
- The final regex extracts the subdomain (e.g. itunes.apple.com becomes itunes)
REGEXP_EXTRACT(domain,'([^.:]+):?[0-9]$') as tld,
REGEXP_EXTRACT(domain,'([^.:]+).[^.:]+:?[0-9]$') as cleandomain,
REGEXP_EXTRACT(domain,'([^.:]+).[^.:]+.[^.:]+:?[0-9]$') as subdomain,
date, hour, category, sum(bytes) as totalbytes, sum(bytes)/1024000 as mb, count() as hits
where date = 20170420
group by regexp_extract(domain,':%s/^.+.\ze[^.]+.[^.]+$//'), REGEXP_EXTRACT(domain,'([^.:]+):?[0-9]$'), date, hour, domain,
sort by bytes desc
I’ve separated these to enable holistic and granular analysis. So, if you want to know the amount of traffic across all Facebook domains (m.facebook.com, facebook.com, l.facebook.com etc..), you can simply search for ‘where cleandomain like ‘facebook’. However, the same data set will also enable you to concatenate the fields and look at usage across each of the Facebook subdomains – making this single query much more useful in a visualization tool such as Tableau than a full URL would have been.
In subsequent articles, I’ll be talking about how we can read, understand and build our own regex statements – ranging from the super simple to the relatively complex.