Stripping out subdomains using regex in Hadoop Hive

I’ve been working on another query – the analysis of data usage across the network. As a result of this, I want to understand how time of day and the day of the month affect web usage. For example, what day is social media at its most popular? By understanding this, we can ascertain the best day and time to post new competitions, updates and images to the social platforms. Further to this, we can understand how usage is impacted regionally and with a little more information, we could even understand the differences in usage patterns between user age groups.

The query I put together is included below. It removes the subdomain & just leaves the domain + domain extension. For example, ‘graph.facebook.com’ would become ‘facebook.com’.

SELECT hour, domain, count(*) as transactions,
CASE
WHEN SUBSTR(domain,-6) = '.co.uk' THEN regexp_extract(domain,'(?<=\.)[a-z0-9-]*\.co\.uk', 0) WHEN SUBSTR(domain,-4) = '.com' THEN regexp_extract(domain,'(?<=\.)[a-z0-9-]*\.com', 0) WHEN SUBSTR(domain,-3) = '.uk' THEN regexp_extract(domain,'(?<=\.)[a-z0-9-]*\.uk', 0) WHEN SUBSTR(domain,-4) = '.org' THEN regexp_extract(domain,'(?<=\.)[a-z0-9-]*\.org', 0) WHEN SUBSTR(domain,-4) = '.net' THEN regexp_extract(domain,'(?<=\.)[a-z0-9-]*\.net', 0) WHEN SUBSTR(domain,-6) = '.co.nz' THEN regexp_extract(domain,'(?<=\.)[a-z0-9-]*\.co\.nz', 0) WHEN SUBSTR(domain,-3) = '.ie' THEN regexp_extract(domain,'(?<=\.)[a-z0-9-]*\.ie', 0) WHEN SUBSTR(domain,-3) = '.it' THEN regexp_extract(domain,'(?<=\.)[a-z0-9-]*\.it', 0) WHEN SUBSTR(domain,-7) = '.gov.uk' THEN regexp_extract(domain,'(?<=\.)[a-z0-9-]*\.gov\.uk', 0) WHEN SUBSTR(domain,-5) = '.news' THEN regexp_extract(domain,'(?<=\.)[a-z0-9-]*\.news', 0) WHEN SUBSTR(domain,-6) = '.co.in' THEN regexp_extract(domain,'(?<=\.)[a-z0-9-]*\.co.in', 0) WHEN SUBSTR(domain,-3) = '.io' THEN regexp_extract(domain,'(?<=\.)[a-z0-9-]*\.io', 0) WHEN SUBSTR(domain,-3) = '.tw' THEN regexp_extract(domain,'(?<=\.)[a-z0-9-]*\.tw', 0) WHEN SUBSTR(domain,-3) = '.es' THEN regexp_extract(domain,'(?<=\.)[a-z0-9-]*\.es', 0) WHEN SUBSTR(domain,-3) = '.pe' THEN regexp_extract(domain,'(?<=\.)[a-z0-9-]*\.pe', 0) WHEN SUBSTR(domain,-3) = '.ca' THEN regexp_extract(domain,'(?<=\.)[a-z0-9-]*\.ca', 0) WHEN SUBSTR(domain,-3) = '.de' THEN regexp_extract(domain,'(?<=\.)[a-z0-9-]*\.de', 0) WHEN SUBSTR(domain,-3) = '.to' THEN regexp_extract(domain,'(?<=\.)[a-z0-9-]*\.to', 0) WHEN SUBSTR(domain,-3) = '.us' THEN regexp_extract(domain,'(?<=\.)[a-z0-9-]*\.us', 0) WHEN SUBSTR(domain,-3) = '.br' THEN regexp_extract(domain,'(?<=\.)[a-z0-9-]*\.br', 0) WHEN SUBSTR(domain,-3) = '.im' THEN regexp_extract(domain,'(?<=\.)[a-z0-9-]*\.im', 0) WHEN SUBSTR(domain,-3) = '.ws' THEN regexp_extract(domain,'(?<=\.)[a-z0-9-]*\.ws', 0) WHEN SUBSTR(domain,-3) = '.gr' THEN regexp_extract(domain,'(?<=\.)[a-z0-9-]*\.gr', 0) WHEN SUBSTR(domain,-3) = '.cc' THEN regexp_extract(domain,'(?<=\.)[a-z0-9-]*\.cc', 0) WHEN SUBSTR(domain,-3) = '.cn' THEN regexp_extract(domain,'(?<=\.)[a-z0-9-]*\.cn', 0) WHEN SUBSTR(domain,-3) = '.me' THEN regexp_extract(domain,'(?<=\.)[a-z0-9-]*\.me', 0) WHEN SUBSTR(domain,-3) = '.be' THEN regexp_extract(domain,'(?<=\.)[a-z0-9-]*\.be', 0) WHEN SUBSTR(domain,-3) = '.tv' THEN regexp_extract(domain,'(?<=\.)[a-z0-9-]*\.tv', 0) WHEN SUBSTR(domain,-3) = '.ru' THEN regexp_extract(domain,'(?<=\.)[a-z0-9-]*\.ru', 0) WHEN SUBSTR(domain,-3) = '.cz' THEN regexp_extract(domain,'(?<=\.)[a-z0-9-]*\.cz', 0) WHEN SUBSTR(domain,-3) = '.st' THEN regexp_extract(domain,'(?<=\.)[a-z0-9-]*\.st', 0) WHEN SUBSTR(domain,-3) = '.eu' THEN regexp_extract(domain,'(?<=\.)[a-z0-9-]*\.eu', 0) WHEN SUBSTR(domain,-3) = '.fi' THEN regexp_extract(domain,'(?<=\.)[a-z0-9-]*\.fi', 0) WHEN SUBSTR(domain,-3) = '.jp' THEN regexp_extract(domain,'(?<=\.)[a-z0-9-]*\.jp', 0) WHEN SUBSTR(domain,-3) = '.ai' THEN regexp_extract(domain,'(?<=\.)[a-z0-9-]*\.ai', 0) WHEN SUBSTR(domain,-3) = '.at' THEN regexp_extract(domain,'(?<=\.)[a-z0-9-]*\.at', 0) WHEN SUBSTR(domain,-3) = '.ch' THEN regexp_extract(domain,'(?<=\.)[a-z0-9-]*\.ch', 0) WHEN SUBSTR(domain,-3) = '.ly' THEN regexp_extract(domain,'(?<=\.)[a-z0-9-]*\.ly', 0) WHEN SUBSTR(domain,-3) = '.fr' THEN regexp_extract(domain,'(?<=\.)[a-z0-9-]*\.fr', 0) WHEN SUBSTR(domain,-3) = '.nl' THEN regexp_extract(domain,'(?<=\.)[a-z0-9-]*\.nl', 0) WHEN SUBSTR(domain,-3) = '.se' THEN regexp_extract(domain,'(?<=\.)[a-z0-9-]*\.se', 0) WHEN SUBSTR(domain,-4) = '.cat' THEN regexp_extract(domain,'(?<=\.)[a-z0-9-]*\.cat', 0) WHEN SUBSTR(domain,-7) = '.com.au' THEN regexp_extract(domain,'(?<=\.)[a-z0-9-]*\.com\.au', 0) WHEN SUBSTR(domain,-7) = '.com.ar' THEN regexp_extract(domain,'(?<=\.)[a-z0-9-]*\.com\.ar', 0) WHEN SUBSTR(domain,-7) = '.com.mt' THEN regexp_extract(domain,'(?<=\.)[a-z0-9-]*\.com\.mt', 0) WHEN SUBSTR(domain,-7) = '.com.co' THEN regexp_extract(domain,'(?<=\.)[a-z0-9-]*\.com\.co', 0) WHEN SUBSTR(domain,-7) = '.org.uk' THEN regexp_extract(domain,'(?<=\.)[a-z0-9-]*\.org\.uk', 0) WHEN SUBSTR(domain,-7) = '.com.mx' THEN regexp_extract(domain,'(?<=\.)[a-z0-9-]*\.com\.mx', 0) WHEN SUBSTR(domain,-5) = '.tech' THEN regexp_extract(domain,'(?<=\.)[a-z0-9-]*\.tech', 0) WHEN SUBSTR(domain,-5) = '.life' THEN regexp_extract(domain,'(?<=\.)[a-z0-9-]*\.life', 0) WHEN SUBSTR(domain,-5) = '.mobi' THEN regexp_extract(domain,'(?<=\.)[a-z0-9-]*\.mobi', 0) WHEN SUBSTR(domain,-5) = '.info' THEN regexp_extract(domain,'(?<=\.)[a-z0-9-]*\.info', 0) WHEN SUBSTR(domain,-6) = '.ninja' THEN regexp_extract(domain,'(?<=\.)[a-z0-9-]*\.ninja', 0) WHEN SUBSTR(domain,-6) = '.today' THEN regexp_extract(domain,'(?<=\.)[a-z0-9-]*\.today', 0) WHEN SUBSTR(domain,-6) = '.earth' THEN regexp_extract(domain,'(?<=\.)[a-z0-9-]*\.earth', 0) WHEN SUBSTR(domain,-6) = '.click' THEN regexp_extract(domain,'(?<=\.)[a-z0-9-]*\.click', 0) else domain END as cleaned_domain FROM database.table where date =''' +now_date+''' and hour = ''' +now_hour+ ''' group by hour, domain