Creating a new table from a file in Hadoop Hive

Today, I was given a beta version of a data feed that will eventually be consumed by the Hadoop platform. As the feed hasn’t been configured to load into the platform yet, there was no way to query the data and check that it contained all the raw data we’d eventually need to extract insights and run vital reports for business consumers.

The data I was given was a feed from the test environment, in CSV format. For it to be useful, I wanted to load it into the Hadoop cluster and run some queries, calculations and aggregations on it using Hue. To do this, I needed to create a new table and populate it with my shiny new data set. So, I created a new folder in my user area called /newfeed and uploaded the CSV to this directory.

I then opened the Hive query editor and executed the query below:
create external table mynewtablename(
result string,
time_hour int,
time_minute int,
time_second int,
time_millisecond int,
duration int,
networkname string,
userid string,
total_bytes int,
total_hits int
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ","
location '/user/username/newfeed/'
;

In the above query, it’s important to note a few things:

  • mynewtablename should be replaced with your desired table name
  • '/user/username/newfeed/' should be replaced with the directory (not the file itself) in which you placed your CSV
  • ROW FORMAT DELIMITED FIELDS TERMINATED BY "," sets the field delimiter to a comma for the CSV file
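One thing the query above does not handle is a header row. The post doesn't say whether the feed has one, so this is only a sketch under that assumption: if the first line of your CSV holds column names, Hive would otherwise read it as a data row (with NULLs in the int columns). The skip.header.line.count table property tells Hive to ignore it:

```sql
-- Assumption: the CSV starts with a single header line.
-- Skip that line whenever the table is read.
ALTER TABLE mynewtablename
SET TBLPROPERTIES ("skip.header.line.count"="1");
```

If your file has no header line, leave this out, otherwise the first real record will be skipped.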

If your CSV is in the correct directory and its columns match the table definition (same order, compatible types), the data will be available as soon as you query the table – simple, huh?
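To check that the load worked, a few quick queries along these lines can help. This is a minimal sketch using the table and column names from the query above; what counts as a sensible result depends on your own feed, and the per-network aggregation is just an illustrative example.

```sql
-- Eyeball the first few rows to confirm the columns line up
SELECT * FROM mynewtablename LIMIT 10;

-- Row count, to compare against the number of lines in the CSV
SELECT COUNT(*) AS row_count FROM mynewtablename;

-- Values that can't be parsed as int are read as NULL, so count them
-- (genuinely empty fields will show up here too)
SELECT COUNT(*) AS null_duration
FROM mynewtablename
WHERE duration IS NULL;

-- Example aggregation: total traffic per network
SELECT networkname,
       SUM(total_bytes) AS bytes,
       SUM(total_hits)  AS hits
FROM mynewtablename
GROUP BY networkname;
```

A large number of NULLs in a numeric column usually points to a column order or type mismatch between the CSV and the table definition.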

Kieran Keene

Join me on this career development project as I set out to develop the skills required to progress up the technology career ladder! Check out http://netshock.co.uk/about/ to find out more.