Journey of Data: CSV → Postgres

Hasan Özdemir
5 min read · Apr 4, 2023

Welcome to this blog, where you’ll learn how to load CSV data into a Postgres database and improve data quality by removing vertical string duplicates. I’ll also demonstrate a basic query comparison with and without an index, so you can optimize your queries and achieve faster results.

Prerequisites

  • Postgres
  • psql or DataGrip or pgAdmin
  • the dataset (linked below)

Gathering Data

This dataset contains the daily average temperatures of cities around the world, along with their country and region. Follow the link to download the Temperature of All Countries (1995–2020) dataset.

Creating a User and Database

Throughout this post, I’ll be using psql as my primary tool for managing data in Postgres. The good news is that you can easily copy the code snippets and try them out in your preferred Postgres management tool, such as DataGrip or pgAdmin.

-- log in as the postgres superuser (then type the password)
psql -h localhost -p 5432 -U postgres postgres

-- create a user
create user hasanozdem1r with password '12345';

-- create the database
create database city_temperature with owner 'hasanozdem1r';

-- exit the superuser session
\q

-- connect as the new user
psql -h localhost -p 5432 -U hasanozdem1r city_temperature
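
As a quick check, psql’s \conninfo meta-command prints the current connection details, so you can confirm you are connected to city_temperature as the new user:

-- verify the current connection
\conninfo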

Load raw CSV data into Postgres

We should inspect the CSV columns to choose the table fields and their types properly.

CSV Table Content
  • region: Categorical, duplicates, str
  • country: Categorical, duplicates, str
  • state: Categorical, duplicates, str
  • city: Categorical, duplicates, str
  • month: integer
  • day: integer
  • year: integer
  • avgtemperature: decimal
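
If you want to peek at the raw file before designing the table, psql can shell out with the \! meta-command (this assumes city_temperature.csv sits in your current working directory):

-- show the header line and the first two data rows
\! head -n 3 city_temperature.csv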

We need to create a raw table that properly accommodates all of the existing columns.

-- create table
create table if not exists temperature_raw(
  id serial,
  region_id integer,
  region varchar(64),
  country_id integer,
  country varchar(64),
  state_id integer,
  state_name varchar(64),
  city_id integer,
  city varchar(64),
  month_of_year integer,
  day_of_year integer,
  year integer,
  avg_temperature decimal(10,2)
);
-- investigate table
\d+ temperature_raw
-- load the CSV into temperature_raw (a psql meta-command must stay on one line)
\copy temperature_raw(region,country,state_name,city,month_of_year,day_of_year,year,avg_temperature) from 'city_temperature.csv' with (format csv, delimiter ',', header)

\copy moves data between PostgreSQL tables and standard file-system files. Unlike server-side COPY, it reads the file on the client machine running psql, so you don’t need access to the server’s file system.

Take the following details into consideration when using the \copy command:

  1. The table must already exist.
  2. The table fields must match the CSV columns. If you don’t list the fields, they are mapped automatically from left to right, but defining them explicitly makes the mapping easier to verify.
  3. Before loading any CSV, investigate the data and define proper field types; the load will raise an error if a value doesn’t fit the declared type.
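
Once the load completes, a quick sanity check confirms that the rows arrived:

-- row count should match the number of data rows in the CSV
select count(*) from temperature_raw;
-- eyeball a few rows
select * from temperature_raw limit 3;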

Normalize Raw Table

Vertical duplication occurs when the same value is repeated across many rows of a table, which wastes storage and invites inconsistencies. An effective way to avoid it is normalization: breaking the data into smaller tables and linking them with foreign keys, so that each piece of information is stored only once. Normalizing your data leads to a more efficient and consistent database design.

The final table structure has the following relationships:

  • Region -> Temperature (One to Many)
  • Country -> Temperature (One to Many)
  • State -> Temperature (One to Many)
  • City -> Temperature (One to Many)

[Diagram: Table Relationships]

-- region table
create table Region(
  id serial,
  name varchar(128),
  primary key(id)
);
-- country table
create table Country(
  id serial,
  name varchar(128),
  primary key(id)
);
-- state table
create table State(
  id serial,
  name varchar(128),
  primary key(id)
);
-- city table
create table City(
  id serial,
  name varchar(128),
  primary key(id)
);
-- temperature table
create table Temperature(
  id serial,
  region_id integer references Region(id) on delete cascade,
  country_id integer references Country(id) on delete cascade,
  state_id integer references State(id) on delete cascade,
  city_id integer references City(id) on delete cascade,
  month_of_year integer,
  day_of_year integer,
  year integer,
  avg_temperature decimal(10,2),
  primary key(id)
);
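
You can verify the new schema, including the foreign-key references, with psql’s \d meta-command:

-- inspect the Temperature table definition
\d temperature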

Loading data into related tables

-- city
insert into city(name) select distinct city from temperature_raw;
-- state
insert into state(name) select distinct state_name from temperature_raw;
-- region
insert into region(name) select distinct region from temperature_raw;
-- country
insert into country(name) select distinct country from temperature_raw;
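
Because select distinct deduplicates during the insert, each lookup table should end up with one row per unique name. A quick count confirms it:

-- row counts of the lookup tables
select
  (select count(*) from region) as regions,
  (select count(*) from country) as countries,
  (select count(*) from state) as states,
  (select count(*) from city) as cities;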

Update foreign keys by using lookup tables

-- update city_id
update temperature_raw set city_id=
(select city.id from city where city.name=temperature_raw.city limit 1);
-- update state_id
update temperature_raw set state_id=
(select state.id from state where state.name=temperature_raw.state_name limit 1);
-- update region_id
update temperature_raw set region_id=
(select region.id from region where region.name=temperature_raw.region limit 1);
-- update_country_id
update temperature_raw set country_id=
(select country.id from country where country.name=temperature_raw.country limit 1);
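
These correlated subqueries run once per row, which can be slow on a table of this size. An equivalent update ... from formulation joins the lookup table directly and is usually faster; here is a sketch for city_id (the other columns follow the same pattern):

-- join the lookup table instead of running a subquery per row
update temperature_raw t
set city_id = c.id
from city c
where c.name = t.city;

One caveat: if a column such as state_name is NULL for some rows, the equality comparison never matches and state_id stays NULL, so those rows will be dropped later by inner joins; a left join may then be the safer choice when querying.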

Load updated data to Temperature table

insert into Temperature(
region_id,
country_id,
state_id,
city_id,
month_of_year,
day_of_year,
year,
avg_temperature)
select
region_id,
country_id,
state_id,
city_id,
month_of_year,
day_of_year,
year,
avg_temperature
from temperature_raw;
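
The temperature table should now hold the same number of rows as temperature_raw. Once you have verified the copy, the staging table can optionally be dropped to reclaim space:

-- both counts should match
select count(*) from temperature;
select count(*) from temperature_raw;
-- optional cleanup of the staging table
-- drop table temperature_raw;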

Let’s look at our data

select region.name as region,
       country.name as country,
       state.name as state,
       city.name as city,
       month_of_year,
       day_of_year,
       year,
       avg_temperature
from temperature
join region on temperature.region_id = region.id
join country on temperature.country_id = country.id
join state on temperature.state_id = state.id
join city on temperature.city_id = city.id
order by country.name
limit 5;

[Screenshot: SELECT Query Result]

Analyze Query without Index

explain analyze
select region.name as region,
       country.name as country,
       state.name as state,
       city.name as city,
       month_of_year,
       day_of_year,
       year,
       avg_temperature
from temperature
join region on temperature.region_id = region.id
join country on temperature.country_id = country.id
join state on temperature.state_id = state.id
join city on temperature.city_id = city.id
order by country.name
limit 5;

The following screenshot displays the result of the above query.

Analyze Query with Index

-- index for region_id
create index idx_temperature_region_id on Temperature (region_id);
-- index for country_id
create index idx_temperature_country_id on Temperature (country_id);
-- index for state_id
create index idx_temperature_state_id on Temperature (state_id);
-- index for city_id
create index idx_temperature_city_id on Temperature (city_id);
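
After creating the indexes, it is worth refreshing the planner’s statistics so it can make an informed choice between a sequential scan and an index scan:

-- refresh planner statistics for the table
analyze temperature;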

NOTE: Creating too many indexes can slow down write operations, so it’s important to only create indexes that are necessary for your queries.

The following screenshot displays the result of re-running the same EXPLAIN ANALYZE query after the indexes were created.

Planning Time shown by EXPLAIN ANALYZE is the time it took to generate the query plan from the parsed query and optimize it. It does not include parsing or rewriting.

Execution Time shown by EXPLAIN ANALYZE includes executor start-up and shut-down time, as well as the time to run any triggers that are fired, but it does not include parsing, rewriting, or planning time.
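
If the plans look nearly identical with and without the indexes, it is usually because the planner estimates that one sequential scan over the whole table is cheaper than millions of individual index lookups; our query joins every row before sorting and applying the limit. As an experiment, you can discourage sequential scans for the current session and re-run the EXPLAIN ANALYZE query to compare plans:

-- discourage sequential scans for this session (an experiment, not a production setting)
set enable_seqscan = off;
-- ... re-run the explain analyze query here ...
set enable_seqscan = on;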

Questions?

We have a total of 2,906,327 rows in the temperature table.

  1. We observed almost identical performance with and without the indexes. How can we improve query performance?
  2. What is a sequential scan?
  3. What would your approach be for converting CSV data into tables?
