31 August 2019

If we can't tidy up the NZ emissions trading scheme can we tidy up the dataframe of the free allocation of units

I do some data cleaning and tidy up the EPA's table of 2018 free giveaway emissions units.

In this follow up post about the EPA's non-tidy table of 2018 industrial allocation/free giveaway of emissions units, I use a great opensource programme OpenRefine to tidy the allocation data into a 'tidy' format of 'each variable is a column, each row is a an observation and each cell is a value'.

I have recorded my commands as much for my own benefit in the future if and when I try to replicate the commands. There's a joke in the reproducible research online community "the hardest person to email questions to is yourself three years ago".

Assuming you have installed OpenRefine to your Linux Debian based operating system, open a terminal window and type:

cd /home/user/Refine/openrefine-3.2/

Type './refine' and press enter

Wait for Firefox to start at IP http://127.0.0.1:3333/ which will start OpenRefine

Go to the Google sheet obtained from the EPA webpage https://docs.google.com/spreadsheets/d/1arfDpqiXg84SwTAiY5TWzDxOJNrnJxvG9GgyBY8jCRM/ and download the .csv file to '/home/user/Downloads'

Copy the downloaded .csv file to /home/user/Refine/openrefine-2.6-beta.1/

Go back to Firefox and enter "http://127.0.0.1:3333/" into the address bar. That will open OpenRefine.

In OpenRefine, select the button "Create a project by importing data"

Browse to and select /Refine/openrefine-2.6-beta.1/NZ-emission-unit-industrial-allocation-decisions-EPA-2018 - Sheet1.csv

Click on 'Next' button and 'create new project'

Select and tick 'ignore first line at the beginning of the file

Tick 'Parse next 1 line as column headers'

Click on 'create new project'

We should have 108 rows of data - look at the second column, it mixes two variables, 'Applicants name' and 'Activity'

Edit column - Add new column 'Activity' based on column *Activity and Applicant's Name* - add name 'Activity'

write " if(value.startsWith("*"), value[1,37],"")" into the Expression box. That moves only the activities into their own column.

Edit column, Add column based on column Applicants Name called 'Name'

in the Expression box , leave 'value' in box and copy the column by selecting 'ok'

Select Activity column, edit cells, fill down (fills all Activities to empty cells)

Edit column - Add new column 'Year' based on column *Activity and Applicant's Name* - add name 'Year' and value '2018' in expression box

Select the header *2018 Final Unit Entitlement*, edit cells, common transforms, to number

Select the header *2018 Final Unit Entitlement*, select Facet, numeric facet, go to left side of dashboard, untick 'numeric' box, leave 'blank' box (24 records) ticked,

Select column 'All', then Edit rows, remove all matching rows (that leaves 84 rows with no blank cells in *2018 Final Unit Entitlement*)

Select column *Activity and Applicant's Name*, Edit column, Remove this column

Select column *Activity", Edit column - Add new column 'Activity' based on column 'Activity2' - add name 'Activity' and expression in box enter value.replace("*","") - to remove the *. And we have a tidy data table!

Click on the data project name at the top and just right of the "OpenRefine" label "NZ-emission-unit-industrial-allocation-decisions-EPA-2010-2018-Sheet1-csv". Change the name to "NZ emission unit industrial allocation decisions EPA 2018 tidy"

Select 'Export' (in top right corner) as a .csv file

Upload the csv file to Google Drive via the Gdrive command line utility

Open a xterminal window, for first upload enter;

gdrive upload /home/user/Refine/NZ-emission-unit-industrial-allocation-decisions-EPA-2018-tidy.csv

In Google sheets i changed the file's name to "NZ-emission-unit-industrial-allocation-decisions-EPA-2018-tidy" and I 'shared' the file to 'public'.

Download the file to your computer and open it with a spreadsheet program such as gnumeric.

We can see that we now have a tidy dataframe where each variable is a column, each row is a an observation and each cell is a value - the number of emissions units given away for free to greenhouse gas emitters under the NZ emissions trading scheme.

No comments:

Post a Comment