Exercise 3: Creating a transformation¶
We will be using the files containing the results of previous final exams, to create new information. This information will be used for decision making, statistics, etc. In the first task, we will create a transformation, which takes the files, parses them, creates yearly and other statistics.
-
If you have started the Pentaho Data Integration correctly, the following screen will appear:
-
Create a new transformation, by choosing File -> New -> Transformation. Save this file named
ex3-transformation1.ktr
, in the working directory. -
Using the search bar on the left panel, place a Get file names node on the screen.
-
Double click on the placed node to open the context menu:
- In the panel above the table, we can define a file or folder to use. When done, pressing the Add button will put it in the table.
- The browse button opens a file picker, if you want to pick a directory, you first choose a file, then delete the last part of the path manually. Choose C:\Pentaho\data\ZV\
- Use the regular expresion input, to limit the search for csv files. Use the following pattern: .*csv To make sure you have correctly set up the node, press the Preview Rows button, you should see something like this
-
Place a CSV File input node on the transformation. To connect it with our Get File Names node, hover over the former and select the rightmost box.
The connection should look like this:
Be careful!
If you click on the arrow, it will turn grey, indicating the connection is disabled! You should never see this:
-
Double click on the “CSV File Input” node and set the following:
- The filename field -> filename
- Include filename field -> check (we will use it to parse the date)
- Delimiter -> ;
- Enclosure -> delete the content
- Lazy conversion -> uncheck
- Add the columns from the file (Identifier - Integer, Neptun code - String, Exam score - Integer, Application score - Integer, Grade - Number)
Running the whole transformation now (by pressing the play button), should lead to both our nodes finishing successfully. When clicking on the Preview data tab on the bottom, we should see the following:
-
We want to include the date of the exam for each data entry. To do this we have to parse this information from the filename field. The first step is to cut the string to the necessary length, by using a String cut node. Place it and connect it to the CSV File input. Double clicking on it should bring up the following screen:
- Set the In stream field to filename,
- Set the Out stream field to Date, this will make the node put its output (the cut stirng) in a new field.
- Set the Cut from and Cut to, to cut the filename from the absolute path (19 and 25 if you use C:\Pentaho as the working directory)
-
Place a Replace in string node and connect it to the previous node. Double click and set it up to replace the _ (underscore) with a / (slash) in the field Date:
The transformation looks like the following now:
If everything was done correctly, you should see the following output in the Preview data.
-
Place a Filter rows node, we will use this to filter out the rows where the Grade (Osztályzat) is invalid. The Filter node has a hierarchical Boolean expression builder. Set it up as the following:
-
Place a Data validator node. We will use this to validate the proper formatting of the Neptun code. In order to do this, double click on the node, and add a new validation rule by clicking on the New validation button.
- Check the Report all errors, not only the first
- Select Neptun code in the Name of the field to validate
- Put this regular expression in the Regular expression expected to match cell: [A-Z0-9]{6}
After running the transformation, we should see that our validator has found an error:
-
To proceed, we have to handle this error. The easiest way to do this, is to append a Text file output node to the Error handling of the step output of the Data Validator.
In the Text File Output, we define an output file, for example C:\Pentaho\result\invalid_neptun.txt
If done correctly, the transformation should no longer fail, and the output file should contain the invalid rows:
-
Next, we place a Sort Rows node. This will be used to presort the data for our Group by node, which only works on data sorted by the grouping key. After placing and connecting the node, set the Fieldname to Date, and leave the other cells on default values.
IMPORTANT
If Pentaho asks whether to copy or distribute the data as the output of this step choose copy.
-
Place a Group By node and connect it to the Sort Rows node.
Set it up as the following:
- Group field (the key which will be used to create the groups): Date
- Aggregates (the functions ran on the groups)
- Name: Number of students, Subject: -, Type: Number of rows (without field argument)
- Name: Average grade, Subject: Grade, Type: Average (Mean)
If everything is working correctly, you should see the aggregated results at the Preview data tab of the Group by.
-
In this next part, we will be creating a report on those students, who took the final exam more than once. For this, we will be creating a new path in the transformation. Place a Filter rows node after our previously placed Data Validator. This filter will filter out those rows, which doesn’t contain a Neptun code. This is necessary, because Neptun is the only way to identify a student between multiple exams. The node is configured as follows:
-
Next we place a Sort rows node, which will presort the date for our uniqueness checker (in much the same way as for the Group by). Configure this sorting step as the following:
Write your NEPTUN code after the step name!
-
Place a Unique rows node after the sort and set it up as the following (note that Redirect duplicate row is checked).
-
Finally place a Microsoft Excel Output at the Error handling of the Unique rows. This will place the duplicate rows (students who took more than one final exam) in this Excel sheet. Set it up to save the file C:\Pentaho\result\multiple_exams.xls
The final transformation should look like this:
Warning!
Note, that opening the file will cause Excel to lock it, therefore any subsequent runs will fail, while the file is open!
SUBMISSION
Save the final transformation as ex3-transformation1.ktr
Create a screenshot of the executed transformation flow and save it as ex3-transformation1-flow.png
. Make sure that all the nodes including your Neptun code are visible on this screenshot.