Skip to content

Exercise 3: Creating a transformation

We will be using the files containing the results of previous final exams, to create new information. This information will be used for decision making, statistics, etc. In the first task, we will create a transformation, which takes the files, parses them, creates yearly and other statistics.

  1. If you have started the Pentaho Data Integration correctly, the following screen will appear:

    Pentaho Data Integration

  2. Create a new transformation, by choosing File -> New -> Transformation. Save this file named ex3-transformation1.ktr, in the working directory.

  3. Using the search bar on the left panel, place a Get file names node on the screen.

    Get file names search

  4. Double click on the placed node to open the context menu:

    Get file names

    • In the panel above the table, we can define a file or folder to use. When done, pressing the Add button will put it in the table.
    • The browse button opens a file picker, if you want to pick a directory, you first choose a file, then delete the last part of the path manually. Choose C:\Pentaho\data\ZV\
    • Use the regular expresion input, to limit the search for csv files. Use the following pattern: .*csv To make sure you have correctly set up the node, press the Preview Rows button, you should see something like this

    Get file names list

  5. Place a CSV File input node on the transformation. To connect it with our Get File Names node, hover over the former and select the rightmost box.

    Hover

    The connection should look like this:

    Connection

    Be careful!

    If you click on the arrow, it will turn grey, indicating the connection is disabled! You should never see this:

    Connection disbaled

  6. Double click on the “CSV File Input” node and set the following:

    • The filename field -> filename
    • Include filename field -> check (we will use it to parse the date)
    • Delimiter -> ;
    • Enclosure -> delete the content
    • Lazy conversion -> uncheck
    • Add the columns from the file (Identifier - Integer, Neptun code - String, Exam score - Integer, Application score - Integer, Grade - Number)

    CSV input

    Running the whole transformation now (by pressing the play button), should lead to both our nodes finishing successfully. When clicking on the Preview data tab on the bottom, we should see the following:

    CSV preview

  7. We want to include the date of the exam for each data entry. To do this we have to parse this information from the filename field. The first step is to cut the string to the necessary length, by using a String cut node. Place it and connect it to the CSV File input. Double clicking on it should bring up the following screen:

    String cut

    • Set the In stream field to filename,
    • Set the Out stream field to Date, this will make the node put its output (the cut stirng) in a new field.
    • Set the Cut from and Cut to, to cut the filename from the absolute path (19 and 25 if you use C:\Pentaho as the working directory)
  8. Place a Replace in string node and connect it to the previous node. Double click and set it up to replace the _ (underscore) with a / (slash) in the field Date:

    Replace in string

    The transformation looks like the following now:

    Transformation1

    If everything was done correctly, you should see the following output in the Preview data.

    Preview data

  9. Place a Filter rows node, we will use this to filter out the rows where the Grade (Osztályzat) is invalid. The Filter node has a hierarchical Boolean expression builder. Set it up as the following:

    Filter rows

  10. Place a Data validator node. We will use this to validate the proper formatting of the Neptun code. In order to do this, double click on the node, and add a new validation rule by clicking on the New validation button.

    Data validator

    • Check the Report all errors, not only the first
    • Select Neptun code in the Name of the field to validate
    • Put this regular expression in the Regular expression expected to match cell: [A-Z0-9]{6}

    After running the transformation, we should see that our validator has found an error:

    Transformation2

  11. To proceed, we have to handle this error. The easiest way to do this, is to append a Text file output node to the Error handling of the step output of the Data Validator.

    Transformation3

    In the Text File Output, we define an output file, for example C:\Pentaho\result\invalid_neptun.txt

    Text file output

    If done correctly, the transformation should no longer fail, and the output file should contain the invalid rows:

    Invalid neptun

  12. Next, we place a Sort Rows node. This will be used to presort the data for our Group by node, which only works on data sorted by the grouping key. After placing and connecting the node, set the Fieldname to Date, and leave the other cells on default values.

    IMPORTANT

    If Pentaho asks whether to copy or distribute the data as the output of this step choose copy.

    Sort rows

  13. Place a Group By node and connect it to the Sort Rows node.

    Transformation4

    Set it up as the following:

    • Group field (the key which will be used to create the groups): Date
    • Aggregates (the functions ran on the groups)
      • Name: Number of students, Subject: -, Type: Number of rows (without field argument)
      • Name: Average grade, Subject: Grade, Type: Average (Mean)

    Group by

    If everything is working correctly, you should see the aggregated results at the Preview data tab of the Group by.

  14. In this next part, we will be creating a report on those students, who took the final exam more than once. For this, we will be creating a new path in the transformation. Place a Filter rows node after our previously placed Data Validator. This filter will filter out those rows, which doesn’t contain a Neptun code. This is necessary, because Neptun is the only way to identify a student between multiple exams. The node is configured as follows:

    Filter rows

  15. Next we place a Sort rows node, which will presort the date for our uniqueness checker (in much the same way as for the Group by). Configure this sorting step as the following:

    Write your NEPTUN code after the step name!

    Sort rows

  16. Place a Unique rows node after the sort and set it up as the following (note that Redirect duplicate row is checked).

    Unique rows

  17. Finally place a Microsoft Excel Output at the Error handling of the Unique rows. This will place the duplicate rows (students who took more than one final exam) in this Excel sheet. Set it up to save the file C:\Pentaho\result\multiple_exams.xls

    Excel output

    The final transformation should look like this:

    Final transformation

Warning!

Note, that opening the file will cause Excel to lock it, therefore any subsequent runs will fail, while the file is open!

SUBMISSION

Save the final transformation as ex3-transformation1.ktr

Create a screenshot of the executed transformation flow and save it as ex3-transformation1-flow.png. Make sure that all the nodes including your Neptun code are visible on this screenshot.

Back to top