# Exercise 3: Initializing Elasticsearch
Now that both Elasticsearch and Kibana are operational, let us create the necessary indices for us to work with in the following exercises.
## Index a document via PowerShell
First, we are going to use Elasticsearch's REST API through PowerShell.
1. To index a document in Elasticsearch, issue the following command.

    ```powershell
    (Invoke-WebRequest 'http://localhost:9200/test/_doc/1?pretty' -Method Put -ContentType 'application/json' -Body '{ "name": "John Doe" }' -UseBasicParsing).Content
    ```

    This way, we inserted a document of type `_doc` into the index called `test` with id `1`. The response JSON should state `"result": "created"`.

2. Query the document with the following command.

    ```powershell
    (Invoke-WebRequest 'http://localhost:9200/test/_doc/1?pretty' -Method Get -UseBasicParsing).Content
    ```

    The result JSON tells us the name of the index, the document's id, and the entire document we inserted in the `_source` field.

    ```json
    {
      "_index": "test",
      "_type": "_doc",
      "_id": "1",
      "_version": 1,
      "_seq_no": 0,
      "_primary_term": 1,
      "found": true,
      "_source": {
        "name": "John Doe"
      }
    }
    ```
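The same `Invoke-WebRequest` pattern works for any other endpoint of the REST API. As an optional extra (not needed for the exercise), the following request returns the overall cluster health the same way:

```powershell
# Optional: query the cluster health through the same REST API
(Invoke-WebRequest 'http://localhost:9200/_cluster/health?pretty' -Method Get -UseBasicParsing).Content
```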
## Create an index and index a document using Kibana
In this part of the exercise, we will create an index for documents containing information about people working in the fast-food industry. Here is a sample document.
### Sample document
When using this sample document, make sure to replace the Neptun code with yours, all uppercase, in the `gender` and `company` fields. The final values should look like `ABC123 female` and `ABC123 Subway`, respectively.
```json
{
  "gender": "NEPTUN female",
  "firstName": "Evelyn",
  "lastName": "Petersen",
  "age": 17,
  "phone": "+1 (900) 503-3892",
  "address": {
    "zipCode": 63775,
    "state": "NY",
    "city": "Lynn",
    "street": "Clarkson Avenue",
    "houseNumber": 503
  },
  "salary": 87217,
  "company": "NEPTUN Subway",
  "email": "evelyn.petersen@subway.com",
  "hired": "09/29/2009"
}
```
We are going to use Kibana's Dev Tools for this part of the exercise. It calls the same REST API we used through PowerShell, but provides a more convenient editor for composing and running queries.
1. A query in Kibana's Dev Tools consists of an HTTP verb and a URL matching Elasticsearch's REST API on the first line, followed by a JSON body. Copy the text below, then press the Play button in the top right corner of the editor.

    ```
    PUT salaries
    {
      "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0
      },
      "mappings": {
        "properties": {
          "gender": { "type": "keyword" },
          "address.state": { "type": "keyword" },
          "company": { "type": "keyword" },
          "hired": { "type": "date", "format": "MM/dd/yyyy" }
        }
      }
    }
    ```

    The settings we use here are the following.

    - `settings`: We set the number of shards and replicas here. While setting the number of shards is not that important here, we must set the number of replicas to zero to get an index with green health. Elasticsearch refuses to put a shard and its replica on the same node, and we only have a single node.
    - `mappings`: The mapping is the "schema" of the data. It is not mandatory to set it, but unless we specify the mapping, Elasticsearch decides on its own how to interpret data when it is ambiguous. (An optional query to check the stored mapping is sketched after this list.)
        - `gender`, `address.state`, `company`: We know these fields will only ever contain a few select values (e.g., "male" and "female" for gender), therefore we do not want to allow free-text search on them. We can help the system by specifying this.
        - `hired`: Although this is a date field, its representation is not standard, so Elasticsearch would not recognize it by itself. Therefore we have to specify the date format explicitly.
2. We can check the indices with the `GET _cat/indices?v` query. (Use the Dev Tools to execute this query too.)

    Note how the `test` index's health is yellow, while the health of the `salaries` index is green. That is because the default number of replicas is 1, and our single node cannot host both a shard of the `test` index and its replica. (An optional way to turn the `test` index green is also sketched after this list.)
3. Insert the sample document into the created index.

    Before executing this query, do not forget to edit the Neptun code in the `gender` and `company` fields.

    ```
    POST salaries/_doc
    {
      "gender": "NEPTUN female",
      "firstName": "Evelyn",
      "lastName": "Petersen",
      "age": 17,
      "phone": "+1 (900) 503-3892",
      "address": {
        "zipCode": 63775,
        "state": "NY",
        "city": "Lynn",
        "street": "Clarkson Avenue",
        "houseNumber": 503
      },
      "salary": 87217,
      "company": "NEPTUN Subway",
      "email": "evelyn.petersen@subway.com",
      "hired": "09/29/2009"
    }
    ```

    Executing the query will yield a similar result (on the right side of the window). This is the response of the POST query with the id of the inserted document.

    We can use the `_id` value from the response to query the document (your generated `_id` will be different):

    ```
    GET salaries/_doc/eZSmaGkBig5GeeBFsFG6
    ```
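Optionally (this is not part of the required steps), you can paste the following two requests into Dev Tools as well. The first echoes back the mapping Elasticsearch stored for `salaries`; the second, if you would like the `test` index to report green health too, lowers its replica count to zero, mirroring what we did for `salaries`.

```
# Optional: show the mapping stored for the salaries index
GET salaries/_mapping

# Optional: the test index is yellow only because its replica shard cannot be
# allocated on a single node; dropping the replica turns it green
PUT test/_settings
{
  "index": {
    "number_of_replicas": 0
  }
}
```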
## Modify the input data
Before importing the rest of the sample data, add your Neptun code as a prefix to some of the values in the `salaries.json` file too:
- Each `gender` value shall be prefixed, e.g. `"gender":"NEPTUN female"`
- Each `company` value shall be prefixed, e.g. `"company":"NEPTUN McDonalds"`
1. Find the `salaries.json` file in the root of the repository. Open a PowerShell console here.

2. Edit the following command by adding your Neptun code in all uppercase, then execute it in PowerShell (do NOT change the quotation marks, only edit the 6 characters of the Neptun code!):

    ```powershell
    (Get-Content .\salaries.json) -replace '"gender":"', '"gender":"NEPTUN ' -replace '"company":"', '"company":"NEPTUN ' | Set-Content .\salaries.json
    ```

3. Verify the results; it should look similar (with your own Neptun code).

    The file must remain valid JSON! Please double-check the quotation marks around the values. If the result is not correct, you can revert the change made to this file using git (`git checkout HEAD -- salaries.json`) and then retry.

    The modified file shall be uploaded as part of the submission.
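If you would like a scripted spot-check in addition to eyeballing the file, something along these lines should work. This is only a sketch: it assumes you are still in the repository root and that `salaries.json` is the newline-delimited bulk file used in the next part.

```powershell
# Print the first modified source line to confirm the NEPTUN prefix is in place
Get-Content .\salaries.json | Select-String '"gender":"' | Select-Object -First 1

# Parse every non-empty line as JSON; any error here means the file is no longer valid
Get-Content .\salaries.json | Where-Object { $_.Trim() } | ForEach-Object { $null = ($_ | ConvertFrom-Json) }
```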
**IMPORTANT:** Adding your Neptun code is a mandatory step. It will be displayed on the visualizations created in the following exercises.
## Index many documents using the bulk API
And now, let us index these documents.
1. We can add multiple documents to the index using the bulk API. Issue the following command from the PowerShell window in the folder of the starter solution. (A PowerShell variant that checks the response programmatically is sketched after this list.)

    ```powershell
    Invoke-WebRequest 'http://localhost:9200/_bulk' -Method Post -ContentType 'application/json' -InFile .\salaries.json -UseBasicParsing
    ```
2. Check the response for errors. You will see a similar message if everything is OK (note the `errors` field in the response; it should be `false`).

    If you instead see errors in the response, it means the changes made to the source file resulted in an invalid JSON file. If this happens, you need to start over:

    1. Delete the `salaries` index by executing a `DELETE salaries` request in Kibana.
    2. Go back to the index creation step, then repeat the index creation and the indexing of the single document.
    3. Reset the changes made to the `salaries.json` file, and retry the replacement with special care regarding the quotation marks.
    4. Now, retry the bulk index request.
3. Execute a search with the query `GET salaries/_search` (using Kibana). This returns a few documents and tells us how many documents there are in total (since this search has no filter, the number of matching documents equals the total number of documents). There should be 1101 documents.

    If you see fewer documents, you might try the Refresh API to ensure Elasticsearch has finished all indexing operations: execute a `POST salaries/_refresh` request, then check the count again. If it is still not correct, you need to start over.
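For reference, here is a sketch of the bulk request from the first step rewritten so that the response can be inspected directly in PowerShell. It uses the same endpoint and file as above; run it instead of, not in addition to, the original command, otherwise the documents get indexed twice.

```powershell
# Send the bulk request and parse the JSON response
$response = Invoke-WebRequest 'http://localhost:9200/_bulk' -Method Post -ContentType 'application/json' -InFile .\salaries.json -UseBasicParsing
$result = $response.Content | ConvertFrom-Json

# "errors" is False when every operation in the bulk request succeeded
$result.errors

# Number of operations sent in this request
$result.items.Count
```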