Elasticsearch is a powerful search engine based on the Apache Lucene library.
This tutorial is about learning the basics of Elasticsearch, ideally covering all the topics one needs to know to pass the certification exams (Elasticsearch Engineer I & II).
This tutorial doesn't exactly follow the content of the official training, but you will learn to run Elasticsearch on a small single-node cluster, running locally for free on your laptop.
More specifically, you will learn to do the following:
If you are looking for the official Elastic training and certifications, you should check these links:
https://training.elastic.co/instructor-led-training/ElasticsearchEngineerI-Virtual
https://training.elastic.co/instructor-led-training/ElasticsearchEngineerII-Virtual
Normal steps to install a single-node Elastic environment would include downloading and extracting the archives (see the sketch after these commands), then starting Elasticsearch and Kibana:
<path_to_elasticsearch_root_dir>/bin/elasticsearch
<path_to_kibana_root_dir>/bin/kibana
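Before running those, you would first download and extract the archives from the Elastic website, e.g. (the version below is purely illustrative – check https://www.elastic.co/downloads for the current release):

# Illustrative version number – adapt to the current release
wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-6.4.0.tar.gz
tar -xzf elasticsearch-6.4.0.tar.gz
wget https://artifacts.elastic.co/downloads/kibana/kibana-6.4.0-linux-x86_64.tar.gz
tar -xzf kibana-6.4.0-linux-x86_64.tar.gz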
Described below is another way to run Elasticsearch, from within a container using Docker. This allows for a simpler, cleaner (but temporary!) installation of the Elastic stack, which makes it practical for learning purposes.
On Linux, use sysctl vm.max_map_count on the host to view the current value, and see Elasticsearch's documentation on virtual memory for guidance on how to change this value. Note that the limits must be changed on the host; they cannot be changed from within a container.
~$ sysctl vm.max_map_count
vm.max_map_count = 65530
On Linux, you can increase the limits by running the following command as root:
sysctl -w vm.max_map_count=262144
To set this value permanently, update the vm.max_map_count setting in /etc/sysctl.conf. To verify the change after rebooting, run sysctl vm.max_map_count again.
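For example, the line to add to /etc/sysctl.conf is:

# /etc/sysctl.conf – persists the limit across reboots
vm.max_map_count=262144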
Docker is a tool designed to make it easier to create, deploy, and run applications by using containers. Containers allow a developer to package up an application with all of the parts it needs, such as libraries and other dependencies, and ship it all out as one package.
By doing so, thanks to the container, the developer can rest assured that the application will run on any other Linux machine regardless of any customized settings that machine might have that could differ from the machine used for writing and testing the code.
This section describes how to use the sebp/elk Docker image, which provides a convenient centralised log server and log management web interface, by packaging Elasticsearch, Logstash, and Kibana, collectively known as ELK.
This type of installation is recommended to get started, but you might be limited later on when you need to configure Elasticsearch and restart it to apply the changes. So if you need to change your configuration, you will have to (re-)build your Docker image locally.
Note that the RPM and Debian packages of Elasticsearch configure the vm.max_map_count setting automatically; in that case no further configuration is required.
To pull the image from the Docker registry, open a shell prompt and enter:
docker pull sebp/elk
or this one if you haven't configured docker for your own user:
sudo docker pull sebp/elk
Run a container from the image with the following command:
docker run -p 5601:5601 -p 9200:9200 -p 5044:5044 -it --name elk sebp/elk
Note - The whole ELK stack will be started. See the Starting services selectively section to selectively start part of the stack.
This command publishes the following ports, which are needed for proper operation of the ELK stack:
5601 (Kibana web interface): http://localhost:5601
9200 (Elasticsearch JSON interface): http://localhost:9200/
5044 (Logstash Beats interface, receives logs from Beats such as Filebeat – see the Forwarding logs with Filebeat section).

This procedure is of course more cumbersome than pulling a pre-made Docker image, but it will allow us to tweak the configuration of our Elasticsearch instance.
Here we will:
>> Clone the official image repository:
/$ cd /tmp
/tmp$ git clone https://github.com/spujadas/elk-docker.git
>> Make your changes to the configuration of Elasticsearch:
...
(copy the configuration example from the next section here)
>> Build and run the ELK containers from that image:
You may need to remove any former container named 'elk' first (docker rm elk).
/tmp$ cd elk-docker/
/tmp/elk-docker$ docker-compose build elk
/tmp/elk-docker$ docker-compose up
Go take a cup of coffee or 3 :coffee::coffee::coffee:, it might take 5-10 minutes!
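Once the services are up, a quick sanity check against the default port confirms that Elasticsearch is answering:

curl http://localhost:9200/

Elasticsearch should reply with a small JSON document giving, among other things, the node name, cluster name, and version.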
Elasticsearch ships with good defaults and requires very little configuration. Most settings can be changed on a running cluster using the Cluster Update Settings API.
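For instance, a dynamic setting can be changed on a live cluster with a single call (the recovery throttle below is just an illustrative example of such a setting):

curl -H "Content-Type: application/json" -XPUT 'http://localhost:9200/_cluster/settings' -d '{
  "transient": {
    "indices.recovery.max_bytes_per_sec": "50mb"
  }
}'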
The configuration files should contain settings which are node-specific (such as node.name and paths), or settings which a node requires in order to be able to join a cluster, such as cluster.name and network.host.
You can configure:
jvm.options (JVM settings, such as the heap size)
log4j2.properties (logging configuration)
elasticsearch.yml (the main Elasticsearch configuration)
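For example, a minimal elasticsearch.yml for a single-node setup might look like this (all values are illustrative):

# config/elasticsearch.yml
cluster.name: my-cluster
node.name: node-1
network.host: 0.0.0.0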
Important Elasticsearch configuration mostly consists of settings which need to be considered before going into production, such as path.data and path.logs, cluster.name, node.name, network.host, the discovery settings, and the JVM heap size.
You have two options to index the data into Elasticsearch.
Enter your local Elasticsearch single-node cluster by opening a shell inside the Docker container running it:
$ docker ps -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
bab86b772b05 sebp/elk "/usr/local/bin/star..." 2 hours ago Up 2 hours 5044/tcp, 5601/tcp, 9200/tcp, 9300/tcp infallible_saha
$ docker exec -i -t bab86b772b05 /bin/bash
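From inside the container, you can talk to Elasticsearch directly, for example to check the cluster health:

curl http://localhost:9200/_cluster/health?pretty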
Using the option to restore a snapshot involves four easy steps:
# Create snapshots directory
mkdir elastic_restaurants
cd elastic_restaurants
# Download index snapshot to elastic_restaurants directory
wget http://download.elasticsearch.org/demos/nyc_restaurants/nyc_restaurants-5-4-3.tar.gz
# Uncompress snapshot file
tar -xf nyc_restaurants-5-4-3.tar.gz
This adds a nyc_restaurants subfolder containing the index snapshots. Add the nyc_restaurants dir to the path.repo setting in elasticsearch.yml, in the <path_to_elasticsearch_root_dir>/config/ folder (see the example below), then restart Elasticsearch for the change to take effect.
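For example, keeping the path placeholder used above:

# config/elasticsearch.yml
path.repo: ["<path_to_nyc_restaurants>/"]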
With Docker, any changes to your Elasticsearch configuration will be lost after a restart of the Docker container. One solution is therefore to edit the Docker image before re-running it, so that you can apply the changes mentioned above.
Register a file system repository for the snapshot (change the value of the "location" parameter below to the location of your nyc_restaurants directory):
curl -H "Content-Type: application/json" -XPUT ‘http://localhost:9200/_snapshot/restaurants_backup' -d ‘{
"type": "fs",
"settings": {
"location": "<path_to_nyc_restaurants>/",
"compress": true,
"max_snapshot_bytes_per_sec": "1000mb",
"max_restore_bytes_per_sec": "1000mb"
}
}'
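Once the repository is registered, you can list the snapshots it contains and then restore one by name (the <snapshot_name> below is a placeholder – use a name returned by the first call):

# List the snapshots available in the repository
curl 'http://localhost:9200/_snapshot/restaurants_backup/_all?pretty'
# Restore a snapshot by name
curl -XPOST 'http://localhost:9200/_snapshot/restaurants_backup/<snapshot_name>/_restore'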