Understanding social media engagement with Elasticsearch

A technical walk-through of our new engagement label feature

We’ve recently added a new Directory column that lets users understand how an influencer’s engagement rate compares to the wider network. You can read more about the feature in our technical product manager's blog.

In this article, we’re going to walk through how we built this feature from a technical perspective, with the help of Elasticsearch’s Percentiles Aggregation.

Step 1. Calculating engagement rate

We defined the engagement rate for a single profile by totalling the engagement (likes + comments + retweets, etc.) across all of the profile's posts, dividing that by the profile's follower count, and then dividing by the number of posts (to get the per-post average):

Engagement Rate Per Post = Total Engagement / Follower Count / Number of Posts

We only consider posts made within the last 30 days in the calculation and also ignore posts under 12 hours old to ensure sufficient time to collect engagement. Additionally, we don’t calculate the engagement rate for profiles with follower counts below 300.
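
The calculation and its filters can be sketched in Python as follows. This is a minimal illustration rather than our production code: the function name, the (created_at, engagement) tuple layout and the choice to return None when no posts qualify are all assumptions made for the example.

```python
from datetime import datetime, timedelta

MIN_FOLLOWERS = 300                 # profiles below this are skipped
MAX_POST_AGE = timedelta(days=30)   # only posts from the last 30 days count
MIN_POST_AGE = timedelta(hours=12)  # newer posts are still collecting engagement


def engagement_rate(posts, follower_count, now):
    """Return the per-post engagement rate, or None if it cannot be calculated.

    `posts` is a list of (created_at, engagement) tuples (a hypothetical
    layout), where `engagement` is the post's likes + comments + retweets etc.
    """
    if follower_count < MIN_FOLLOWERS:
        return None
    eligible = [
        engagement
        for created_at, engagement in posts
        if MIN_POST_AGE <= now - created_at <= MAX_POST_AGE
    ]
    if not eligible:
        # Assumption: no qualifying posts means no rate, rather than zero.
        return None
    return sum(eligible) / follower_count / len(eligible)


now = datetime(2016, 6, 30, 12, 0)
posts = [
    (now - timedelta(days=2), 40),
    (now - timedelta(days=10), 20),
    (now - timedelta(hours=3), 500),  # under 12 hours old, ignored
    (now - timedelta(days=45), 90),   # over 30 days old, ignored
]
rate = engagement_rate(posts, 1000, now)  # (40 + 20) / 1000 / 2
```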

Note here that we’re using follower counts from today, which may return a slightly deflated engagement score if a profile is rapidly building followers. We are exploring ideas to address this.

Step 2. Reindex all profiles with the calculated engagement rate

The next task was to populate our Elasticsearch index with the engagement rate for each profile. Firstly, we updated our mapping to include a new engagement_rate field, stored as a double-precision floating point number (double). Since we knew that we were going to be performing aggregations on this field, we ensured it used doc values to minimise heap pressure. Note that doc values have been the default on all fields except analysed strings since Elasticsearch 2.0, so you most likely don't have to do anything here. However, we’re still running 1.7 (though we plan to upgrade soon).

> curl 'http://scrunch-es-server:9200/profiles/_mapping?pretty=1'
"profiles" : {
  "mappings" : {
    "profile": {
      ## ... other fields
      "stats": {
        "properties": {
          "engagement_rate": {
            "doc_values": true,
            "type": "double"
          }
        }
      }
    }
  }
}

With the new field in place, we were ready to kick off our reindexing process. We reindex our dataset by spinning up a couple of extra worker nodes, each running an indexing service that fetches raw data from HBase, does some processing and indexes the result into Elasticsearch. This process currently takes a couple of days to run across our entire dataset.
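
As a rough illustration of the final indexing step, here is a sketch of building a request body for Elasticsearch's bulk API in Python. It assumes the bulk API is used at all, which this article does not state; the function name and the sample document are hypothetical, while the index name, type name and stats.engagement_rate layout follow the mapping shown above.

```python
import json


def bulk_index_body(profiles, index="profiles", doc_type="profile"):
    """Build a newline-delimited bulk API body from (id, document) pairs."""
    lines = []
    for profile_id, document in profiles:
        # Each document is preceded by an action line naming its destination.
        action = {"index": {"_index": index, "_type": doc_type, "_id": profile_id}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(document))
    if not lines:
        return ""
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


# Hypothetical processed profile, ready to be indexed.
body = bulk_index_body([
    ("profile-1", {"network": "twitter", "stats": {"engagement_rate": 0.0018}}),
])
```

The resulting string would be sent as the body of a POST to the _bulk endpoint.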

Step 3. Analysing the percentiles values per network

Once the engagement rate field was populated across our index, we were ready to understand how engagement rate percentiles compared across networks.

Percentiles are a statistical measure that can be used to understand how a single data point stacks up against an entire population. A data point at the "80th percentile" sits above 80% of the population and is therefore in the top 20%. 80 is said to be the data point's “percentile rank”.

Percentiles are usually calculated by sorting all data points and finding the data points at the percentiles of interest. Elasticsearch conveniently provides an aggregation that can do this work for you, called the Percentiles Aggregation. Since it is infeasible to sort large data sets spread across multiple nodes, Elasticsearch uses a percentile approximation algorithm called TDigest, which sacrifices some accuracy to save memory. Elasticsearch also provides a compression parameter that lets you balance the trade-off between memory and accuracy.
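
For a data set small enough to sort in memory, the exact sort-and-pick approach can be sketched in a few lines of Python. This uses the nearest-rank method, which is one of several common percentile definitions and not the TDigest approximation that Elasticsearch uses; the function name and sample values are made up for the example.

```python
import math


def percentile(values, percent):
    """Return the value at the given percentile using the nearest-rank method."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < percent <= 100:
        raise ValueError("percent must be in the range (0, 100]")
    ordered = sorted(values)
    # 1-based rank of the smallest value with at least `percent`% of the
    # data at or below it.
    rank = math.ceil(percent * len(ordered) / 100)
    return ordered[rank - 1]


# Made-up engagement rates for ten profiles, in no particular order.
rates = [0.004, 0.001, 0.010, 0.002, 0.007, 0.003, 0.009, 0.005, 0.008, 0.006]
median = percentile(rates, 50)
p80 = percentile(rates, 80)
```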

The Percentiles Aggregation accepts a field parameter, which specifies the field to aggregate on, and a percents parameter, which specifies the percentiles to return (by default, the 1st, 5th, 25th, 50th, 75th, 95th and 99th percentile values are returned).

Here’s an example of calculating the percentile values for all our Twitter profiles:

> curl 'http://scrunch-es-server:9200/profiles/_search?size=0&pretty=1' -d '{
  "query": {"filtered": {"filter": {"term": {"network": "twitter"}}}},
  "aggs": {
    "twitter_engagement_rate_percentile": {
      "percentiles": {
        "field": "engagement_rate",
        "percents": [25, 50, 75, 95, 99]
      }
    }
  }
}'
# ..
  "aggregations" : {
    "twitter_engagement_rate_percentile" : {
      "values" : {
        "25.0" : 1.6669449849440765E-4,
        "50.0" : 5.898961082048358E-4,
        "75.0" : 0.0018051943177894237,
        "95.0" : 0.008704529104503865,
        "99.0" : 0.027440084392766538
      }
    }
  }

Looking at the first entry in the response, we can see that the 25th percentile engagement rate for Twitter is around 0.00017, or 0.017%, and the 99th percentile is around 0.027, or 2.7%. Contrast that with Instagram, where the 25th percentile value is around 1.4% and the 99th percentile value is 17.8%: engagement rates appear to be much higher on Instagram.
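
To repeat the query for each network, the request body can be built and the response unpacked programmatically. The Python sketch below is illustrative only: the function names and the generic aggregation name are hypothetical, and percents is the parameter name given in the Elasticsearch documentation for this aggregation. It only builds and parses dictionaries, so sending the request is left to whichever HTTP client you prefer.

```python
import json


def percentiles_request(network, percents=(25, 50, 75, 95, 99)):
    """Build a search body returning engagement rate percentiles for one network."""
    return {
        "query": {"filtered": {"filter": {"term": {"network": network}}}},
        "aggs": {
            "engagement_rate_percentile": {
                "percentiles": {
                    "field": "engagement_rate",
                    "percents": list(percents),
                }
            }
        },
    }


def parse_percentiles(response, agg_name="engagement_rate_percentile"):
    """Return {percentile: engagement rate} from a search response."""
    values = response["aggregations"][agg_name]["values"]
    return {float(key): value for key, value in values.items()}


request_body = json.dumps(percentiles_request("instagram"))

# A trimmed response in the same shape as the Twitter output above.
sample_response = {
    "aggregations": {
        "engagement_rate_percentile": {
            "values": {"25.0": 1.6669449849440765e-4, "99.0": 0.027440084392766538}
        }
    }
}
twitter_percentiles = parse_percentiles(sample_response)
```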

Step 4. Labelling our profiles

With the percentile data collected for each network, we cached a copy and made it available to our web server. Then, we wrote some simple code to assign each profile’s engagement rate to one of the following buckets, based on its percentile rank within its network:

  • 0 - 30%: Low
  • 30% - 60%: Good
  • 60% - 85%: High
  • 85%+: Very High

The buckets were selected by our in-house influencer marketing experts based on their field experience. We plan to tune these over time.
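
The bucketing logic can be sketched in Python as below. It assumes the cache holds, for each network, the engagement rates at the 30th, 60th and 85th percentiles. The threshold numbers here are invented purely for illustration, and treating a rate that falls exactly on a boundary as belonging to the higher bucket is also an assumption.

```python
from bisect import bisect_right

LABELS = ["Low", "Good", "High", "Very High"]

# Hypothetical cached engagement rates at the 30th, 60th and 85th
# percentiles for each network (made-up numbers, ascending order).
CACHED_THRESHOLDS = {
    "twitter": [0.0002, 0.0009, 0.003],
    "instagram": [0.016, 0.035, 0.07],
}


def engagement_label(network, rate):
    """Map a profile's engagement rate to its label for the given network."""
    thresholds = CACHED_THRESHOLDS[network]
    # bisect_right counts how many thresholds the rate has reached, which
    # is also the index of the matching label.
    return LABELS[bisect_right(thresholds, rate)]


label = engagement_label("twitter", 0.002)  # between the 60th and 85th percentiles
```

Because only three threshold values per network are needed at request time, the web server never has to query Elasticsearch to label a profile.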

And that's that!