Research & Development

Probabilistic Data Structures and Algorithms for Big Data Applications
A technical book about popular space-efficient data structures that are extremely useful in modern big data applications.

Implementing a Fileserver with Nginx and Lua
With Nginx, it is easy to implement fairly complex file-upload logic with metadata and authorization support, without the need for a heavy application server. This article presents a basic implementation of such a fileserver using only Nginx and Lua.

Probabilistic data structures. Quotient filter.
In this article we continue to explore implementations of probabilistic sets and consider a modern successor of the Bloom filter called the Quotient filter. Such data structures work effectively when we need to handle billions of elements and want optimized memory access.

Probabilistic data structures. Bloom filter.
In this article we consider the Bloom filter, a popular implementation of a probabilistic set that efficiently determines whether an element belongs to a large set without storing every element or performing many comparisons.
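
To make the membership idea concrete, here is a minimal Python sketch of a Bloom filter. It is an illustration only, not the implementation from the article: the class name `BloomFilter`, the `capacity` and `error_rate` parameters, and the use of SHA-256 with double hashing are all assumptions chosen for brevity.

```python
import hashlib
import math


class BloomFilter:
    """A minimal Bloom filter over strings (illustrative sketch)."""

    def __init__(self, capacity, error_rate=0.01):
        # Standard sizing formulas for n expected items and target
        # false-positive rate p: m = -n ln(p) / (ln 2)^2, k = (m / n) ln 2.
        self.size = max(1, math.ceil(-capacity * math.log(error_rate) / math.log(2) ** 2))
        self.num_hashes = max(1, round(self.size / capacity * math.log(2)))
        self.bits = bytearray((self.size + 7) // 8)

    def _positions(self, item):
        # Double hashing (an assumption of this sketch): derive k bit
        # positions from two 64-bit slices of a single SHA-256 digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force h2 to be odd
        return [(h1 + i * h2) % self.size for i in range(self.num_hashes)]

    def add(self, item):
        # Set every bit position derived from the item.
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True means "possibly in the set"; False means "definitely not".
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


bloom = BloomFilter(capacity=1000, error_rate=0.01)
for word in ("nginx", "lua", "spark"):
    bloom.add(word)
```

After this, `"lua" in bloom` is always `True`, while a lookup for a word that was never added returns `False` except for occasional false positives, whose rate is governed by `error_rate`.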

Automatic terms extraction for Domain-specific corpora
Using simple frequency-based methods, such as the Domain Specificity method and Domain-Specific TF-IDF, it is possible to automatically extract and score terms for a given domain-specific corpus. In this article we use Python and its ecosystem to illustrate these methods in action.

Apache Big Data Europe 2015: Selected Talks
Apache Big Data Europe 2015 took place on September 28-30, 2015 in Budapest, Hungary. In this presentation I highlight some interesting talks in more detail.

A Simple Way to Find Turning points for a Trajectory with Python
Using the Ramer-Douglas-Peucker algorithm, we construct an approximated trajectory and find "valuable" turning points.
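
The approach can be sketched in plain Python as follows. This is a hedged illustration rather than the article's code: the function names `rdp` and `turning_points`, the tolerance `epsilon`, and the `min_angle` threshold (in radians) used to decide which direction changes count as "valuable" are assumptions of this sketch.

```python
import math


def _point_line_distance(p, a, b):
    """Perpendicular distance from point p to the line through a and b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    length = math.hypot(dx, dy)
    if length == 0.0:
        # Degenerate segment: fall back to point-to-point distance.
        return math.hypot(px - ax, py - ay)
    return abs(dx * (ay - py) - dy * (ax - px)) / length


def rdp(points, epsilon):
    """Simplify a polyline with the Ramer-Douglas-Peucker algorithm."""
    if len(points) < 3:
        return list(points)
    first, last = points[0], points[-1]
    # Find the interior point farthest from the chord first-last.
    index, dmax = 0, 0.0
    for i in range(1, len(points) - 1):
        d = _point_line_distance(points[i], first, last)
        if d > dmax:
            index, dmax = i, d
    if dmax <= epsilon:
        # Every interior point is within tolerance: keep only the endpoints.
        return [first, last]
    # Otherwise split at the farthest point and simplify both halves.
    left = rdp(points[:index + 1], epsilon)
    right = rdp(points[index:], epsilon)
    return left[:-1] + right


def turning_points(points, epsilon, min_angle):
    """Return vertices of the simplified path where the heading changes by at least min_angle."""
    simplified = rdp(points, epsilon)
    result = []
    for prev, cur, nxt in zip(simplified, simplified[1:], simplified[2:]):
        heading_in = math.atan2(cur[1] - prev[1], cur[0] - prev[0])
        heading_out = math.atan2(nxt[1] - cur[1], nxt[0] - cur[0])
        # Wrap the heading difference into [-pi, pi] before taking its size.
        turn = abs((heading_out - heading_in + math.pi) % (2 * math.pi) - math.pi)
        if turn >= min_angle:
            result.append(cur)
    return result


# A slightly noisy path that runs east and then turns north at (4, 0).
path = [(0, 0), (1, 0.01), (2, 0), (3, 0.02), (4, 0), (4, 1), (4.01, 2), (4, 3)]
simplified = rdp(path, epsilon=0.1)
turns = turning_points(path, epsilon=0.1, min_angle=math.pi / 4)
```

Here the small wobbles fall below `epsilon` and are dropped, so the simplified path keeps only the two endpoints and the corner, and the corner is reported as the single turning point.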

A Simple Way to Find Outliers in an array with Python
Using a basic definition of an outlier, we provide a simple Python function to detect such values and highlight them on a plot.
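
The entry above does not say which definition of an outlier is used, so the sketch below assumes one common choice, Tukey's rule: a value is an outlier if it lies more than 1.5 interquartile ranges below the first quartile or above the third. The function name `find_outliers` and the multiplier `k` are hypothetical, and plotting is omitted.

```python
import statistics


def find_outliers(values, k=1.5):
    """Return the values lying outside Tukey's fences (assumed definition)."""
    # Quartiles with linear interpolation over the observed data range.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # Keep the original order of the flagged values.
    return [v for v in values if v < lower or v > upper]


readings = [10, 12, 11, 13, 12, 11, 10, 95]
outliers = find_outliers(readings)
```

For these readings the quartiles are 10.75 and 12.25, giving fences at 8.5 and 14.5, so only the value 95 is flagged. The flagged values could then be drawn in a different color on a scatter plot.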

Twitter analysis for Strata+Hadoop World (BCN, 2014) with Apache Spark and D3
Using the official hashtag #StrataHadoop, we've made a basic analysis of Twitter activity during the Strata+Hadoop World conference that was held on 19-21 November 2014 in Barcelona, Spain.

Realtime Twitter Sentiment Analysis with Storm and Elasticsearch
In this article we build an Apache Storm topology that processes the Twitter stream and provides basic sentiment statistics based on Stanford CoreNLP.