Big Data / IoT

You might have felt inspired to create a new database, and here it comes: a tutorial for creating a new time series database.

Performance is always critical: most talks and presentations are dedicated to improving the performance of the JVM/.NET runtime or to sophisticated MPI algorithms. Nowadays it usually comes down to improving the performance of a big data infrastructure, and most of the components expose various metrics and properties for fine tuning. Here are 5 boosters for Apache Hive.
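To give a flavour of this kind of tuning, here is a minimal sketch that applies a few commonly recommended Hive session settings over JDBC. The endpoint and the particular properties are my own illustration, not necessarily the five boosters from the article.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

// Assumes the Hive JDBC driver (org.apache.hive:hive-jdbc) is on the classpath
// and a HiveServer2 instance is reachable at the (hypothetical) URL below.
public class HiveTuningSketch {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:hive2://hive-server:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement()) {
            // Session-scoped settings that are frequently recommended for Hive:
            stmt.execute("SET hive.execution.engine=tez");               // Tez instead of plain MapReduce
            stmt.execute("SET hive.vectorized.execution.enabled=true");  // process rows in batches
            stmt.execute("SET hive.cbo.enable=true");                    // cost-based optimizer
            stmt.execute("SET hive.exec.parallel=true");                 // run independent stages in parallel
            // ...then run queries in the tuned session.
        }
    }
}
```

Session-level SET statements are convenient for experiments; once a setting proves itself, it usually graduates to hive-site.xml.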

An interesting success story about the integration of Kafka Connect with HDFS, from engineers at the biggest internet radio station: Pandora Kafka-connect

‘Data is the new oil’ was the hype a few years ago. Industries are now adopting new technologies and starting to extract business value from the data they have collected; there are several examples from the airline industry of what might be done.

Machine Learning

If you have tried machine learning recently, you have heard about deep learning and how much time it takes to train a model. GPU hardware has improved training performance, but it does not allow much scaling: typically you are bound by the hardware provided by an IaaS or by your own PC. To overcome that, we usually start looking for distributed learning capabilities. One possible way to do this is to use juju + kubernetes and a distributed TensorFlow setup. A great write-up on how to do this can be found on Samuel’s blog.

There are a number of frameworks available for research- and production-level machine learning. Some provide distributed learning facilities, some do not, and the brand new Caffe2 library has great capabilities for distributed learning. Here is a link to its GitHub page. Caffe is a machine learning framework created by the Berkeley Vision and Learning Center. Distributed training was previously possible, but it was not so convenient and required working with forks like intel/caffe: link. Now things have changed, and teams can dedicate their efforts to easing the process of distributed training. You should give it a try and read a couple of tutorials on the main web site.

Neural networks and deep learning are opening new horizons for those interested in generating artistic content with machines. With recent research into GAN-style architectures, a brand new approach called BEGAN has been proposed. It enables a machine to create almost realistic face images. There is an article on the research and the approach, along with a few TensorFlow implementations. You can try it yourself: connect a video stream to the model and generate different faces for people.

One more example of a neural network and deep learning application: photo style transfer! Given two images, it transfers the style of one onto the other.

Just in case somebody is interested in how Google Translate does its translations under the hood, here is an improved seq2seq framework, written in TensorFlow and published this month.

An interesting article about TensorFlow’s distributed mode explains the basic concepts of a cluster and highlights the differences compared with traditional long-lived Hadoop or Spark deployments.

Java Performance

You should know that the biggest performance gains typically come from properly configuring the garbage collector in your JVM. There are a number of garbage collection algorithms, with various settings and different edge cases. You might find it interesting to learn what can cause the JVM GC to start cleaning memory; a few of the causes are listed here. You might learn something new about the JVM from that article.
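To see a couple of those triggers in action, here is a small self-contained sketch of my own (not from the article): it provokes an allocation-driven minor collection and an explicit System.gc(), then reports collector activity through the standard GarbageCollectorMXBean API.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcTriggers {
    public static void main(String[] args) {
        // Trigger 1: allocation failure. Filling the young generation with
        // short-lived objects forces a minor collection sooner or later.
        long sum = 0;
        for (int i = 0; i < 100_000; i++) {
            byte[] shortLived = new byte[10_240];
            sum += shortLived.length; // keep the allocation observable
        }
        System.out.println("allocated bytes: " + sum);

        // Trigger 2: an explicit request. Usually results in a full GC,
        // unless the JVM runs with -XX:+DisableExplicitGC.
        System.gc();

        // Report what actually happened via the management API.
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%s: %d collections, %d ms total%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
    }
}
```

Run it with -verbose:gc (or -Xlog:gc on Java 9) to watch each collection, together with its cause, as it happens.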

NO Jigsaw

Everyone is waiting for Java 9 to be released (countdown), and project Jigsaw is already a well-known feature, but Java 9 is going to bring us more fancy stuff. In particular, I’d like to draw your attention to a presentation that covers the changes in Java 9 GC logging and shows how pretty and parsable it will be. Also, compact strings (JEP 254) are going to be released, which can reduce memory footprint by up to 15% and improve throughput by up to 7%, thanks to faster and less frequent GCs and better cache utilization of these strings. In any case, that holds for applications that mostly use LATIN-1 🙂 The third theme is the new Stack-Walking API, which gives you the ability to safely walk stack frames without a huge memory footprint. Don’t forget to watch the video to the end to see how to quickly write your own Java just-in-time compiler in the Java language!
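For a quick taste of that Stack-Walking API, here is a minimal sketch of my own (not from the presentation). Unlike Thread.getStackTrace(), StackWalker exposes the frames as a lazy stream, so you only pay for the frames you actually inspect.

```java
import java.lang.StackWalker.Option;
import java.util.List;
import java.util.stream.Collectors;

public class StackWalkerDemo {
    public static void main(String[] args) {
        a();
    }

    static void a() { b(); }

    static void b() {
        // RETAIN_CLASS_REFERENCE lets us ask each frame for its declaring Class.
        StackWalker walker = StackWalker.getInstance(Option.RETAIN_CLASS_REFERENCE);

        // walk() hands us a lazy Stream of frames; we look at the top three
        // instead of materializing the whole stack.
        List<String> top = walker.walk(frames -> frames
                .limit(3)
                .map(f -> f.getClassName() + "." + f.getMethodName() + ":" + f.getLineNumber())
                .collect(Collectors.toList()));
        top.forEach(System.out::println);

        // The walker can also answer "who called me?" cheaply.
        System.out.println("caller: " + walker.getCallerClass().getName());
    }
}
```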

How Java pattern matching may look: link.
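For context, the boilerplate the proposal wants to eliminate is today’s test-then-cast idiom. A minimal runnable sketch of the current state of affairs (the Shape types are hypothetical, purely for illustration):

```java
public class ShapeArea {
    interface Shape {}
    static class Circle implements Shape { double radius = 2.0; }
    static class Square implements Shape { double side = 3.0; }

    // Without pattern matching, every branch tests the type and then
    // repeats it in an explicit cast.
    static double area(Shape shape) {
        if (shape instanceof Circle) {
            Circle c = (Circle) shape;
            return Math.PI * c.radius * c.radius;
        } else if (shape instanceof Square) {
            Square s = (Square) shape;
            return s.side * s.side;
        }
        throw new IllegalArgumentException("unknown shape: " + shape);
    }

    public static void main(String[] args) {
        System.out.println(area(new Circle())); // ~12.566
        System.out.println(area(new Square())); // 9.0
    }
}
```

Pattern matching, as sketched in the proposal, would fold the test, the cast, and the variable binding into a single construct.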

Spring Cloud Pipelines 1.0.0.M4 released.