Recherche : [ETL_&_big_data_streams_processing]

The log/event processing pipeline you can't have - apenwarr

Let me tell you about the still-not-defunct real-time log processing pipeline we built at my now-defunct last job. It handled logs from a large number of embedded devices that our ISP operated [...] Eventually our team's log processing system evolved to become the primary monitoring and alerting infrastructure for our ISP.

you mostly get told that you shouldn't be using unstructured logs anyway, you should be using event streams.
That advice is not wrong, but it's incomplete.

There's a file called /dev/kmsg which, if you write to it, produces messages into the kernel's buffer. Let's do that! For all our messages!

RAM is even more volatile than disk, and you have to reboot after a kernel panic. So the RAM is gone, right? Well, no. Sort of. Not exactly.

have the client to stream logs to the server. This is possible using HTTP POST and Chunked encoding,

The log uploader uses a backoff timer so that if it's been trying to upload for a while, it uploads less often. However, the backoff timer was limited to no more than the usual inter-upload interval.

Someone probably told you that log messages are too slow, or too big, or too hard to read, or too hard to use, or you should use them while debugging and then delete them. All those people were living in the past and they didn't have a fancy log pipeline. Computers are really, really fast now. Storage is really, really cheap.

How much are you paying for someone to run some bloaty full-text indexer on all your logs, to save a few milliseconds per grep?

logs · streaming · SystemDesign · HTTP · Internet · ETL_&_big_data_streams_processing · kernel · grep · backoff

February 22, 2019 08:24:06 AM GMT+01:00 · permalien

·

https://apenwarr.ca/log/20190216

Apache NiFi

Prog · ETL_&_big_data_streams_processing

July 20, 2015 07:52:20 AM GMT+02:00 · permalien

·

http://nifi.apache.org/

streamtools: a graphical tool for working with streams of data | nytlabs ← Research, thoughts, and process from The New York Times R&D Lab

We see a moment coming when the collection of endless streams of data is commonplace. As this transition accelerates it is becoming increasingly apparent tha...

Prog · ETL_&_big_data_streams_processing

May 31, 2015 06:56:56 PM GMT+02:00 · permalien

·

http://blog.nytlabs.com/2014/03/12/streamtools-a-graphical-tool-for-working-with-streams-of-data/

Talend ETL - Forge Tutorials

Take a quick tour of the Talend products and discover some basic and popular features. Talended experts are sharing experience with you by these tutorials!

Prog · ETL_&_big_data_streams_processing

May 31, 2015 06:56:33 PM GMT+02:00 · permalien

·

https://www.talendforge.org/tutorials/menu.php

Models and issues in data stream systems

CiteSeerX - Document Details (Isaac Councill, Lee Giles, Pradeep Teregowda): In this overview paper we motivate the need for and research issues arising from a new model of data processing. In this model, data does not take the form of persistent relations, but rather arrives in multiple, continuous, rapid, time-varying data streams. In addition to reviewing past work relevant to data stream systems and current projects in the area, the paper explores topics in stream query languages, new requirements and challenges in query processing, and algorithmic issues.

Prog · ETL_&_big_data_streams_processing

June 24, 2014 07:55:33 PM GMT+02:00 · permalien

·

http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.106.9846&rank=4

Finding Frequent Items in Data Streams

CiteSeerX - Document Details (Isaac Councill, Lee Giles, Pradeep Teregowda): The frequent items problem is to process a stream of items and find all items occurring more than a given fraction of the time. It is one of the most heavily studied problems in data stream mining, dating back to the 1980s. Many applications rely directly or indirectly on finding the frequent items, and implementations are in use in large scale industrial systems. However, there has not been much comparison of the different methods under uniform experimental conditions. It is common to find papers touching on this topic in which important related work is mischaracterized, overlooked, or reinvented. In this paper, we aim to present the most important algorithms for this problem in a common framework. We have created baseline implementations of the algorithms, and used these to perform a thorough experimental study of their properties. We give empirical evidence that there is considerable variation in the performance of frequent items algorithms. The best methods can be implemented to find frequent items with high accuracy using only tens of kilobytes of memory, at rates of millions of items per second on cheap modern hardware.

Prog · ETL_&_big_data_streams_processing

June 24, 2014 07:55:13 PM GMT+02:00 · permalien

·

http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.140.8304&rank=3

Kafka: Distributed Messaging System for Log Processing

Prog · ETL_&_big_data_streams_processing

June 24, 2014 07:54:22 PM GMT+02:00 · permalien

·

http://research.microsoft.com/en-us/um/people/srikanth/netdb11/netdb11papers/netdb11-final12.pdf

textbook: Data Streams: Models and algorithms

Prog · ETL_&_big_data_streams_processing

June 24, 2014 07:53:44 PM GMT+02:00 · permalien

·

http://charuaggarwal.net/streambook.pdf