Hi guys. So many people ask the same question time and again that I felt odd at first starting a discussion on it, but I thought it would help others who are in the same dilemma as me, given how many options are available. So here are the options first:
SQL (relational databases) - SQL Server - PostgreSQL - MySQL
NoSQL - Redis - MongoDB - Cassandra
Flat files - CSV - TeaFiles
Time series databases - kdb+ - InfluxDB
This is what I'm looking for:
1. Storing live streaming data to a database.
2. Running a script that converts tick data to OHLC and saves the OHLC in the same database.
3. Running one more script (or extending the script above) to create one more type of chart pattern (Renko) and save it in the same database.
4. Running one last script that fetches data from the same database and trades using the Zerodha trading API.
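For context on step 2, a minimal Python sketch of the tick-to-OHLC conversion, assuming ticks land in a pandas DataFrame with a datetime index and a last_price column (names are illustrative, not prescriptive):

```python
import pandas as pd

def ticks_to_ohlc(ticks: pd.DataFrame, interval: str = "5min") -> pd.DataFrame:
    """Resample raw ticks into OHLC candles for the given interval."""
    # pandas resamplers have a built-in ohlc() aggregation
    ohlc = ticks["last_price"].resample(interval).ohlc()
    # drop buckets with no ticks (e.g. outside market hours)
    return ohlc.dropna()
```

The resulting frame can then be written back to whichever database gets chosen. (Note that volume handling would depend on whether the feed reports cumulative or per-tick volume.)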
Note: I will be using Python, so whichever database you suggest should have a Python library. Most important: the database should be free and open source.
Latency: does not matter much, as it will be a medium-frequency model (5 minutes to 1 hour). Scalability: the database should be able to handle tick data for at least 200 equity scrips or more.
All suggestions on topic with reasonable explanations are welcome. Anyone who is already in the role and/or associated with a prop trading firm and wants to suggest something off topic is also welcome.
Since your requirement is not latency critical and you're only covering a couple hundred instruments, Postgres or MariaDB (MySQL) should serve you well. We store billions of rows in these databases.
If you want a true time series database, then check out https://www.timescale.com (built atop Postgres).
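Not from the post above, but for anyone evaluating it: a minimal sketch of turning a plain Postgres table into a TimescaleDB hypertable from Python with psycopg2 (the schema and connection string are illustrative):

```python
import psycopg2

conn = psycopg2.connect("dbname=market user=trader")  # assumed local setup
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS ticks (
            time        TIMESTAMPTZ NOT NULL,
            instrument  INTEGER     NOT NULL,
            last_price  NUMERIC,
            volume      BIGINT
        );
    """)
    # create_hypertable() makes TimescaleDB partition the table by time
    cur.execute("SELECT create_hypertable('ticks', 'time', if_not_exists => TRUE);")
```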
I use kdb+, mostly for IoT (Internet of Things) data, which is very much like financial data. For streaming use cases it's the best in the market. However, it has a steep learning curve: q is a very Fortran-like language. But it's robust and very stable in production scenarios.
@pinkpanther I used kdb+ a lot while at an investment bank. Now, as a stand-alone HFT trader, I can't afford it. Do you work for a firm, or has Kx Systems slashed its prices? I am aware of the free 32-bit version, but I don't think that would suffice for my needs.
Please suggest the best way to store tick data for all F&O stocks, convert the tick data to OHLC etc., and push out updated info for all F&O stocks.
Note: 1. I want tick data in full mode. 2. I would be running analysis on the tick data, after which it should push out updated info for all F&O stocks.
So requesting you to kindly suggest a better way of handling such large tick data. (The maximum tick data I want to store is for only one day; if possible, one week of data would be great.)
@Sudheer @sujith Has anybody compared InfluxDB with TimescaleDB or Arctic for storing tick data? I am currently using InfluxDB but want to know the opinion of others.
Hello everyone. Thanks for this great discussion. Has anyone here practically tried https://www.timescale.com/ as suggested by Kailash above?
Especially for storing and analyzing NSE tick data. Can someone please comment on whether TimescaleDB is capable of such work or not?
I am a private trader and I plan to use it for personal analysis. So will the free community edition suit my needs, or will I need to look at the paid version for proper tick data analysis? - https://www.timescale.com/products
Any comments or suggestions are most welcome. Thanks a lot.
PS: @techietrader, since you are the original thread starter, please do update us about your final decision. Which particular database did you finally choose for this work, and how did it perform? Thanks a lot.
I am new to Kite Connect and am building my own system in Java. I'm just at the point of picking a time series DB for the tick data. I looked at InfluxDB but realized that their Windows 10 support is still in beta, and since I am working on a Windows 10 laptop at this time, I didn't want to invest the time and energy InfluxDB requires. Rather, I am looking to get started with an immediate solution and then migrate the ticker data system to my Linux box to operate independently.
Now I am looking at QuestDB, a pure Java system which seems to provide the features needed.
Any pointers? Any experience that you can share?
Data capture, even for a large number of tokens, should not be a problem. What I am looking for is on-the-fly data aggregation to build OHLC data sets for higher time frames from the tick data, without having to fetch the data from the DB to the application layer for aggregation.
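For what it's worth, this kind of server-side aggregation is what time series extensions expose. A hedged sketch using TimescaleDB's time_bucket() with its first()/last() aggregates (Python for illustration, though the poster is on Java; the ticks schema is assumed from earlier):

```python
import psycopg2

# Roll ticks up into 5-minute OHLC candles entirely inside the database
OHLC_SQL = """
    SELECT time_bucket('5 minutes', time) AS bucket,
           instrument,
           first(last_price, time) AS open,
           max(last_price)         AS high,
           min(last_price)         AS low,
           last(last_price, time)  AS close
    FROM ticks
    WHERE time >= now() - interval '1 day'
    GROUP BY bucket, instrument
    ORDER BY bucket;
"""

conn = psycopg2.connect("dbname=market user=trader")
with conn, conn.cursor() as cur:
    cur.execute(OHLC_SQL)
    candles = cur.fetchall()  # only the aggregated rows cross the wire
```

QuestDB exposes a similar capability through its SAMPLE BY clause.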
Another feature I am curious about is running indicator computation code directly in the DB. Does anyone have experience with that? And if you are running the computations in the application layer, what are you using: DIY or a TA library?
I'm starting fresh with websockets, while the requirements remain the same (as at the top of the thread). I had been downloading the data in .csv format but am considering moving to a DB for multiple practical reasons. I see the comments are quite old. Is there a newer suggestion for a DB, given that I found https://www.timescale.com/ is no longer free? Any suggestions are welcome. All my development is in Python.
I want to say: the right tool in the right context. You can use any database, but make sure you choose the right one for your logic. You can use any open source DB, SQL or NoSQL.
If you want to do latency-critical operations like quick trading, then try Redis. It is the best in-memory database in the world and has vast options for doing your operations without hassle. Redis is NoSQL and very much faster than disk-based databases like Postgres, MongoDB, ClickHouse, etc.
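As an illustration of the in-memory hot-path idea, a small sketch with redis-py that keeps only the latest tick per instrument in a hash (key and field names are made up):

```python
import redis

r = redis.Redis()  # assumes a local Redis server

def on_tick(tick: dict) -> None:
    # HSET is O(1); one small hash per instrument keeps reads/writes cheap
    r.hset(f"tick:{tick['instrument_token']}",
           mapping={"ltp": tick["last_price"], "ts": str(tick["timestamp"])})

def latest_price(instrument_token: int) -> float:
    return float(r.hget(f"tick:{instrument_token}", "ltp"))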
I can't write more because it is a big topic. As retailers, our use is very limited compared with institutions. In the case of Zerodha, they mostly use Postgres and Redis; I think they just started with ClickHouse, which is a SQL DB.
To conclude: an in-memory DB for critical operations, a disk-based DB for non-critical operations. Selecting SQL or NoSQL is purely subjective.
Answering this in September 2024. I tried almost all of the options available. My choice is TimescaleDB (with compression enabled). It is pretty fast and robust for storing tick data, and it doesn't need fancy servers either (I am running mine on an Asus Chromebox 3 with an i7 processor and 16 GB RAM).
@ketansp Well, if you just want to store and sort tick data, then 1 GB of RAM with an i5 or base-variant processor is sufficient, but if you want to do many critical operations and heavy computing, then you need a high-end machine. Why do we need to store tick data? For trading, right? So a good CPU and RAM are essential for trading. All databases are good, but your logic and context are what matter.
TimescaleDB is a good DB built on PostgreSQL, but for trading purposes an in-memory database and NoSQL are good; you will learn that through the journey.
@ANL my apologies for not sharing the full context. I have sectioned it into 3 phases.
1. Hot phase: this is the realtime tick data and the realtime processing that needs to happen on it (something like rolling it up into OHLCV candles or any such custom roll-up function). This is stored in application memory, which helps me manage it in an event-driven way.
2. Every 10 seconds, I flush all collected ticks/candles to TimescaleDB. This is the warm phase. (A rough sketch follows after this list.)
3. After a period of 7 days, TimescaleDB compresses the data. This is the cold phase.
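A rough sketch of the warm-phase flush in Python: buffer ticks in memory, then bulk-insert into TimescaleDB every 10 seconds. The schema is the hypothetical ticks table from earlier; names are illustrative, not the poster's actual code.

```python
import threading
import psycopg2
from psycopg2.extras import execute_values

buffer = []
lock = threading.Lock()

def on_tick(tick):
    # hot phase: ticks just accumulate in application memory
    with lock:
        buffer.append((tick["timestamp"], tick["instrument_token"], tick["last_price"]))

def flush(conn):
    # warm phase: move everything collected so far into the database
    global buffer
    with lock:
        rows, buffer = buffer, []
    if rows:
        with conn, conn.cursor() as cur:  # commits on exit
            execute_values(
                cur, "INSERT INTO ticks (time, instrument, last_price) VALUES %s", rows)
    threading.Timer(10.0, flush, args=(conn,)).start()  # re-arm the 10 s timer

conn = psycopg2.connect("dbname=market user=trader")
flush(conn)
```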
I need to store full tick data as I need to create custom roll-ups. Also, with tick data, backtesting logic can be tested at the most granular level.
Hope this helps. Happy to explain more if needed.
PS: I use Redis pub-sub for realtime communication between different application threads. That also helps me maintain an event-driven approach.
Ketan, this is interesting. I use Redis for all 'hot' data and am using Node.js for converting ticks to candlesticks and so on. But I do feel the limitations of not keeping the candlesticks in SQL. I wanted to ask what might be a good way to use TimescaleDB to ingest ticks and store them as candlesticks, and do you think we would need a queueing system, or can TimescaleDB handle (with the right hardware), say, tick data from 9K symbols (assuming we max out Kite's limits)? The lowest timeframe we use is 1-minute candlesticks.
@TradeTastic apologies for missing your response. Let me try to answer it.
I have tried this setup at a max capacity of 9k instruments' tick data. It works well on fairly standard hardware.
1. Have one node.js thread just subscribe to ticks from the Kite Ticker instance. As soon as you receive a tick, format it, assign your system's instrument ID, and send it to the pub-sub system. (I would not advise a queueing system here because of latency.) I use Redis pub-sub with a fire-and-forget approach. (See the sketch after this list.)
2. All application threads that need live prices should simply subscribe to this pub-sub channel. You can have one universal channel or separate channels for a few select instruments. (Having separate channels for all 9k instruments does not scale; I have separate channels only for the instruments I actually trade.)
3. Have a node.js thread whose sole job is to collect ticks from this pub-sub channel and save them into the database every n seconds. In my experience, a 30-second flush interval is good for 9k instruments' tick data. Ensure you do not have any additional layers like an ORM or extra formatting; this should be a simple bulk insert query that can insert lakhs of records in one go. You might have to tweak the bulk insert limit of the PostgreSQL server for this.
4. Have another thread which consumes these ticks and converts them into candles. I tried to do this part inside TimescaleDB and regretted it because of 1. the lack of flexibility to define custom roll-up functions like Renko, volume bars, and dollar bars, and 2. the lack of ability to execute this in realtime. (There is no database which can create candles in realtime, at scale, with single-digit latencies. Trust me on this.)
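A rough Python translation of steps 1 and 2 (the poster uses node.js; the channel name and tick fields are assumptions, not from the post):

```python
import json
import redis

r = redis.Redis()

# Step 1: the ticker thread -- fire-and-forget publish, no queueing
def on_tick(tick: dict) -> None:
    r.publish("ticks:all", json.dumps(tick))

# Step 2: any consumer thread -- subscribe and react to live prices
def consume() -> None:
    sub = r.pubsub()
    sub.subscribe("ticks:all")
    for msg in sub.listen():
        if msg["type"] == "message":
            tick = json.loads(msg["data"])
            # ...update candles, strategies, etc.
```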
@ketansp thank you so much for responding. I've been doing it in a similar fashion but did not use pub-sub. Now that you mention it, I am curious about the need for pub-sub: are you using it in lieu of a queue? What kind of future problems can going the pub-sub way eliminate for me? I seem to be doing it directly and haven't hit any (not sure if I have hit any invisible issues).
I want to thank you for the notes on TimescaleDB. I truly believed (after reading their docs) that it would be the holy grail and, quite like you said, even handle candle conversions... but it looks like for 9K symbols I have to forgo that.
Though you write your ticks/candles into Postgres, do you run 1-minute calculations? If so, can Postgres handle 1-minute calcs at scale, or do you think Redis itself can do it? Currently I am doing it on Redis and it's not the easiest thing, but I don't know if Postgres can handle it. Let me know! Thanks again for the insights on latencies; this is one of the most helpful posts I've found here.
To conclude: an in-memory DB for critical operations, a disk-based DB for non-critical operations. Selecting SQL or NoSQL is purely subjective.
Postgres is not good for critical operations because it is disk based. You should consider an in-memory database for your use case. Other factors also affect disk-based DBs; there is a fundamental cost to accessing data from disk. They first write to disk and then read from disk, and speed also depends on the OS you are using; on Windows, a disk-based DB will certainly add overhead.
Just look at how Zerodha uses Redis in their use cases; they manage billions of rows and keys per day with a maximum delay of 1 to 2 ms, according to Kailash.
With a disk-based DB there is no guarantee of data retrieval within a specific timeframe. For example, if you want to retrieve some keys in 10 ms, even if your indexing and retrieval logic are perfect, it will sometimes exceed that, whereas an in-memory database can at least guarantee an average below 10 ms. What I mean is that disk handles IO operations with more difficulty than memory.
@TradeTastic pub-sub primarily solves the communication problem when your application deployment needs to span more than one server. While you are on a single server, inter-process communication suffices, and if you are running a single application/monolith, you don't even need that.
@TradeTastic and @ANL even if you put all ticks/candles in Redis and do minute-level processing, that approach is not going to scale. Even the simplest use case of 1-minute candles in Redis for 9k instruments means firing 9k queries at the same time. While Redis will do this job much more efficiently than Postgres, this pro-active approach is flawed (if one has to scale); it will scale only up to a couple hundred instruments at most. (What Kailash Nadh talked about in his talk is an example of a dumb cache, where you do not need any additional processing and data can be served to the user as is.)
Rather, think reactive/event-driven for your analysis.
1. If you need to create 1-minute candles for all 9k instruments, there should be a single owner worried about the events which change a candle (a new tick or the rollover of the minute). Only when such an event happens should the thread do the calculation; otherwise it sits silently. While it might consume a lot of information, it produces a comparatively small piece of data. Once a candle is formed, this thread should just broadcast to the world that this is the new candle for such-and-such instrument and forget about what others choose to do with it. (Sketched below.)
2. If you have a strategy which needs 20 candles of 1 minute each, then it is that thread's responsibility to keep the last 19 candles in its memory and be on the lookout for the next candle. There should be no need for a DB/Redis trip. As soon as it receives the next candle from the first thread, it should do its processing and broadcast the result to the rest of the world, without worrying about what they do with it.
3. And so on.
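A minimal sketch of the single-owner aggregator from point 1, assuming ticks carry an epoch-seconds timestamp (all names are illustrative):

```python
import json
import redis

r = redis.Redis()
candles = {}  # instrument_token -> current 1-minute candle

def on_tick(tick):
    minute = tick["timestamp"] - tick["timestamp"] % 60
    token, price = tick["instrument_token"], tick["last_price"]
    c = candles.get(token)
    if c and c["minute"] != minute:
        # rollover event: the previous candle is final -- broadcast and forget
        r.publish("candles:1m", json.dumps({"instrument": token, **c}))
        c = None
    if c is None:
        candles[token] = {"minute": minute, "open": price, "high": price,
                          "low": price, "close": price}
    else:
        c["high"] = max(c["high"], price)
        c["low"] = min(c["low"], price)
        c["close"] = price
```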
@TradeTastic and @ANL another reason why I am so keen on doing this in the application memory layer instead of Redis is that this is the layer which can be scaled very easily when you need to go horizontal.
@ketansp What I stated was the performance difference between Postgres and Redis. As I said earlier, "the right tool in the right context." I don't need 9K instruments; even just 10 instruments is enough, so in my case Redis is better and more efficient. Also, I don't follow conventional methods. Of course, for 9K instruments we have to use disk-based SQL or NoSQL databases; as of now, there is no other in-memory option. Redis can scale up, but it must be done with very careful indexing; otherwise it will certainly become an overhead.
I think MongoDB is the best for managing 9K easily, because it's NoSQL and has good data collection methods and high concurrency.
@ANL, for 9K, why do we really need disk-based? Unless of course you are holding historical data in memory. But if you took the ticks and processed them into candles, I am sure you could hold a full day's, or even 2-3 days', worth of data in memory.
@TradeTastic I would say 9K is a fantasy count. The number of actually liquid, actively tradable instruments is far less than 1K, say 500 instruments. 9K is really for someone building candle charts or doing complete market analysis, like broking or charting firms.
For 500 instruments, Redis is the best choice. As I said, I don't need candles; even if I did, I can calculate 1-minute candles in just under an 80 ms average, which is my benchmark figure.
For 9K, why do we really need disk-based?
Yes, you will need disk-based storage in the future. Scaling an in-memory DB is expensive compared with disk-based scaling because it uses RAM; I am saying this after all kinds of studies and experiments. For 500 instruments, the daily tick data size would be between 1.2 GB and 1.6 GB (L2), so if you are storing, say, 3 days of data, it would be about 4.8 GB. In fact, most conventional methods, like candle or technical analysis, need at least 5 days of data to work without wrong values, so for 5 days it would be around 8 to 10 GB. And you can't always get accurate TA values from just 5 days; realistically it is more like 9 to 21 days, which would be around 35 GB approximately. Now you can judge whether an in-memory DB is feasible for multi-day trading. There are many more factors that I can't write about here. If during your testing or research your strategy turns out to need more than 5 days of data, then you must change the database, and changing the database is not an easy task unless you are a DB expert.
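A quick back-of-the-envelope check of those figures, taking the upper bound of ~1.6 GB of L2 ticks per day for 500 instruments:

```python
# Rough RAM footprint if all ticks were held in an in-memory store
per_day_gb = 1.6  # upper bound from the post above
for days in (3, 5, 21):
    print(f"{days} days ≈ {days * per_day_gb:.1f} GB")
# 3 days ≈ 4.8 GB, 5 days ≈ 8.0 GB, 21 days ≈ 33.6 GB
```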
@ANL I totally understand where you are coming from. For up to 500 instruments, Redis might work just fine. The 80 ms benchmark that you have, again, if it serves the use case, well and good. I have been able to achieve a 5 ms benchmark for 9,000 instruments by doing all the processing in application memory.
@ANL your calculations for the size of data required are totally accurate. However, you don't need to process all that data while building realtime systems. You just need to consume large numbers of ticks but perform a very tiny operation on top of them. You don't even need a DB for this; it can be handled in the application layer itself. One should store all ticks in a disk-based DB, not for realtime analysis but for historical work/backtesting.
@TradeTastic just imagine that you don't have any DB for realtime execution and design your system completely in the application layer to keep latencies minimal. For backtesting, dump all ticks into a DB (it does not matter which) and, while analysing, fetch the ticks and re-apply the candle formation logic.
I have been able to achieve 5ms benchmark for 9000 instruments by doing all the processing in application memory.
What programming language are you using? I am using Python. 5 ms is very good, but there are other issues that may bite if we depend only on the application layer. Say you have calculated the candle information and then need to retrieve market data again; in that case, I think some issues may crop up in the future. That's why I don't use the application layer for my purpose; I depend completely on Redis. The 80 ms I mentioned is just for calculating the candles with persistence enabled; if I removed persistence, I would get it down to around a 32 ms average.
My current setup is faster and does more computation than normal candles; it takes just 1.29 ms for 100 instruments using Redis. I don't retrieve all the data; I only need to retrieve the data once, which saves a lot of time.
All my development is in Python.
You can try out ClickHouse. We use it in our chart backend as a data warehouse.
We have been on the above stack for the kite-charts backend for quite some time.
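For anyone wanting to try this, a minimal sketch of a ClickHouse ticks table via the clickhouse-driver Python package (the schema and names are illustrative, not Zerodha's actual setup):

```python
from clickhouse_driver import Client

client = Client("localhost")  # assumes a local ClickHouse server, native port
client.execute("""
    CREATE TABLE IF NOT EXISTS ticks (
        time        DateTime64(3),
        instrument  UInt32,
        last_price  Float64
    ) ENGINE = MergeTree()
    ORDER BY (instrument, time)
""")
```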