Hello, is there a code snippet to refer to for converting market ticks received via WebSocket into 15-minute candle (OHLC) data and storing it in multiple MySQL tables for multiple scrips? (In Python if possible.)
2) What is the average response time of the bulk API with 200 scrips? Is it possible to poll this API every 15 minutes? What is the maximum number of bulk API calls allowed per hour?
1) We don't have sample code to generate candles. One of the ways is to create 15-minute candles from 15 one-minute candles. To create one-minute data, you need to listen to all the ticks that come in and compute candles as follows: Open -> LTP of the first tick of that minute. Close -> LTP of the last tick of that minute. High -> highest LTP of that minute. Low -> lowest LTP of that minute.
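Not official sample code, but a minimal sketch of the logic described above; the tick field name ('last_price') and the candle dict layout are assumptions, not part of the answer.

# Accumulate ticks into a 1-minute candle, then combine 15 one-minute candles
# into one 15-minute candle.
def update_minute_candle(candle, tick):
    ltp = tick['last_price']
    if candle is None:  # first tick of the minute opens the candle
        return {'open': ltp, 'high': ltp, 'low': ltp, 'close': ltp}
    candle['high'] = max(candle['high'], ltp)
    candle['low'] = min(candle['low'], ltp)
    candle['close'] = ltp  # close keeps tracking the latest tick
    return candle

def combine_candles(minute_candles):
    # e.g. 15 consecutive 1-minute candles -> one 15-minute candle
    return {
        'open': minute_candles[0]['open'],
        'high': max(c['high'] for c in minute_candles),
        'low': min(c['low'] for c in minute_candles),
        'close': minute_candles[-1]['close'],
    }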
2) There are no special restrictions on the bulk fetch API call. All HTTP calls to Kite Connect are capped at 3 requests per second. You can make any number of requests to the bulk fetch API; just make sure you don't cross the 3-requests-per-second limit.
I am writing the data from the WebSocket to a SQL table every 10 seconds. For a 15-minute candle I keep open = first LTP and close = last LTP, and update high and low with an 'if' condition, without using 1-minute candles. Is this efficient? 2) Does the API support "bracket order with trigger entry" as of now?
1) I think you should write to a local in-memory dictionary and push the dictionary to SQL at EOD. That is of course if your RAM and/or virtual memory configuration can hold the data structure for your instruments in memory.
2) BO with trigger: I am guessing you mean an SL-M or SL based entry. That's something they are planning to add in a couple of weeks in Kite API 3.0.
The best way to do this is to write your ticks to an in-memory database like Redis and then use something like Python pandas to compute the candles from the tick data.
This involves some learning curve but is super fast, though you will be limited by memory if you are doing it for more than a few hundred instruments simultaneously.
Time permitting, I may make a detailed blog post on this about a month down the line.
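To make the suggestion above concrete, here is a minimal sketch under a few assumptions (a local Redis server, ticks arriving as dicts with 'instrument_token', 'last_price' and 'exchange_timestamp' fields, one Redis list per instrument); it is not the poster's actual code.

import json
import redis
import pandas as pd

r = redis.Redis()  # assumes redis running on localhost

def store_tick(tick):
    # push each tick onto a per-instrument list, serialised as JSON
    key = 'ticks:{}'.format(tick['instrument_token'])
    r.rpush(key, json.dumps(tick, default=str))

def candles_from_redis(instrument_token, freq='15min'):
    raw = r.lrange('ticks:{}'.format(instrument_token), 0, -1)
    df = pd.DataFrame([json.loads(x) for x in raw])
    df['exchange_timestamp'] = pd.to_datetime(df['exchange_timestamp'])
    return df.set_index('exchange_timestamp')['last_price'].resample(freq).ohlc()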
Instead of Redis, you can use MongoDB since it supports native object storage, and you do not need to convert pickled/str data in Redis back into a Python object (or whatever programming language you use). To speed up operations you can configure MongoDB to run off a small-capacity SSD, say 128 GB. This should be able to handle a few days' worth of tick data. You can always back up to a secondary NoSQL/RDBMS running off a regular HDD every few days if you want to preserve older tick data for backtests or any other reason.
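For reference, a minimal pymongo sketch of that idea (the database/collection names and tick fields are assumptions):

from pymongo import MongoClient

client = MongoClient()             # assumes mongod on localhost:27017
ticks = client['market']['ticks']  # one collection for raw ticks

def store_ticks(tick_list):
    ticks.insert_many(tick_list)   # each tick is already a dict

def ticks_for(instrument_token):
    # documents come back as plain Python dicts, no json.loads needed
    return list(ticks.find({'instrument_token': instrument_token}))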
@MAG: Did you try both Redis and MongoDB? If someone doesn't need to store tick data, only 5-minute/15-minute data, which one is most useful? Also, did you prepare any blog about the implementation procedure?
I have tried both. Redis is obviously faster to insert and retrieve data since it's in memory. But what many people do not mention is that Redis converts the data to a UTF-8 encoded str. When you read data back in from Redis, you need to convert this str back into a Python object using eval or json.loads, and that kills whatever efficiency you gain by using Redis. On the other hand, if you use Mongo, it's a bit slower than Redis, but when you read data back from Mongo it is already a Python dict object. Here are results from my tests:
Read data from Redis and write it back to Redis as str without converting to a Python object, for 49k full ticks: 25 seconds (but it's useless as you need to convert to a Python object).
Read from Redis, do eval or json.loads, and write the dict to Mongo: 7.5 minutes for the same data. (Eval or json.loads adds 7 minutes to the flow above.)
Read from mongo and write to mongo for the same data on normal HDD : 2.5 minutes
Read from mongo and write to mongo for the same data on NVME SSD : 1.4 minutes
Note here that the 49K ticks were all the ticks for a particular instrument for one day, from 9:15 am to 3:30 pm. And that took just 2.5 minutes to read and write on Mongo using a slow HDD, which means you could have data processed and converted into candles in near real time.
In fact, when I am generating 1-minute candles, I get my candles formed within 1 second of the minute rollover, whereas some of the popular charting tools available take up to 20-30 seconds to update the candle after the minute rollover.
No blog yet. But looking at how people are lost, I think I will make one. And also make a public github repo with code samples. It will take some time though as I need to balance other demands on my time.
@MAG Hi, what is your opinion of PostgreSQL, and can you suggest where I can learn about using WebSocket data, performing operations on it, storing and retrieving it, placing orders etc.? Also, how does one put the stream into a Python dict or object for analysis, and what is the limit of a Python variable compared to databases? Thanks
@archulysses I haven't tried Postgres. In fact I haven't used Postgres for years now. However, I did read up on it and it seems it supports BSON and the performance is pretty good. Whether you want to use Postgres or Mongo would depend on your final solution and how you want to query the data. For example, in MongoDB it seems you can search based on nested dicts or lists; you can't do that in Postgres, it seems. That's from what I have read; I will repeat that I haven't used Postgres at all. As far as Mongo is concerned, it has some good features and scores over Redis in terms of usability.
As for your query about putting a WebSocket data stream into a Python dict and doing analysis on it, I will be posting some info soon. For Python dicts, it's basically a question of how much memory you have. If you are doing one or two stocks intraday, you can easily keep it in memory without using an external DB. But if the script crashes or terminates, your data is lost, and even if you restart the script, you can't pick up from where you left off. So it is always advisable to use a persistent external store.
MAG, Great Post. This is definitely one of the biggest problems the community is currently facing.
I understand that you are getting ticks in dicts and relying on pandas to form candles and do subsequent calculations/analysis. However, I have faced issues with the conversion of dicts to pandas and found it slow, especially when I want to work with 200+ symbols simultaneously. It may take anything between 1 and 2 seconds, which I am not happy about.
Any observations/suggestions here?
My experiments on the topic:
1. Nowadays we have excellent time-series databases, and I have experimented with InfluxDB. InfluxDB is designed for this exact use case: keep storing incoming real-time time-series data rapidly and then downsample it (i.e. make 1-minute or whatever-period candles automatically at the end of each minute) and store that. It does a remarkable job of storing whatever Kite throws at it without breaking a sweat. The problem is retrieval. The documentation clearly says it needs an SSD and a lot of RAM, but I am using a traditional disk with 8 GB RAM on an i3. On my machine it takes 2-3 seconds to load data into pandas. Unlike traditional databases, which may take time to understand and implement, it is extremely simple to set up: it has a Python library and all database capabilities are exposed there. A beginner can get it up and running in 2-3 days. But like I said, the latency of loading data into pandas is on the higher side.
This is similar to the MongoDB solution, except MongoDB is replaced with InfluxDB, which is designed for this specific use case and forms candles on its own. Being a DB, persistence is taken care of.
2. If you are happy with candles in the form of a Python dict, there is a simpler and faster solution. Simply build a dict for your ticker yourself with 1-minute timestamps as keys (e.g. 915, 916, 917) and the relevant data in the values as a namedtuple/nested dict: (open=10, high=12, low=8, close=10, volume=1111). When a new set of data comes from the WebSocket, loop through the incoming data and, for your symbol, pick up the timestamp of the incoming tick, strip out the minute part, and using that minute as the key, update the parameters in the dict. If there is no data for open in that minute, update all fields with the new tick. If there is existing data and the new tick is higher/lower than the high/low fields in the dict, update those as well. Finally, always update the close with the new tick. So you have real-time candles. Say you are in the 920 minute: when the timestamp of a new tick has 921 as its minute part, it will automatically switch to the 921 candle. Working with 700 tickers, this has run in real time till EOD. Further, you have the live candles while they are being formed, rather than the candle appearing only once at the end of the minute. Those willing to put in the effort can get real-time values of technical indicators by adding more fields to the namedtuple and populating them in real time by inserting the required formulae in the loop. (A small sketch follows after this post.)
Further, if you look at the Zerodha client library for the WebSocket, what it does is receive a big blob of data from the Zerodha server, which contains data for multiple tickers. It then loops over each row to extract the data as a dict and hands a list of dicts to the user. The candle-formation algo (as per 2) can be appended to the loop in Zerodha's client library to save additional time.
The issue is that while you have real-time candles in a dict, when you want to manipulate the data in array-like structures such as DataFrames or numpy, the conversion of dicts to these structures eats 1-2 seconds, and more as the day progresses or the number of tickers increases, which again is on the higher side.
If someone can point to an in-memory data structure which can be used by both dict and pandas (with blazing-fast retrieval in pandas), then maybe this can turn out to be a simple and viable option for processing in pandas. For those willing to live with a few seconds of delay in getting candles into pandas, this can be easily implemented.
There is no persistence here but this being a home grown solution it can be added.
I have found both solutions easy and practical, subject to the latency in getting candles into pandas.
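A rough sketch of the dict-based candle formation described in point 2 above (the tick field names 'instrument_token', 'last_price' and 'exchange_timestamp' are assumptions about the incoming WebSocket payload, not the poster's actual code):

candles = {}  # candles[token][minute] -> {'open': ..., 'high': ..., 'low': ..., 'close': ...}

def on_ticks(ws, ticks):
    for tick in ticks:
        token = tick['instrument_token']
        ltp = tick['last_price']
        minute = tick['exchange_timestamp'].replace(second=0, microsecond=0)
        bucket = candles.setdefault(token, {})
        candle = bucket.get(minute)
        if candle is None:
            # first tick of this minute opens a new candle
            bucket[minute] = {'open': ltp, 'high': ltp, 'low': ltp, 'close': ltp}
        else:
            candle['high'] = max(candle['high'], ltp)
            candle['low'] = min(candle['low'], ltp)
            candle['close'] = ltp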
@sauravkedia Quite useful information. Your posts are always very informative. Instead of starting a new thread I am joining this one with my queries, as they are quite relevant to candle generation and storing data. I have a question on the ongoing discussion about the choice of an appropriate DB. My requirement is very limited: I want to subscribe to only 3 or 4 tokens. I plan to use the following workflow:
1. Collect ticks between T minutes and T + delta minutes, where delta is the resampling frequency. T and delta are integers.
2. Convert the ticks to a pandas DataFrame with the timestamp as index and then resample to get the high and low of the candle.
3. Process the strategy with the help of pandas tools. Execute the strategy.
4. Write to CSV files (one per token) at regular intervals, say every 30 minutes. I do not want to query the CSV; it will be read only once, at the start, to load data.
5. Instead of using multithreading I intend to use Redis as the message broker and Celery as the task manager.
Q1. Does this appear to be OK for my type of requirement involving limited tokens? The use of pandas, the Redis-Celery combo and CSV...
Q2. Since I plan to use "for tick in ticks" inside the on_tick method, will code entrusted to Celery be executed on receipt of every tick? For example, if my candle frequency is 15 minutes, the relevant resampling code need not be executed on receipt of every tick.
Hope to receive your valuable comments.
Thanks @RP3436. I am trying to get a sense of what you are trying to do; however, the devil is in the details and I am not able to grasp the complete picture. To answer in general:
a. Simple is better than complex. A simpler algorithm is better than a complicated one. Your algo of collecting ticks and then resampling at the end of the delta sounds simple and practical, and something which can be rolled out rather quickly.
b. Resampling data in pandas is performant. Getting a real-time stream of data into pandas as it arrives isn't. From what I understand, for your use case requiring a few tokens, this should be fine.
c. Another issue is latency. If you are operating with 1-minute candles, you may want to form the candles, run the strategy and place orders within 1-2 seconds of the minute getting over. As your periodicity increases, you will be able to put up with more time lost in calculations.
Overall, your idea is practical, something which a large majority might already be doing. I suggest you go ahead and experiment with it.
d. On writing to CSV files: I would imagine that if data is being written, it is envisaged that it may be read in future. Generally speaking, databases are more efficient here as they allow you to query, even if it's a one-time load at startup. If databases mean a significant learning curve, then CSVs are a great second choice. With CSV you may have to think about issues like directory hierarchy: if you need to load data by day then maybe you can organize it in a Year > Month > Day > ticker.csv structure or as simple ticker_yyyy_mm_dd.csv files. Also be careful that CSV presents some (minor) issues with datetime fields when read into pandas. If CSV is the way to go, I suggest you look at the IO documentation in pandas to check whether it can be easily replaced with more performant alternatives like feather or pickle.
e. Celery paired with Redis is a great choice which I am using quite happily. But I am not clear on how you are using it: if you put a Celery task in the on_tick method it will be executed for every tick. I guess you want to run it only when a candle is fully formed. That should be easy to achieve. I imagine you would put some conditional logic in your code for the app to decide that the current candle is over and the next one has started. It could be based on your computer's clock, e.g. at 9:30 the 9:15 15-minute candle ends and the next one starts, and so on. You can embed this logic inside the on_tick method and call the Celery task only when a candle ends (a rough sketch follows). This should be easy to achieve once you put your mind to it.
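A rough sketch of that conditional logic with a Celery task, under a few assumptions (broker URL, 15-minute candles, an 'exchange_timestamp' field on the ticks); it is only an illustration, not RP3436's or sauravkedia's actual code.

from celery import Celery

app = Celery('candles', broker='redis://localhost:6379/0')  # assumed broker URL

@app.task
def process_candle(candle_start_iso):
    ...  # resample the stored ticks, run the strategy, place orders

current_candle_start = None  # start time of the 15-minute candle being built

def on_ticks(ws, ticks):
    global current_candle_start
    now = ticks[0]['exchange_timestamp']
    candle_start = now.replace(minute=(now.minute // 15) * 15, second=0, microsecond=0)
    if current_candle_start is None:
        current_candle_start = candle_start
    elif candle_start != current_candle_start:
        # the 15-minute boundary has rolled over: hand the finished candle to Celery
        process_candle.delay(current_candle_start.isoformat())
        current_candle_start = candle_start
    # ...keep accumulating ticks for the running candle here...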
@sauravkedia Great explanation. BTW, I am a researcher on NIFTY and an occasional trader. I decided to learn Python about 3 months back and have learnt a lot online and from forum discussions. My idea is to handle streaming data, be it stock data, tweets, or anything from sensors (bacteriological, environmental, etc.). So thanks again for replying to my queries. Kind regards.
@sauravkedia: sorry for the late response to your reply of 27th Feb. First of all, I do not understand why it takes you 2 seconds to load data into pandas. The Zerodha WebSocket library gives you a list of dictionaries; each dictionary in the list is a tick for one individual instrument. Once you get the tick list, you should first break it down into its individual components per stock/instrument and then process it further. I guess you are loading the entire tick list into a pandas DataFrame and trying to process it using pandas. That's the wrong way to go about it.
The way I have written my system:
Script one gets the raw tick list and writes it to the DB. That's all it does.
Script two reads the ticks in sequence, splits them by instrument, and updates the relevant individual stock's queue in the DB.
Script three reads the ticks from the individual stock queues in the DB, converts them to a pandas DataFrame, generates 1-minute candles, and writes the candle data as a CSV string to another queue in the DB.
At this stage I have one-minute candles in CSV format ready to be used by any script that needs candles, whether for testing, actual trading, etc. All three scripts run in parallel and the result is near-real-time generation of the candles in CSV format. I have timed it: say the clock turns over to 10:14:00, then at 10:14:01 or 10:14:02 I have the OHLC candle for 10:14.
The entire process takes milliseconds. I do not understand why it's taking two seconds for you; it's obviously to do with your logic. There is vast scope for performance improvement. It has nothing to do with pandas, and more to do with the way you are using pandas. Sorry there, I don't mean to put you down; that's not the intention.
Regarding InfluxDB: I looked at InfluxDB and ruled it out immediately as it's a commercial product that needs a license. As an individual user I cannot afford licenses, and it does not make sense when there is a plethora of open-source tools to choose from.
Your point no. 2 about generating candles using in-memory Python dicts is totally off the mark, as it is inefficient, complex to code and prone to logic errors, whereas pandas is designed to do this very thing accurately and efficiently. It needs a lot of thought in the logic etc., especially the performance improvement. It's taken me quite a bit of time, effort and iterative development to get this far.
Here is a screenshot of my candle generation time for one stock that I ran just now. This is at the fag end of the day, at 3:28, when it has collected 23878 individual ticks for ADANIENT.
This includes retrieving each new tick as it comes in from the DB, appending it to a list of dicts, loading that data into pandas, generating the OHLCV, converting the pandas OHLCV into a CSV string, and then writing the 1-minute candles in CSV format to the DB, all in about 0.2 seconds on average. That's still a hundred thousand microseconds too many. In fact, as I was writing this answer I realised that I could probably shave off a few hundred microseconds on top of this as well as reduce the memory usage.
@RP3436 You can in fact skip Celery too; it's not needed. If you are subscribing to just 4-5 instruments, all you need is Python, the kiteconnect library, Redis and pandas. The efficiencies are all in the way you code. Unfortunately, writing code is many times more complex than even chess, meaning there are even more ways of achieving the same result, and as sauravkedia mentioned, the devil is in the details. Sometimes, instead of looping over a data set repeatedly with a for loop, if one converts it into a dictionary and does a key lookup/reference as shown in the following code snippet, one can reduce lookup times from 2 seconds to 3 ms. I did that for the instruments list, mapping instrument token to symbol name.
try:
    symbol = token_to_symbol[instrument_token]  # key is the instrumentToken, value is the SymbolName
except KeyError:
    pass  # handle the case where the key is not found
Convert it to a dictionary once in the morning, pickle it and load the pickle when needed.
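A minimal sketch of that morning job, assuming an authenticated KiteConnect object and the standard fields returned by kite.instruments(); the file name is arbitrary.

import pickle

# run once in the morning
instruments = kite.instruments()  # list of dicts, one per instrument
token_to_symbol = {row['instrument_token']: row['tradingsymbol'] for row in instruments}
with open('token_to_symbol.pkl', 'wb') as f:
    pickle.dump(token_to_symbol, f)

# later, wherever the mapping is needed
with open('token_to_symbol.pkl', 'rb') as f:
    token_to_symbol = pickle.load(f)
# instrument_token here would come from the tick being processed
symbol = token_to_symbol.get(instrument_token)  # O(1) lookup instead of scanning a list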
@MAG Thanks for taking the time to explain this. Your algo seems interesting enough; however, I am not able to grasp it completely, especially what exactly script two is doing when script one is already writing data into the DB. What is the concept of a "stock queue" here? Are you using multiple tables in the DB, one table per stock, for script two? If so, and if, say, you get 100 data points in a dump, will writing 100 lines into the DB one by one be performant? Which DB are you using, etc.?
Your screenshot is exciting. If you can share a simple implementation (even pseudocode), that will be great; otherwise we will get lost in the same devil-in-the-details crack.
Additionally, Influxdb is free to use and I am using their free version, though they do have a paid enterprise version as well.
Further, I found the dict method easier to implement, the reason being that it's pure Python and doesn't require one to deal with an external database (especially for those who don't have experience with databases). Coding the OHLCV values is easy, and being a dict, it's super fast. But I would agree that one has to give some thought to getting the right structure for the dict. I used a nested structure like tickers['ACC']['2018-02-07 9:25:00']['open'].
@sauravkedia, script1 gets ticks from the websocket. For the V2 API it's a list of up to 200 stocks in each tick. Something like:
tick = [
    {"instrumentToken": 1, time: 9:15:00.200, other data...},
    {"instrumentToken": 2, time: 9:15:00.200, other data...},
    {"instrumentToken": 3, time: 9:15:00.200, other data...},
    {"instrumentToken": 4, time: 9:15:00.200, other data...},
    {"instrumentToken": 5, time: 9:15:00.200, other data...},
    ...
]
Script2 splits it into individual queues, one per instrument token:
instrument1 = [
    {"instrumentToken": 1, time: 9:15:00.200, other data...},
    {"instrumentToken": 1, time: 9:15:00.400, other data...},
    {"instrumentToken": 1, time: 9:15:00.800, other data...},
    {"instrumentToken": 1, time: 9:15:01.200, other data...},
    {"instrumentToken": 1, time: 9:15:01.300, other data...},
    {"instrumentToken": 1, time: 9:15:01.500, other data...},
    {"instrumentToken": 1, time: 9:15:01.900, other data...},
    {"instrumentToken": 1, time: 9:15:02.150, other data...},
    {"instrumentToken": 1, time: 9:15:02.430, other data...},
    {"instrumentToken": 1, time: 9:15:02.790, other data...},
    {"instrumentToken": 1, time: 9:15:02.970, other data...},
]
and likewise instrument2, instrument3, and so on, each holding the ticks for its own instrumentToken with the same timestamps.
script3 takes the list of dicts per instrument, one at a time, and converts it into OHLC as follows:
import pandas
pd2 = pandas.DataFrame( instrument1 )
pd2['time'] = pandas.to_datetime( pd2['time'] )       # resample() needs a DatetimeIndex
ticks = pd2.set_index('time')[[ 'ltp' , 'volume' ]]   # .ix is deprecated; select columns after setting the index
priceOHLCV = ticks.ltp.resample( '1min' ).ohlc()
candledata = priceOHLCV.to_csv()  # converts the pandas dataframe candle data to csv format, written to db, which can be easily processed further
*still learning about pandas so maybe I can do this even more efficiently in the future.
" If so, and if say, you get 100 datapoints in dump, will writing 100 lines into db one by one be performant" I use a combination of mongodb and redis. There are places where each shines.
The point is I broke it down into individual steps and individual components.
The first goalpost or milestone was to see if I could get and save data in real time.
The second was to see if I could split it into individual instrument queues in real time.
The third was to see if I could get one instrument's OHLC generated in near real time.
Once you can do one, that means your logic is solid, and then it's just a matter of whether the solution needs more compute resources (threads/cores and memory) than your current system provides.
But if you can't even generate candles for one instrument in near real time, then the whole logic and dev effort is for naught. Currently I can do a handful in near real time, so my logic is solid. I haven't tried hundreds; maybe my system will choke when that happens, in which case I can solve the computational issue separately:
Maybe set up a cluster of raspberry pi's.
Maybe get a nvidia 1080 gpu and use CUDA or OpenCL.
Or if I have money to spare, splurge on a threadripper 1950x with samsung pro nvme's and 128G mem.
Hell even get a bunch of cheap old phones and use them for compute if I am adventurous enough. :-)
@MAG Thanks for the nice explanation. Your suggestions are well understood and taken. I am using Celery to take care of scalability in the future. Before switching to Celery, I read comparisons between Celery and other alternatives and found Celery scoring reasonably high. But yes, for fewer scrips, we can use other methods advantageously. I collect all the ticks in the main thread and entrust Celery with the rest of the work: grouping (based on token), resampling + analysis using pandas, and writing to CSV every "N" minutes. So I save only processed data (with TIs) and not the raw ticks. However, if some latency is permissible, I think saving the data to a DB and then processing at will should be an option, provided simultaneous reading from and writing to the DB is taken care of by the choice of DB. Which DB do you use?
To simulate live streaming data, I used a historical data set (1-minute candles) and put it in a for loop that releases one data point at a random time interval; that becomes my tick (a small sketch follows below). So even on market-off days I could test my code on streaming data. This was very helpful to me.
Your suggestions on the hardware part are quite interesting. Maybe this can be discussed in a separate thread in more detail. Regards
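For anyone who wants to copy the replay idea above, here is a minimal sketch; the file name, column names and the handler are placeholders, not RP3436's actual code.

import random
import time
import pandas as pd

candles = pd.read_csv('nifty_1min.csv', parse_dates=['timestamp'])  # hypothetical file

def on_fake_tick(row):
    ...  # feed the row into the same code path your real on_ticks handler uses

for _, row in candles.iterrows():
    on_fake_tick(row)
    time.sleep(random.uniform(0.1, 1.0))  # release one data point at a random interval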
I have worked with this algo before and it works well. It is an obvious first choice: keep collecting ticks in the form of dicts, and when candles are required, convert to pandas and then use the excellent pandas resampling to get candles. For a few tickers this is the go-to approach and there is no need to look beyond it.
I am not paying too much attention to persistence here as 1. it can be added easily, as you have done, and 2. it may not be required, since the candles can be fetched using the historical data API.
However, 1. I wanted to scale it up to a few hundred tickers, 2. I wanted to have the candles as they are being formed and not only once at the end of the period, and 3. all this while keeping all candles real-time or close to real-time. I had trouble scaling this up with the existing infrastructure.
So I tried two alternative approaches: 1. InfluxDB and 2. forming candles in real time the way I explained earlier. I am happiest with the last algo and likely to stick to it.
@MAG: Thanks for sharing the information, it is really helpful. Can you please also share the number of instrument_tokens (symbols) you are working with, so that we can get a sense of this real-time performance with regard to the number of symbols as well?
@sauravkedia: can you please shed more light on your statement "I had trouble scaling this up with existing infrastructure"? Did you have trouble scaling it to 100/200/500/1000 tickers, and what was the latency you observed?
@naz, I am creating live 1-minute candles for over 400 instruments daily, from tick data. The tick data processing takes just a few microseconds, and the candles are ready to be used anywhere in my code. If you use Python effectively at the data-structure level, you will be able to make it: no latency, no overheads. Give it a try and go deeper into it; you will find a solution that suits your requirement. Stay away from "easy to use" but "resource intensive" libraries and features.
@Shaha Thanks for the info. Just wanted to understand whether you are using pandas/numpy or only native Python data structures like dict/list. Also, it looks like you are creating candles on the go and not by first saving ticks to some database and then using a parallel script to create candles. Please correct me here.
"Regarding inlfluxdb I looked at influxdb and ruled it out immediately as its a commercial product that needs a license. As an individual user, I cannot afford licenses and it does not make sense when there is a plethora of open source tools to choose from. " Influxdb has got an open source version without licensing restriction as far as I can see also which is what I'm using. I have been using it for about a week now. Can you please confirm if there is really a license on it?
@Imran, hi Imran, how do you convert ticks to candles if we have more than one scrip? I'm trying with multiple scrips but the OHLC data is all mixed up.
@sgsaran I have done that also, but I'm still not getting the result; maybe I'm doing something wrong. Can you please give a little more info on how to do that?
I am struggling with one part: the volume piece of it.
Every tick is 1 second apart. How do I correctly compute the total volume for the 1-minute candles formed from these 60 ticks in a minute? I mean, which field should I use and keep accumulating over the 1-minute period?
@Debsush I hope this helps:
df.resample('5min')['volume'].sum()
where:
df = the received websocket data converted to a DataFrame (with a datetime index)
['volume'] = selects only the volume column from df
.sum() = adds up the volume over the desired timeframe
I am creating 15-minute candlesticks using a queue in on_ticks and adding the tick data to a pandas DataFrame for 50+ scrips. Whenever the time in the tick** >= the last index value + a timedelta of 15 minutes, I create a new row and add data to it.
At the end of the day, I find mismatches between my candles and the candles available from the historical data API.
Is it normal to have discrepancies? Since Zerodha provides 1 tick/second data, isn't it possible that we are missing out on data from other ticks within the same second?
**For the time in the tick, I tried separate code for "last_traded_time" and "timestamp" and found errors in both. Which time should I use for comparing/entering the tick into candles: last_traded_time or timestamp?
P.S. I am not missing any ticks, as I have a try/except block and logging.exception in place. In addition, my code is running on Google Cloud, so the internet connection is also not an issue.
@MAG Did you get a chance to create the blog post which you mentioned here? I'm asking because I have finally come to a point where I need an in-memory DB.
3 months back I started with the Zerodha API and was using CSV, then moved to MySQL and looked at other time-series DBs; then I found in my tests that a file-based SQLite DB is faster than the other DBs, so I have been using SQLite for a couple of months now. However, I now see that I need to move to an in-memory DB for ticks, so that I can convert them to 1-minute candles faster and store those in the SQLite DB. So I have been looking at Redis and SQLite's in-memory DB. If Redis takes longer to convert from str to a Python object, would SQLite in-memory be faster?
Edit: I have another query:
When you convert ticks to candles at regular intervals from the in-memory DB to another DB, then to get the latest candle info for your algo/code calculations, would you pull, say, 1-minute data from the DB and append the latest info from the in-memory DB?
For example: I have stored 1-minute candles from 09:15 to 09:26 but want the latest info at 09:28, so I pull data from the DB until 09:26 and append the 2 latest candles (by resampling) from the in-memory DB?
@neel2323 Yes, it's normal to have a few discrepancies between historical data and the candles you prepare. It happens for multiple reasons, including different system times on Zerodha's servers and your server, some missed ticks, or the methodology used to create candles. A good strategy will not be impacted by such small discrepancies, so relax, don't worry much about picture-perfect matching with other platforms' candles, and focus more on the strategy part.
@kiteapi You seem to have a lot of overhead from routing it through a DB. As per my earlier posts and the code posted in this thread, a simple dictionary structure can also act as a pseudo in-memory DB, and the bonus is that candle retrieval is also fast. Very lightweight entry and retrieval.
In other words... I prefer tick --> candle dictionary --> DB instead of tick --> DB --> candles.
I just implemented what I had mentioned above: tick ---> tickDB ----> OHLCcandles (this is one process which does it periodically and stores the candles). Another process fetches the candles from the OHLCcandles DB and stitches/concats the last bar from the tickDB. After a lot of optimisations, candle retrieval and concat is happening in 0.1454245 seconds (if I construct the same number of candles directly from the tickDB it takes around 0.4545436 seconds). I try to avoid pandas as much as possible, but the above operations are all with pandas.
Will try to implement what you have suggested here. Some questions:
1. I see you have mentioned in this thread that you do this (http://ezeetrading.in/Articles/Candles_formation_from_tick_data_zerodha.html) for about 400 instruments; how often do you save the day's 1-minute dictionaries to the DB? 2. And to save the dictionary to the DB do you use another thread? It can't be another Python process, as the dictionary would not be accessible. 3. Once you have dumped the 1-minute candles from the dictionary to the DB, do you delete the key/value pairs for the entries copied to the DB, or do you keep the candle dictionary throughout the day and let it grow in memory? If so, for 400 instruments, how much memory does it take? 4. Say your code needs data for more than today: would you then fetch the data from your DB and append today's data from the dictionary? If that is the case, a new program that needs the data cannot access the dictionary unless it is spawned as a different thread in the same process, right?
Please bear with my questions, I am not a typical programmer (thanks to Google and Stack Overflow). I am sure I will have more questions.
@kiteapi 1) At the end of the day, for record-keeping only. 2) You can easily access it by making the variable global. 3) Negligible; I run all my code, with over 700 instruments and over 80 strategies, on a 1 GB RAM AWS server, which comes under the free tier. 4) Read the previous days' candles into your global variable in the morning before the market opens.
It's just basic knowledge of programming and your preferred programming language that makes the task simple. Sometimes we overthink the "how to?" in the wrong direction and make things complex.
@Shaha Thank you for the reply. I get around 5% errors in the 15-minute candle OHLC at EOD when I use the "timestamp" variable from the tick. However, this error increases to around 7% when I take the "last_traded_time" value. This is for around 70+ stocks. What % of errors do you get in your code?
The reason these bother me is that my technical parameters get affected, especially in terms of crossovers.
@Shaha So true, we do make things complicated and slow.
1. Ok. 2. Ok.
3. Wow! 700 instruments and 80 strategies and on 1GB. Nice! Ah! right, you are not saving tick data, but 1 min candles in dict, so it would not take much space, sweet! Good to hear the free tier is able to handle 700 instruments and 80 strategies. Nice.
4. As you are importing 1-minute candles for 700 instruments, again it won't take much space in RAM, but it can be accessed at faster speeds than from a DB. Wow!
I really appreciate your inputs, will try to implement them, if any queries will post them here for others benefits.
if Dict1[symbol] != Dict2[symbol]:
    ...  # the entry for this symbol has changed
if Dict1[symbol] == Dict2[symbol]:
    ...  # the entry for this symbol is unchanged
I saw a lot of comparisons done by looping through each of the key/value pairs, and some comparisons of hashes of the two dictionaries, but it seems the simple == operator works.
I never considered a need to store candles/ticks into a formal database (SQL). I do not consider a third-party DB to be an efficient approach in a real-time scenario.
Create an in-memory database using a list or dictionary per token for incoming ticks (call it the incoming DB). At the period boundary, copy/append the incoming DB to the corresponding in-memory database (list or dictionary) for history (call it the historical DB). The historical DB can then simply be pickled to, and loaded from, the file system.
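A minimal sketch of that pickling step; the structure of the historical DB is assumed as described above.

import pickle

historical_db = {}  # e.g. {token: [[ts, open, high, low, close, volume], ...]}

def save_db(path='historical_db.pkl'):
    with open(path, 'wb') as f:
        pickle.dump(historical_db, f)

def load_db(path='historical_db.pkl'):
    with open(path, 'rb') as f:
        return pickle.load(f)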
@neel2323 What you are getting is not an error; it's perfectly alright to get slightly different candle values. Don't worry much about it as long as you are processing all the ticks received without blocking the on_ticks receiver thread. The code I have posted is lightweight; it will not block the main thread.
This is a pretty interesting thread. Can someone tell me how much data one actively fluctuating instrument generates per day if every tick is stored? Assume I store all the data in a CSV file.
@zeeshaanbond For 750 liquid instruments it's 2 GB per day when saved as CSV without any compression; thereafter one can archive it using 7zip and the archive size comes down to less than 200 MB.
I am using this approach: 1) collect LTP, token and DateTime, 2) dump to CSV, 3) read from CSV and resample everything using the resample() method, 4) pass the DataFrame to the strategies.
All of this takes place within a loop. Can you suggest a better technique, as this is producing some inaccurate OHLC data?
Or is this inaccuracy due to missing ticks in between?
@akshay07 One should use the main data-reception thread for the bare minimum of work.
Unless you can calculate the indicator using lightweight operations with np.array, keep the indicator calculations in a separate thread.
Almost every algorithm anyway needs indicator values at the close of the candle, so calculating indicators on every tick is likely to be a waste of resources in most cases.
One may have a clock or scheduler firing every 1 minute in a separate thread and do the post-processing of candles there.
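A rough sketch of such a once-a-minute scheduler thread; process_candles is a placeholder for whatever post-processing (indicators, strategy) you want to run at each minute boundary.

import threading
import time

def start_minute_scheduler(process_candles):
    def loop():
        while True:
            # sleep until the next minute boundary, then post-process candles
            time.sleep(60 - time.time() % 60)
            process_candles()
    t = threading.Thread(target=loop, daemon=True)
    t.start()
    return t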
@amitchugh It depends on how you are running your scripts and on what OS. I use Linux, and the simplest way is to run each script in a separate terminal window. Another option is to start the scripts in background mode (python myscript.py &). Since I run these scripts every day, the process is automated: I run them through the cron scheduler to launch automatically at 9:05 am and they run in the background. I have implemented logging, and I can look at the log files to know whether my scripts have launched, when they launched, and what they are doing.
You said you run 3 scripts, and the 3rd script makes candles.
For example, when you are making 1-minute OHLCV candles from ticks: at a particular time, do you rebuild all the candles again to get the data of the last candle, or do you build only the last candle and append it to the candles DataFrame?
@amitchugh If you read my comments you will see that I use pandas to create the candles so it's simpler to reprocess the whole ticks list and recreate all candles and overwrite the previous candle data instead of segregating the data and trying to create only the last candle. Search for the line containing ticks.ltp.resample( '1min' ).ohlc() in this thread. The exact code is out there for anyone to read and copy.
@amitchugh I am a Unix/Linux professional. Unix/Linux is the best OS on the planet because it follows this tenet: "Keep it simple, stupid", abbreviated as KISS.
Putting multiple functionalities into one script will lead to unnecessary complications. 1. The more complicated the code, the more the chances of things going wrong. 2. And when things go wrong, the more effort it takes to trace and rectify the fault without breaking something else. 3. Also, when functionalities are kept separate, one script could fail and only that functionality would fail; the others would continue to operate.
@npchoubey You wouldn't make such comments if you had any background in writing/operating/managing large-scale systems. You have no idea of the scale and complexity of the systems needed to make something like Zerodha operate the way it is operating right now; I do, because I work for one of the largest internet brand names in India. Also, the historical API does not need to be efficient in terms of the time taken to generate candles, because it is meant for historical look-ups; it is not intended for real-time trading, especially intraday. Can Zerodha make it efficient and generate candles in real time? Yes, they can, but it would probably require three times the hardware investment, and datacenter resources do not come cheap.
@MAG Thank you so much for your kind response. Little things make a big difference, and a professional can guide us this way. Thanks a lot; lucky I found you for guidance.
(Ignore it, solved. If there is a better way to do it, please guide.) Hi again, I need guidance.
I am saving ticks in a DataFrame df which has 2 columns, time and ticks.
Then I convert the time column to a DatetimeIndex and set it as the index column, so that I can resample it:
df1 = df
df1['time'] = pd.to_datetime(df['time'])
df1 = df.set_index('time')
Then I resample it and make candles:
df1 = df1.resample('1T').ohlc().dropna()
I got the desired output, but in the wrong format. In the output, ticks is the only top-level column and time, open, high, close, low are sub-columns of ticks.
If I want the value of open, I have to write df1.ticks.open instead of df1.open.
How do I make it a normal DataFrame table like the one below?
time | open | high | low | close
...  | value | value | value | value
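Not from the thread, but one common way to get flat columns (assuming df has 'time' and 'ticks' columns as described): resample the 'ticks' column as a Series, or flatten the MultiIndex afterwards.

import pandas as pd

df['time'] = pd.to_datetime(df['time'])

# Option 1: resample the 'ticks' column as a Series; the result already has
# flat open/high/low/close columns, so you can write ohlc.open directly
ohlc = df.set_index('time')['ticks'].resample('1T').ohlc().dropna()

# Option 2: keep the existing MultiIndex result and drop its top level instead:
# df1.columns = df1.columns.droplevel(0)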
Hi again. Somehow I am able to paper trade 1 scrip with my system successfully. Now I want to do it with more than 1 scrip to get better results from this system. I am using tick data and am not able to understand how to differentiate the ticks of different instruments; every time the ticks result is different, and it does not provide ticks in any fixed sequence of instruments. Can you please help me out with how I can achieve that with the help of a DataFrame? Any lead, reference or idea will do.
@sujith
If I subscribe to 3 tokens, sometimes it gives data for 1 token and sometimes it gives data for 2 tokens, so I am not able to save it to different DataFrames.
Can you please tell me, in terms of DataFrames, how to differentiate each tick in the live market and how to save each tick to its respective DataFrame?
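Not an official answer, but the usual pattern is to route each tick by its instrument_token rather than relying on the order or length of the list. A rough sketch (token1/token2/token3 and the field names are placeholders):

import pandas as pd

tick_store = {token1: [], token2: [], token3: []}  # your subscribed tokens

def on_ticks(ws, ticks):
    # ticks is a list of dicts in no fixed order; route each one by its token
    for tick in ticks:
        token = tick['instrument_token']
        if token in tick_store:
            tick_store[token].append({'time': tick['exchange_timestamp'],
                                      'ltp': tick['last_price']})

def frame_for(token):
    # build the per-instrument DataFrame only when you need it
    return pd.DataFrame(tick_store[token]).set_index('time')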
Also, looking at your code "df1 = df", I would recommend the links below to get acquainted with Python (and DataFrames in your case) before you dig deeper into automating your strategy with Kite Connect. Some of these are well-maintained official documentation and will help you build a good knowledge base and speed up your code development.
@milli, as long as you are not into high-frequency trading, 1-minute candles are enough to trade with and you can trust Zerodha for that. 1-minute candles can either be fetched using the historical API subscription or by writing a little scraping code after looking into the web console of the browser (I won't post it here as it is "illegal" to scrape). There are N number of people using scraping to read candles and M number of people who use scraping to execute orders as well.
For the record, I personally use the historical API subscription.
@milli Please make the effort to go through the Nov 2019 posts in this thread.
@npchoubey The historical API is not ideal for live trading. For live trading one has to follow the WebSocket approach, as is mentioned repeatedly on this forum. A basic WebSocket usage example is also documented in the official Kite Connect documentation: https://github.com/zerodha/pykiteconnect.
Is there anybody here whose candles formed from tick data exactly match (volume and close price) the Kite API candles? In my case they don't match for some of the scrips.
@dipgupta1986 They won't match and that's fine; your candles are also correct. A good algorithm's performance doesn't get hampered by a slight difference in candle OHLC.
Nithin Kamath has explained this difference in detail on tradingqna.
How do I calculate the total traded volume using tick data at the end of every minute? Do I need to sum the last traded quantity across all the ticks received in a minute? My doubt is because I can see the same traded quantity in multiple received ticks.
The volume you receive is a cumulative value, i.e. the volume traded till that second from the start of the trading session. You need to subtract the first tick's volume from the last tick's volume.
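In code, that subtraction looks roughly like this; the cumulative volume field is named 'volume' or 'volume_traded' depending on the client version, so treat the field name as an assumption.

def candle_volume(ticks_in_candle):
    # each tick carries the session's cumulative volume, so the quantity traded
    # within the candle is the difference between the last and first values
    return ticks_in_candle[-1]['volume'] - ticks_in_candle[0]['volume']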
@sujith, if 2 ticks are received with the same timestamp, does that imply that the extra one is a duplicate tick?
Ex:
{token: 408065, Latest Volume Today: 6040361.0, Last Traded Time: Thu Jan 06 15:18:00 IST 2022, Tick Time Stamp: Thu Jan 06 15:18:00 IST 2022}
{token: 408065, Latest Volume Today: 6040361.0, Last Traded Time: Thu Jan 06 15:18:00 IST 2022, Tick Time Stamp: Thu Jan 06 15:18:00 IST 2022}
@Shaha You mentioned that "the historical API is not ideal for live trading". Can you please explain why that is, exactly? I see in a lot of places that some people are using the historical API for live trading. Also, when is historical API usage recommended?
@rakeshr Thanks for the thread link, I now understand the reason. It is also mentioned that while creating candles from WebSocket data and storing them in a database, "if one of the database write operations lags then it will affect the following writes also." So, should the database write operations be performed in a separate thread each time (every minute) and that thread closed each time? I have one more question: should we store each one-minute candle in a persistent DB like MongoDB or Postgres, or just in memory? What is the preferred approach?
DB is not essential. I maintain candlesticks in an in-memory list of lists. See my comment above. The in-memory DB is pickled to a file before exiting.
For anyone still struggling to convert ticks into candles of different timeframes, here is how I did it, using different threads to process ticks, form candles, and save them into a datastore if required, for parallel processing.
I want to do computation on each and every tick. Can someone guide me on whether my approach is correct or not? I use a thread to process the data. Below is the code snippet.
import threading
import queue

# assumed: the websocket on_ticks callback feeds this queue
data_queue = queue.Queue()

def on_ticks(ws, ticks):
    # keep the receiver callback light: just hand the ticks to the worker thread
    data_queue.put(ticks)

def on_connect(ws, response):
    ws.subscribe(FandOInstrumentsSubscribe)
    ws.set_mode(ws.MODE_FULL, FandOInstrumentsSubscribe)

def on_close(ws, code, reason):
    # On connection close stop the main loop
    # Reconnection will not happen after executing `ws.stop()`
    ws.stop()

def process_data():
    while True:
        data = data_queue.get()
        for tick in data:
            print(tick)  # do the computation with the data here

thread2 = threading.Thread(target=process_data)
thread2.start()
And regarding the bulk fetch API discussed earlier: is 1 bulk API call = 1 API call?
1 bulk API call is still one request to the server.
Hopefully it helps.
Regards
Saurav
So thanks again for replying to my queries.
kind regards.
First of all I do not understand why it takes you 2 seconds to load data into pandas.
Zerodha websocket library gives you a list of dictionaries. Each dictionary in the list is a tick for one individual instrument. Once you get the tick list, you should first break it down into its individual components per stock / instrument and then process it further. I guess you are loading the entire tick list into a pandas dataframe and trying to process it using pandas. That's the wrong way to go about it.
The way I have written my system: all three scripts run in parallel and the result is near real-time generation of the candles in csv format. I have timed it. Say the clock turns over to 10:14:00; at 10:14:01 or 10:14:02 I have the ohlc candle for 10:14.
The entire process takes milliseconds. I do not understand why it's taking two seconds for you. It's obviously to do with your logic; there is vast scope for performance improvement. It has nothing to do with pandas. It's more to do with the way you are using pandas. Sorry there, I don't mean to put you down. That's not the intention.
Regarding influxdb
I looked at influxdb and ruled it out immediately as its a commercial product that needs a license. As an individual user, I cannot afford licenses and it does not make sense when there is a plethora of open source tools to choose from.
Your point no. 2 about generating candles using in-memory python dicts is totally off the mark, as it is inefficient, complex to code and prone to logic errors, whereas pandas is designed to do this very thing accurately and efficiently. It needs a lot of thought in the logic, especially the performance improvement. It's taken me quite a bit of time, effort and iterative development to get this far.
Here is a screenshot of my candle generation time for one stock that I ran just now. This is at the fag end of the day at 3:28 when it has collected 23878 individual ticks for ADANIENT
This includes retrieving each new tick as it comes in from the db, appending it to a list of dicts, loading that data into pandas, generating the ohlcv, converting the pandas ohlcv into a csv string and then writing the 1-min candles in csv format to the db, all in about 0.2 seconds on average.
That's still a hundred thousand microseconds too many. In fact, as I was writing this answer I realised that I could probably shave off a few hundred microseconds on top of this as well as reduce the memory usage.
You can in fact skip celery too. It's not needed. If you are subscribing to just 4-5 instruments, all you need is python, the kiteconnect library, redis and pandas. The efficiencies are all in the way you code. Unfortunately, writing code is many times more complex than even chess, meaning there are even more ways of achieving the same result and, as sauravkedia mentioned, the devil is in the details.
Sometimes, instead of looping over a data set repeatedly using a for loop, if one converts it into a dictionary and does a key lookup/reference as shown in the following code snippet, one can reduce lookup times from 2 seconds to 3 ms. I did that for the instruments list - mapping instrument id to symbol names. Convert it to a dictionary once in the morning, pickle it and load the pickle when needed.
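A minimal sketch of that idea (assumption: kite is an authenticated KiteConnect client and the pickle file name is illustrative): build the token-to-symbol dict once in the morning, pickle it, and do O(1) lookups instead of scanning the instrument list on every tick.

import pickle

token_to_symbol = {row["instrument_token"]: row["tradingsymbol"]
                   for row in kite.instruments()}

with open("token_to_symbol.pkl", "wb") as f:
    pickle.dump(token_to_symbol, f)

# later, wherever the mapping is needed:
with open("token_to_symbol.pkl", "rb") as f:
    token_to_symbol = pickle.load(f)

symbol = token_to_symbol[408065]  # e.g. the token from an incoming tick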
Your screenshot is exciting. If you can share a simple implementation (even if pseudocode), it will be great; otherwise we will get lost in the same devil-in-the-details crack.
Additionally, Influxdb is free to use and I am using their free version, though they do have a paid enterprise version as well.
Further, I found the dict method easier to implement. The reason being that it's pure python and doesn't require one to deal with an external database (especially for those who don't have experience with databases). Coding the ohlcv values is easy. Being a dict, it's super fast. But I would agree that one has to give some thought to getting the right structure for the dict. I used a nested structure like tickers['ACC']['2018-02-07 09:25:00']['open'].
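A minimal sketch of that nested-dict approach (assumption: the symbol, candle bucket and ltp are extracted from each tick by your own code; the key format mirrors the one above):

tickers = {}

def update_candle(symbol, bucket, ltp):
    # bucket is e.g. '2018-02-07 09:25:00' - the start of the candle this tick falls into
    candle = tickers.setdefault(symbol, {}).setdefault(
        bucket, {"open": ltp, "high": ltp, "low": ltp, "close": ltp})
    candle["high"] = max(candle["high"], ltp)
    candle["low"] = min(candle["low"], ltp)
    candle["close"] = ltp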
Regards
Saurav
script1 gets ticks from websocket. For V2 API it's a list of up to 200 stocks in each tick.
Something like :
tick = [
{"instrumentToken":1, time: 9:15:00.200, other data...},
{"instrumentToken":2, time: 9:15:00.200, other data...},
{"instrumentToken":3, time: 9:15:00.200, other data...},
{"instrumentToken":4, time: 9:15:00.200, other data...},
{"instrumentToken":5, time: 9:15:00.200, other data...},
...
]
Script2 splits it into individual queues one per instrument token
instrument1 = [
{"instrumentToken":1, time: 9:15:00.200, other data...},
{"instrumentToken":1, time: 9:15:00.400, other data...},
{"instrumentToken":1, time: 9:15:00.800, other data...},
{"instrumentToken":1, time: 9:15:01.200, other data...},
{"instrumentToken":1, time: 9:15:01.300, other data...},
{"instrumentToken":1, time: 9:15:01.500, other data...},
{"instrumentToken":1, time: 9:15:01.900, other data...},
{"instrumentToken":1, time: 9:15:02.150, other data...},
{"instrumentToken":1, time: 9:15:02.430, other data...},
{"instrumentToken":1, time: 9:15:02.790, other data...},
{"instrumentToken":1, time: 9:15:02.970, other data...},
]
instrument2 = [
{"instrumentToken":2, time: 9:15:00.200, other data...},
{"instrumentToken":2, time: 9:15:00.400, other data...},
{"instrumentToken":2, time: 9:15:00.800, other data...},
{"instrumentToken":2, time: 9:15:01.200, other data...},
{"instrumentToken":2, time: 9:15:01.300, other data...},
{"instrumentToken":2, time: 9:15:01.500, other data...},
{"instrumentToken":2, time: 9:15:01.900, other data...},
{"instrumentToken":2, time: 9:15:02.150, other data...},
{"instrumentToken":2, time: 9:15:02.430, other data...},
{"instrumentToken":2, time: 9:15:02.790, other data...},
{"instrumentToken":2, time: 9:15:02.970, other data...},
]
instrument3 = [
{"instrumentToken":3, time: 9:15:00.200, other data...},
{"instrumentToken":3, time: 9:15:00.400, other data...},
{"instrumentToken":3, time: 9:15:00.800, other data...},
{"instrumentToken":3, time: 9:15:01.200, other data...},
{"instrumentToken":3, time: 9:15:01.300, other data...},
{"instrumentToken":3, time: 9:15:01.500, other data...},
{"instrumentToken":3, time: 9:15:01.900, other data...},
{"instrumentToken":3, time: 9:15:02.150, other data...},
{"instrumentToken":3, time: 9:15:02.430, other data...},
{"instrumentToken":3, time: 9:15:02.790, other data...},
{"instrumentToken":3, time: 9:15:02.970, other data...},
]
script3 takes the list of dicts per instrument, one at a time and converts it into ohlc as follows:
import pandas

pd2 = pandas.DataFrame( instrument1 )
pd2['time'] = pandas.to_datetime(pd2['time'])        # resample() needs a DatetimeIndex
ticks = pd2.set_index('time')[[ 'ltp' , 'volume' ]]  # .ix is removed in recent pandas; select columns directly
priceOHLCV = ticks.ltp.resample( '1min' ).ohlc()
candledata = priceOHLCV.to_csv() # converts the pandas dataframe candle data to a csv string written to the db, which can be easily processed further.
*still learning about pandas so maybe I can do this even more efficiently in the future.
" If so, and if say, you get 100 datapoints in dump, will writing 100 lines into db one by one be performant"I use a combination of mongodb and redis. There are places where each shines.
The point is I broke it down into individual steps and individual components.
Currently I can do a handful in near realtime so my logic is solid. I haven't tried hundreds. Maybe my system will choke when that happens. In which case I can solve the computational issue separately:
Thanks
Regards
Thanks for the nice explanation. Your suggestions are well understood and taken. I am using celery to take care of the scalability in future. Before switching to celery , I read comparisons between celery and other alternatives and found celery scoring reasonably high. But yes , for fewer scripts, we can use other methods advantageously. I collect all the ticks in main thread and entrust celery to do rest of the stuff - grouping (based on token) , resampling + analysis using pandas and writing to csv every "N" minutes. So I save only processed data (with TIs) and not the raw data(tick). However, if some latency is permissible, I think saving the data to a DB and then processing at will should be an option if simultaneous reading from and writing to the DB is taken care with the choice of DB. Which DB do you use?
To simulate live streaming data, I used a historical data set (1-minute candles) and put it in a for loop releasing one data point at a random time interval. So that becomes my tick. So even on market-off days, I could test my code on streaming data. This was very helpful to me.
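A small sketch of that simulation idea (assumptions: hist is a DataFrame of 1-minute candles with a 'close' column, and on_ticks is the same handler used for live data; the token, field names and delays are illustrative):

import random, time

for ts, row in hist.iterrows():
    fake_tick = {"instrument_token": 408065, "last_price": row["close"], "timestamp": ts}
    on_ticks(None, [fake_tick])           # feed it to the same handler used for live ticks
    time.sleep(random.uniform(0.1, 1.0))  # random inter-tick delay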
Your suggestions on hardware part are quite interesting. May be this can be discussed in a separate thread in more detail.
Regards
I have worked with this algo before and it works well. This algo is an obvious first choice: keep collecting ticks in the form of a dict, and when candles are required, convert to pandas and then use the excellent pandas resampling to get candles. For a few tickers, this is the go-to approach and there is no need to look beyond it.
I am not putting too much attention on retention here as 1. it can be added easily, as you have done, and 2. it may not be required as the candles can be fetched using the Historical data API.
However, 1. I wanted to scale it up to few hundred tickers 2. Wanted to have the candles as they are being formed and not once at end of period, and 3. All this while keep all candles real-time/close to real-time. I had trouble scaling this up with existing infrastructure.
So I tried two alternate approaches: 1. InfluxDB and 2. forming candles real-time the way I have explained earlier. I am happiest with last algo and likely to stick to it.
Regards
Saurav
Thanks for sharing the information. It is really helpful. Can you please also share the number of instrument_tokens(symbols) you are working with so that we can have a sense of this realtime performance with regards to the number of symbols as well.
@sauravkedia : can you please shed more light on your statement: "I had trouble scaling this up with existing infrastructure."? Did you have trouble scaling it to 100/200/500/1000 tickers, and what latency did you observe?
Also, looks like you are creating candles on the go and not by first saving it to some database and then using a parallel script to create candles. Please correct me here.
"I looked at influxdb and ruled it out immediately as its a commercial product that needs a license. As an individual user, I cannot afford licenses and it does not make sense when there is a plethora of open source tools to choose from."
Influxdb has an open source version without licensing restrictions as far as I can see, which is what I'm using. I have been using it for about a week now. Can you please confirm if there is really a license on it?
So, Let me take this opportunity and get the credits :-)
Check out my example that I have been using for years very smoothly, at the link below...
www.ezeetrading.in/Articles/Candles_formation_from_tick_data_zerodha.html
Click here for Code
Enjoy ..!
If it helped you, please maintain my credits while using or Forwarding. :-)
Hopefully others will now post their methods.
for ticks to candle go through this link:
https://kite.trade/forum/discussion/6230/program-for-converting-tick-data-into-candlestick-data#latest
I am struggling on one part. The volume piece of it.
So every tick is 1 second apart. How do I correctly compute the total volume for the 1-minute candles formed using these 60 ticks in a minute? I mean, which field should I use and keep accumulating over the 1-minute period?
I am using the websockets
df.resample('5min')['volume'].sum()
where:
df = the receiving websocket data converted to a dataframe with a DatetimeIndex
['volume'] = select only the volume column from the dataframe df
sum() = adds up all the volume values in the desired timeframe
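One caveat worth hedging here: if, as with kiteconnect's full-mode ticks, the volume field in each tick is the cumulative volume traded for the day, summing every tick's value will double count. In that case the per-candle volume is the change in the cumulative figure across the window; a rough sketch, reusing the df above:

candle_cum = df['volume'].resample('5min').last()  # cumulative volume at each candle close
candle_vol = candle_cum.diff().fillna(candle_cum)  # per-candle volume (first candle = cumulative so far)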
to include queues in Python. I have given example implementations for both 1-min and 15-min candles.
The code is uploaded in
https://gist.github.com/oldmonkABA/f73f5d67bcf85316ac0470919bf54e17
At the end of the day, I find mismatches between my candles and candles available from Historical Data.
Is this normal to have discrepancies? Since Zerodha provides 1 tick/second data, isn't it possible that we are missing out on data from other ticks within the same second?
For the time in the tick, I tried separate code for both "last_traded_time" and "timestamp" and found errors in both. Which time should I use for comparing/entering the tick into candles: last_traded_time or timestamp?
p.s I am not missing out on entering any ticks as I have a try: except loop and logging.exception in place. In addition, my code is running on google cloud and internet is also not an issue.
Any help on this would be appreciated!
3 months back I started with the Zerodha API and was using CSV, then moved to MySQL and was looking at other time-series DBs; then I found with my tests that a SQLite file-based DB is faster than the other DBs. So I have been using SQLite for a couple of months now. However, now I see that I need to move to an in-memory DB for ticks, so that I can convert them to 1-min candles faster and store them in the SQLite DB. So now I have been looking at Redis and the SQLite in-memory DB. If Redis takes longer to convert from str to a python object, would SQLite in-memory be faster?
Edit: I have another query:
When you convert ticks to candles at regular intervals from in-memory to another DB, then to get the latest candle info for your algo/code calculations, would you pull, let's say, 1-min data from the DB and append the latest info from the in-memory DB?
For example: I have stored 1-min candles from 09:15 to 09:26 but want the latest info at 09:28, so pull data from the DB until 09:26 and append the 2 latest candles (by resampling) from the in-memory DB?
In other words... I prefer tick --> candle dictionary --> DB instead of tick --> DB --> candles.
I just implemented what I had mentioned above tick ---> tickDB ----> OHLCcandles (this is one process which does it periodically) storing of candles. Another for fetch which gets the candles from OHLCcandles DB and stitches/concats the last bar from tickDB. After a lot of optimizations candle retrieval and concat is happening in 0.1454245 seconds (if I am constructing same number of candles from directly tickDB it is taking around 0.4545436 seconds). I try to avoid pandas as much as possible, but this above operations are all with Pandas.
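A rough sketch of that stitching step (assumptions: load_candles_from_db() and candles_from_ticks() are hypothetical stand-ins for your own functions, each returning a DataFrame of candles indexed by candle time):

import pandas as pd

def candles_up_to_now(token):
    stored = load_candles_from_db(token)             # e.g. 1-min candles saved up to 09:26
    live = candles_from_ticks(token)                 # candles resampled from the in-memory tick DB
    latest = live[live.index > stored.index.max()]   # only the candles the DB doesn't have yet
    return pd.concat([stored, latest])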
Will try to implement what you have suggested here. Some questions:
1. I see you have mentioned in this thread that you do http://ezeetrading.in/Articles/Candles_formation_from_tick_data_zerodha.html for about 400 instruments; how often do you save your 1-min dictionaries (the day's data) to the DB?
2. And to save the dictionary to the DB you use another thread? It can't be another python process as the dictionary will not be accessible.
3. Once you have dumped the 1min candles to the DB from dictionary, do you delete the key/value pairs for those entries copied to db? Or do you keep the candle dictionary throughout the day and let it grow in size in memory? If yes for 400 instruments, what is the size taken in memory?
4. Let's say your code needs data for more than today; then you would fetch the data from your DB and append today's data from the dictionary? If this is the case, then the new program which needs the data cannot access the dictionary if it is not spawned as a different thread in the same process, right?
Please bear with my questions, I am not a typical programmer (thanks to google and stackoverflow) I am sure I will have more questions.
1) At end of day, for record keeping only.
2) You can easily access it by making the variable global.
3) Negligible. I run all my code with over 700 instruments and over 80 strategies on a 1 GB RAM AWS server, which comes under the Free tier.
4) Read the previous day's candles in the morning before market opens into our global variable.
It's just basic knowledge of programming and your preferred programming language that makes the task simple. Sometimes we overthink the "how to?" in the wrong direction and make things complex.
The reason these bother me is that my technical parameters get affected, especially in terms of crossovers.
1. Ok.
2. Ok.
3. Wow! 700 instruments and 80 strategies and on 1GB. Nice! Ah! right, you are not saving tick data, but 1 min candles in dict, so it would not take much space, sweet! Good to hear the free tier is able to handle 700 instruments and 80 strategies. Nice.
4. As you are importing 1min candles for 700 instruments, again it wont take much space in RAM, but can be accessed at faster speeds than from DB. Wow!
I really appreciate your inputs, will try to implement them, if any queries will post them here for others benefits.
candles_1[instrument][ltt_min_1]["volume"]=max(candles_1[instrument][ltt_min_1]["volume"],tick["volume"]) #1.5 Use the max in volume instead of last.
Shouldn't the last received volume in the tick be the max as well?
if Dict1[symbol] != Dict2[symbol]
if Dict1[symbol] == Dict2[symbol]
I saw a lot of comparisons done by looping through each of the key/value pairs, and some comparing hashes of the two dictionaries, but it seems the simple == operator works.
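For illustration, a tiny example of that behaviour (plain Python; == compares keys and values recursively for ordinary dicts):

d1 = {"ACC": {"open": 101.5, "high": 102.0}}
d2 = {"ACC": {"open": 101.5, "high": 102.0}}
print(d1 == d2)   # True
d2["ACC"]["high"] = 103.0
print(d1 == d2)   # False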
Create an in-memory database using list or dictionary per token for incoming ticks (call it incoming DB). At the period boundary, copy/append the incoming DB to corresponding in-memory database (list or dictionary) (call it historical DB). Now historical DB can just be pickled & loaded into/from the File System.
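A minimal sketch of that idea (the names and file path are illustrative, not from the thread):

import pickle
from collections import defaultdict

incoming = defaultdict(list)     # per-token ticks for the current period
historical = defaultdict(list)   # per-token ticks for all completed periods

def on_period_boundary():
    # move the just-completed period into the historical DB and start a fresh one
    for token, ticks in incoming.items():
        historical[token].extend(ticks)
    incoming.clear()
    with open("historical_ticks.pkl", "wb") as f:   # persist the historical DB to the file system
        pickle.dump(dict(historical), f)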
Assume I store all the data into a csv file:
1) collect ltp, token, DateTime,
2) dump to CSV,
3) read from CSV and resample everything using the resample() method,
4) pass the dataframe to the strategies,
all taking place within a loop.
Can you suggest a better technique to me, as this is producing some inaccurate ohlc data?
Or is this inaccuracy due to missing ticks in between?
@sauravkedia @MAG
Unless you can calculate the indicator using lightweight operations with np.array, keep the task of indicator calculation in a separate thread.
Almost every algorithm anyway needs indicator values at the close of the candle, so calculating indicators on every tick is likely to be a waste of resources in most cases.
One may have a clock or scheduler running every 1 minute in a separate thread and do the post-processing of candles there.
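A minimal sketch of such a scheduler thread (assumption: compute_indicators() is a hypothetical stand-in for your own post-processing on the finished candles):

import threading, time

def candle_close_worker():
    while True:
        time.sleep(60 - time.time() % 60)   # sleep until the next minute boundary
        compute_indicators()                # post-process the candle that just closed

threading.Thread(target=candle_close_worker, daemon=True).start()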
I use linux and the simplest way is to run each script in a separate terminal window.
Another option is to start the scripts to run in background mode ( python myscript.py &)
Since I run these scripts every day, the process is automated: I run them through the cron scheduler to launch automatically at 9:05 am and they run in the background. I have implemented logging and I can look at the log files to know whether my scripts have launched, when they launched and what they are doing.
thank you for the quick reply.
You said you run 3 scripts, and the 3rd script makes the candles.
For example, when you are making 1-minute OHLCV candles from ticks, at a particular time do you rebuild all the candles again to get the data of the last candle, or do you make only the last candle and append it to the candles' dataframe?
Search for the line containing ticks.ltp.resample( '1min' ).ohlc() in this thread.
The exact code is out there for anyone to read and copy.
One more thing: any reason for not using threading in 1 script to accomplish all 3 tasks instead of 3 different individual scripts?
Putting multiple functionality under one script will lead to unnecessary complications.
1. The more complicated the code, the more the chances of things going wrong.
2. And when things go wrong, the more the effort needed to trace and rectify the fault without breaking something else.
3. Also when functionalities are kept separate, one script could fail and only that functionality would fail. The others would continue to operate.
@npchoubey You wouldn't make such comments if you had any background in writing/operating/managing large-scale systems. You have no idea of the scale and complexity of the systems needed to make something like Zerodha operate the way it is operating right now. I do, because I work for one of the largest internet brand names in India. Also, the historical API does not need to be efficient in terms of time taken to generate candles because it is meant for historical look-ups. It's not intended for real-time trading, especially intraday. Can Zerodha make it efficient and generate candles in real time? Yes, they can. But it would probably require three times the hardware investment, and datacenter resources do not come cheap.
(ignore it, solved; if there is a better way to do it, please guide)
hi again.
need guidance
I am saving ticks in a dataframe df which has 2 columns: time and ticks.
Then I convert the time column to a DatetimeIndex and set it as the index column, so that I can resample it:
df1=df
df1['time']=pd.to_datetime(df['time'])
df1 = df.set_index('time')
then i resample it and make candle
df1=df1.resample('1T').ohlc().dropna()
I got the desired output, but in the wrong format. Now in the output of df1,
ticks is the only top-level column,
and open, high, low, close are sub-columns of ticks (with time as the index).
If I want the value of open, I have to write df1.ticks.open instead of df1.open.
How do I make it a normal dataframe table
like below
time open high low close
value value value value value
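A small sketch of two ways around this (reusing df from above; assuming the price column is named 'ticks'):

import pandas as pd

raw = df.copy()
raw['time'] = pd.to_datetime(raw['time'])
indexed = raw.set_index('time')

# (a) resample the 'ticks' column as a Series - this yields flat open/high/low/close columns
ohlc = indexed['ticks'].resample('1T').ohlc().dropna()
print(ohlc.columns.tolist())   # ['open', 'high', 'low', 'close']

# (b) or keep the frame-level resample and drop the top level of the MultiIndex columns
nested = indexed.resample('1T').ohlc().dropna()
nested.columns = nested.columns.droplevel(0)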
HI again
Somehow, I am able to paper trade 1 scrip with my system successfully. Now I want to do it with more than 1 scrip, to get better results from this system.
I am using tick data, and I am not able to understand how to differentiate the ticks of different instruments: every tick batch is different, and it does not provide ticks in any fixed sequence of instruments. Can you please help me out on how I can achieve that with the help of a dataframe?
Any lead, reference or idea will do.
Each tick has the instrument token in the first 4 bytes. You can know more here.
If I subscribe to 3 tokens, sometimes a batch gives data for only 1 token, sometimes for 2 tokens, so I am not able to save it to different dataframes.
Can you please tell me, in terms of dataframes, how to differentiate each tick in the live market and how to save each tick to its respective dataframe?
df["instrument_token"] = the_inst_token_recieved_in_tick_data
Also, looking at your code "df1 = df", I would recommend the links below to get acquainted with Python (and dataframes in your case) before you dig deeper into automating your strategy with Kite Connect. Some of these are well-maintained official documentation and will help you build a good knowledge base and speed up your code development.
https://pandas.pydata.org/docs/user_guide/10min.html#min
https://pandas.pydata.org/docs/user_guide/index.html#user-guide
https://www.w3schools.com/python/
https://docs.python.org/3/tutorial/
Thanks
As long as you are not into High Frequency Trading, a 1-minute candle is enough to trade, and you can trust Zerodha for that.
1-minute candles can either be taken using the Historical API subscription or by writing a little scraping code by looking into the browser's web console. (I won't post it here as it is "illegal" to scrape.) There are N number of people using scraping to read candles and M number of people who use scraping to execute orders as well.
For the record, I personally use the Historical API subscription.
@npchoubey The Historical API is not ideal for live trading. For live trading one has to follow the websocket approach, as is mentioned repeatedly on this forum. Basic websocket usage is also documented in the official kite connect documentation: https://github.com/zerodha/pykiteconnect.
Nithin Kamath has explained this difference in detail on tradingqna.
Some of his posts :
https://tradingqna.com/t/why-does-two-charts-of-the-same-timeframe-look-different-on-the-same-platform/4715/3
https://tradingqna.com/t/why-does-the-same-technical-charts-for-the-same-stock-from-nse-tame-google-finance-yahoo-finance-amibroker-not-match/1688/12
Ex :
{token : 408065, Latest Volume Today : 6040361.0, Last Traded Time : Thu Jan 06 15:18:00 IST 2022, Tick Time Stamp : Thu Jan 06 15:18:00 IST 2022}
{token : 408065, Latest Volume Today : 6040361.0, Last Traded Time : Thu Jan 06 15:18:00 IST 2022, Tick Time Stamp : Thu Jan 06 15:18:00 IST 2022}
So, should the database write operations be performed in a separate thread each time (every minute), and that thread closed each time?
I have one more question: should we store each one-minute candle in a persistent DB like mongodb or postgres, or just in memory? What is the preferred approach?
https://gist.github.com/sumitsk20/4a7e9cf667852590e44f6275c06caaa0
Below is the code snippet.
import queue
import threading

from kiteconnect import KiteTicker

# FandOInstrumentsSubscribe: list of instrument tokens to subscribe to.
data_queue = queue.Queue(maxsize=0)

def on_ticks(ws, ticks):
    data_queue.put(ticks)

def on_connect(ws, response):
    ws.subscribe(FandOInstrumentsSubscribe)
    ws.set_mode(ws.MODE_FULL, FandOInstrumentsSubscribe)

def on_close(ws, code, reason):
    # On connection close stop the main loop.
    # Reconnection will not happen after executing `ws.stop()`
    ws.stop()

def process_data():
    while True:
        data = data_queue.get()
        for tick in data:
            print(tick)
            # do the computation with the data

# Daemon thread so the process can exit once the websocket loop stops.
thread2 = threading.Thread(target=process_data, daemon=True)
thread2.start()

kws = KiteTicker(api_key, access_token)  # api_key / access_token: your credentials

# Assign all the callbacks before connecting; kws.connect() blocks.
kws.on_ticks = on_ticks
kws.on_connect = on_connect
kws.on_close = on_close
# kws.on_error = on_error

kws.connect()