On recent downtime and what we're doing

Kailash
This is in response to @sudhirshettyk's question here -- https://kite.trade/forum/discussion/349/regarding-earlier-thread-of-july-1-2-important-oms-connectivity-issues

A post was in order explaining what we're doing to deal with these issues. I'm posting it here rather prematurely, because discussions are still happening. We absolutely understand the seriousness of the situation. If we did not, we'd be fools to even attempt something like Kite Connect, and I'd be a clown of a "CTO".

The OMS
---------------
Thomson Reuters (TR) Omnesys is our OMS vendor. The entire OMS ecosystem is a very complicated piece of technology, hence most brokers choose battle-tested vendors. The OMS itself is composed of several layers, and there's one layer that Kite Connect interfaces with. While the core itself is stable, the proprietary layer TR has exposed to us has developed issues with scale. Pushing them on this has taken us about a year (even before Kite and Kite Connect).

Network / Data centre (DC)
--------------------------------------------
Our primary DC is NetMagic (India's biggest) in Mumbai. This is where the entire TR OMS ecosystem resides, and it's fully managed by them (we do not have direct access). The system resides on top-of-the-line servers (128-core CPUs, 512 GB RAM, etc.). We connect to the exchanges from here via leased lines.

Our secondary data centre is powered by ILFS, which is also in Mumbai. Here, we have an entire rack of servers dedicated to the Kite ecosystem. Again, top-of-the-line hardware (32-core CPUs, 32 GB RAM, etc.). All your requests to Kite go here, and from here, to NetMagic over peer-to-peer leased lines.

OMS issues
--------------------
On Friday (1st July), several people from TR's upper management (including TR's CTO and South Asia director) were at Zerodha's Bangalore HQ to discuss these issues. Coincidentally and ironically, this is when the major downtime happened. They were in our office for close to five hours dealing with the meltdown. They did whatever they do when the OMS layer they supply us goes down.

They have promised us swift actions and resolutions--a better team, better monitoring, a revamp of their technology ... However, for a monolithic organisation like TR with legacy systems, we're well aware that things are easier said than done. That said, a team has been formed with people from both sides to resolve these recurring technical issues, and we're doing our best to push things forward. We're awaiting major fixes from TR in the coming days.

Quality OMS vendors (or for that matter, most quality enterprise tech) are almost non-existent in India. Nonetheless, we've started looking for alternate vendors we could use in conjunction with our current setup, exclusively for the Kite Connect APIs (although a two-vendor OMS setup is going to be complicated technologically, operationally, and monetarily).

To add, Zerodha has also made a strategic investment in an OMS startup, but honestly, any potential integration there would be quite some time away.

Network issues
-------------------------
As I mentioned, we have spared no expense when it comes to hardware and infrastructure--top-of-the-line servers, switches, hardware firewalls, massive amounts of dedicated bandwidth. In terms of hardware infrastructure and raw computational power, at any given moment, we've probably got 10x spare capacity. With the way Kite Connect itself is built, we can handle many times the requests we're serving now (but TR's OMS layer exposed to us is the bottleneck there).

At the ILFS DC, the issues we've faced have been due to network inconsistencies. We have a combination of TATA+Airtel lines there (BGP), but there have been "mysterious" flaps. The DC has been unable to give us any satisfactory answers and the investigation continues. On Saturday (9th July), the DC is commissioning an even better hardware firewall exclusively for the Kite rack, just to be sure that the network glitches are not actually at the firewall level.
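To make the kind of investigation above concrete: one way we can correlate "flaps" against the DC's and ISPs' logs is to record per-interval reachability probes on each line and extract the exact down windows. This is only an illustrative sketch (the probing itself -- pings, BGP session state -- is environment-specific and not shown here):

```python
def down_intervals(samples):
    """Given per-interval reachability probes (True = link up),
    return (start_index, end_index) pairs for each contiguous
    run of failed probes -- i.e. the windows to match against
    the DC's and ISPs' own logs."""
    intervals, start = [], None
    for i, up in enumerate(samples):
        if not up and start is None:
            start = i                      # a down run begins
        elif up and start is not None:
            intervals.append((start, i - 1))  # the run just ended
            start = None
    if start is not None:                  # still down at the end
        intervals.append((start, len(samples) - 1))
    return intervals
```

Feeding this one sample per second, for example, turns a vague "the line was flaky last night" into precise timestamps both sides can check against their gear.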

Last Tuesday (5th July), the Amazon Cloud team was in our Bangalore office discussing their new offerings. Although drastic, moving to the "cloud" and ditching our dedicated physical hardware is also something we're evaluating, if that's what it takes. This route has opened up only because Amazon launched their Mumbai operations a couple of weeks ago, and because they allow direct leased-line connectivity.

Why don't other brokers have these issues?
----------------------------------------------------------------------
This is something that has puzzled us constantly and that we've personally pored over. We've spoken to OMS vendors, ISPs, DCs, and exchanges. The answer is, every broker faces issues. The smaller brokers have so little activity that the issues go unnoticed. The bigger brokers have systems and redundancies built over decades, so their issues are comparatively infrequent. Then again, they are also quite comfortable with the same legacy platforms they've been offering over the years. That TR, one of the biggest OMS vendors in India, is not able to cope with our traffic does not, to some extent, come to us as a surprise. On a reasonably active day, we'll have 40-50,000 live, concurrent users across our platforms. I don't know if any of the big brokers can claim such levels of concurrent activity. In addition, the absolute lack of any legal recourse Indian regulators afford brokers against technical faults by exchange-approved OMS vendors doesn't help our case one bit.

For me and my tech team (and of course, Nithin and the rest of Zerodha's management), this has been a source of endless frustration and sleeplessness. We genuinely want to just build better technology and change the Indian investment-tech space for good. However, a significant chunk of our productive hours is wasted dealing with the aforementioned forces on a daily basis. For customers like you who trust Zerodha, and whose hard-earned money is at stake, I can absolutely understand how horrible an experience these issues are.

To conclude, we are doing our best to come up with resolutions. We're exploring dozens of possibilities, everything from building our own OMS from scratch, to setting up our own mini-DCs, to looking for other OMS vendors ... Anything that is practical, we'll do soon. In the meantime, we're pushing the vendors to make the current setup behave drastically better.

Hope that provides some clarity.

Kailash
  • sudhirshettyk
    Thank you so much @Kailash for your effort in providing a detailed explanation of the issues and plans. It's quite comforting to know that this issue is at the top of your priorities.
    One question -- you said the instability of a proprietary layer TR has exposed to you for Kite Connect is one major cause of the issues. So can we assume that, while we see OMS errors on the API interface, we should still be able to use the Kite web interface or Pi reliably, and fall back on these, which you implied interface with the stable TR core? Did I understand that correctly?
  • Kailash
    @sudhirshettyk Kite Web and Kite Connect belong to the same ecosystem, so many issues affect them both.

    Pi is an entirely different kind of backend, so it should be immune to most issues that happen at the TR API level.

    PS: Several major updates have happened over the weekend at TR and ILFS for the better.
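For API consumers hit by the transient OMS errors discussed above, a pragmatic stopgap on the client side is to retry failed calls with exponential backoff rather than failing outright. A minimal sketch -- note that `TransientOMSError` and the injected `fn` are placeholders for illustration, not part of the actual Kite Connect client library:

```python
import random
import time


class TransientOMSError(Exception):
    """Placeholder for a transient OMS/network failure (hypothetical)."""


def call_with_backoff(fn, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call `fn`, retrying on TransientOMSError with exponential
    backoff plus jitter. Returns fn()'s result, or re-raises the
    error after `max_attempts` failed attempts."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except TransientOMSError:
            if attempt == max_attempts:
                raise
            delay = base_delay * (2 ** (attempt - 1))
            sleep(delay + random.uniform(0, delay / 2))
```

Backoff like this only makes sense for idempotent or safely repeatable calls (e.g. fetching positions); blindly retrying order placement can double-fire orders, so check order status before re-submitting.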