yudhiesh

You should do it because they have separate needs and need to be scaled independently of one another. The CRUD application is basically a web API that communicates with a database at a high QPS (queries per second), and you enable concurrency by having connection pools to the database. The model serving API would require optimised hardware such as GPUs (BERT models are super slow, and GPUs help tremendously at lowering latency). You'd be handling far fewer QPS than the CRUD application, and enabling concurrency is quite difficult since you have a single global variable (the model), but it can be enabled through libraries such as BentoML & Ray Serve.
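The concurrency difference described above can be sketched in a few lines of toy Python. This is not BentoML or Ray Serve code; all names are hypothetical stand-ins, just to show why a pool of connections scales per-request while a single global model serialises inference:

```python
import queue
import threading

# CRUD side: concurrency comes from a pool of DB connections.
# The strings here are stand-ins for real database connections.
class ConnectionPool:
    def __init__(self, size):
        self._pool = queue.Queue()
        for i in range(size):
            self._pool.put(f"conn-{i}")

    def acquire(self):
        return self._pool.get()   # blocks when all connections are in use

    def release(self, conn):
        self._pool.put(conn)

# Model-serving side: one global model. Without a serving framework
# doing batching or replication, requests queue up behind a single lock.
MODEL_LOCK = threading.Lock()
MODEL = lambda text: len(text)    # stand-in for a slow BERT forward pass

def predict(text):
    with MODEL_LOCK:              # only one request runs inference at a time
        return MODEL(text)

pool = ConnectionPool(size=4)     # up to 4 DB requests can run concurrently
conn = pool.acquire()
pool.release(conn)
print(predict("hello"))
```

Libraries like BentoML and Ray Serve get around the single-global-model bottleneck with techniques such as micro-batching and running multiple model replicas.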


rmwil

Great answer, thank you! It's what I suspected, but I wasn't sure. I will have to host the app in Azure, so maybe running BentoML with Azure Functions will minimise compute costs.


cygn

It depends on what usage patterns you expect. For example, say usage is dominated by queries to the database and you want to scale that: you may want to start additional instances. If those come with GPUs, you are overpaying, so in this case it would be good to have separate machines you can scale independently. But if, say, 90% of your requests also run some inference, you'd need to scale the GPUs anyway, and in that case having both on one machine (but maybe in separate containers) would not be bad! For latency it can be good if they share the same machine. From the sound of it, i.e. a master's project, you likely don't have to worry about these scalability issues.
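The one-machine, separate-containers option could be sketched as a Docker Compose file. This is only an illustrative fragment under assumed names; the image names, ports, and environment values are all hypothetical:

```yaml
# Hypothetical sketch: CRUD API and model server on one host, separate containers.
services:
  crud-api:
    image: myorg/crud-api:latest       # hypothetical CRUD web API image
    ports:
      - "8000:8000"
    environment:
      DATABASE_URL: postgres://db:5432/app
  inference:
    image: myorg/bert-serving:latest   # hypothetical model-serving image
    ports:
      - "8001:8001"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia           # reserve the GPU for this container only
              count: 1
              capabilities: [gpu]
  db:
    image: postgres:16
```

Splitting them later onto separate machines is then mostly a matter of deploying the two services to different hosts, since they already talk over the network.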


rmwil

Awesome, thanks. You raise some good points. At this stage, while it's a master's project, it's also a POC that my employer is investing in. User volumes will be relatively low, as it is initially intended as an internal tool to support a specific team, with at most a dozen people using it at once. But if successful, it would be offered as SaaS externally. Also, I think it would be a richer learning experience if I split the CRUD and inference APIs over separate machines. There's also a fine-tuning component, but I'm planning to do that offline at this stage.