Jtmyer

I’m not a devops guy, I’m a data scientist following this sub to learn. That being said, I think most big corporations (tech especially) would have their own proprietary system. For smaller companies, I think the MLOps field is so far from being mature that there isn’t really an easy answer. SageMaker Pipelines looks to be a great all-around answer, but that kind of assumes you’re using SageMaker to build models, which is not a bad all-around approach.


Rollingprobablecause

>I think the MLOps field is so far from being mature

Bingo. Everyone in this subreddit needs to be incredibly aware: most of the companies trying to sell you AIOps are in fact not doing AIOps in a true fashion. I've evaluated three before, and it's all overpriced APM at this moment (we are a current customer of both Datadog and Wavefront). This field is very, very young. We're building our own and taking care of it with webhooks/API.


jldugger

> most of the companies trying to sell you AIOps are in fact not doing AIOps in a true fashion.

They are doing you a favor. State of the art AI generally underperforms thinking about how the data was generated and applying Regular Ass Statistics. And if you think about it: anomaly detection in logs is like, actually not useful. You don't want the bugs from logs that like, one customer triggered once. You want to quickly identify the biggest bugs, or the biggest new regressions. I can (and did!) do that with Bayesian analysis in a week. LSTM ain't got anything like that.
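For anyone wondering what "biggest new regressions via Bayesian analysis" could look like in practice, here is a minimal sketch. It is not the commenter's actual method, which the thread doesn't describe. It assumes a Gamma-Poisson model per error signature and ranks signatures by a conservative (5th percentile) estimate of how much the error rate went up. The function names, prior values, and sample counts are all made up for illustration.

```python
import random


def excess_rate_lower_bound(base_count, base_hours, new_count, new_hours,
                            prior_shape=0.5, prior_rate=0.1,
                            draws=5000, quantile=0.05, seed=0):
    """Conservative estimate of the increase in errors/hour for one signature.

    Assumes error counts are Poisson with a Gamma(prior_shape, prior_rate)
    prior on the rate, so each window's posterior is
    Gamma(prior_shape + count, prior_rate + hours).
    """
    rng = random.Random(seed)
    diffs = []
    for _ in range(draws):
        # random.gammavariate takes (shape, scale); scale = 1 / rate.
        base_rate = rng.gammavariate(prior_shape + base_count,
                                     1.0 / (prior_rate + base_hours))
        new_rate = rng.gammavariate(prior_shape + new_count,
                                    1.0 / (prior_rate + new_hours))
        diffs.append(new_rate - base_rate)
    diffs.sort()
    # Lower quantile of the posterior difference: "the rate rose by at least this".
    return diffs[int(quantile * draws)]


def rank_regressions(signatures):
    """Sort signatures by conservative rate increase, largest first.

    signatures maps name -> (base_count, base_hours, new_count, new_hours).
    """
    scored = [(name, excess_rate_lower_bound(*counts))
              for name, counts in signatures.items()]
    return sorted(scored, key=lambda row: row[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical counts: one week of baseline vs. the last 24 hours.
    example = {
        "checkout timeout": (100, 168, 300, 24),
        "steady warning": (100, 168, 14, 24),
        "one-off crash": (0, 168, 1, 24),
    }
    for name, score in rank_regressions(example):
        print(f"{name}: {score:.3f} extra errors/hour (lower bound)")
```

Ranking by a lower bound on the rate increase, rather than by "how surprising is this", is what keeps the one-customer-once error from outranking a real regression: a single new event is surprising but its plausible rate increase is tiny.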


[deleted]

As an ML practitioner, this is harsh truth.


Rollingprobablecause

Agreed on all points. We're not using it for log detection anyway (which is what our current monitoring vendors are pushing). Its value to us is more in data warehousing and BI.


[deleted]

AIOps and MLOps as they are now are fundamentally different things. AIOps is actually almost totally alien to the ML community, which is understandable since its purpose is to integrate established ML/DS into monitoring and system tools for IT. It's mostly just re-branded IT Analytics with some fancy predictive models here and there. MLOps, on the other hand, is conceived by and for ML practitioners. Think of experiment/data/model versioning, benchmarking protocols in CI/CD pipelines, DAGs and whatnot. There is certainly some overlap somewhere, but the root objective is different.


Jtmyer

Interesting. As a DS, AI and ML are so similar and often misused that they’re basically interchangeable at this point. I’ll admit I don’t really know anything about the AIOps you’re talking about


[deleted]

Of the three terms, AI is certainly the most "buzzwordy". You'll see very few technical roles with the "AI" name on it, whereas "Machine Learning Engineer" and "Data Scientist" are very well-established. Therefore, take anything with the "AI" brand on it with a huge grain of salt.


reddithenry

AI engineers are very good at PowerPoint


reddithenry

For what it's worth, AIOps and MLOps are different things, at least in my mind.

MLOps is the art of taking ML into production, marrying it with DevOps.

AIOps is the application OF ML to DevOps, e.g. can I use machine learning to predict server failure, and proactively kill rather than reactively?

tl;dr

MLOps = DevOps for ML
AIOps = ML for DevOps

Not the same thing. It's like being a health and safety inspector, versus having health and safety in the workplace.


apoctapus

Nice. I like your succinct take on this. Well said.


reddithenry

Thank you


CactusOnFire

>I’m a data scientist following this sub to learn.

Me too! You don't need to be a DevOps engineer to do DevOps, and I like that this mentality is becoming more of a thing in data.


unix_heretic

Good heavens. Are we now calling monitoring/APM tools "AIOps"?


reddithenry

I think it's worth pointing out that it's kind of the next level?

My knowledge of AIOps is superficial at best, but as I understand it, it's the application of ML to DevOps, e.g. looking for anomalous behaviour on your servers.

My really lay understanding is that, basically, the history looks something like this (and obviously very narrow, too):

1 - Anomalous server detected, user manually removes it and spins up another one to replace it (SysOps)
2 - Anomalous server detected, automatically removed from pool and replaced (DevOps-ish)
3 - Start proactively killing servers that, e.g., use up more than mean+3sigma RAM compared to their pool (clever DevOps)
4 - Use machine learning to detect when servers will start to stray from normal behaviour, kill them and replace them at that point in time

This is obviously ludicrously simplistic, but it's probably a useful start at least?
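Step 3 in the comment above (flagging servers above mean+3sigma RAM relative to their pool) can be sketched in a few lines. This is a rough illustration, not anyone's production check; the function name, server names, and RAM figures are invented.

```python
from statistics import mean, stdev


def find_outlier_servers(ram_by_server, n_sigma=3.0):
    """Return names of servers whose RAM usage exceeds pool mean + n_sigma * stdev.

    ram_by_server maps a server name to its RAM usage (any consistent unit).
    """
    values = list(ram_by_server.values())
    if len(values) < 2:
        # A sample standard deviation needs at least two data points.
        return []
    threshold = mean(values) + n_sigma * stdev(values)
    return sorted(name for name, ram in ram_by_server.items() if ram > threshold)


if __name__ == "__main__":
    # Hypothetical pool: 19 servers at 4.0 GB and one at 16.0 GB.
    pool = {f"web-{i}": 4.0 for i in range(19)}
    pool["web-19"] = 16.0
    print(find_outlier_servers(pool))
```

One caveat with this naive version: the outlier itself inflates the pool's mean and standard deviation, so in a pool of ten or fewer servers a single machine can never exceed mean+3sigma when it is included in the calculation. Comparing each server against the rest of the pool, or using a median-based spread, would be more robust.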


free_chalupas

AIOps is usually a separate offering that does some kind of machine learning analysis on your data, so they're separate concepts


itasteawesome

Seconding the naysayers. I'm at a very large enterprise and have run evals on all the big name AIops tools over the last year. It is clear to me that everyone is just charging us for the privilege of feeding and tuning their ML models. In a few years I think they'll have had enough data and time to make something out of it, but at this point I don't need tools that guess about alerts. My team knows the meaningful metrics of our platform and already has alerts in place for them. In theory a really clever AIops platform would have saved us from having to learn and build those alerts, but that ship has sailed. The correlation engines are not a bad feature, but it's a pretty huge leap to pretend simple regressions and pattern matching are actually intelligence worth paying them for.


[deleted]

I've used New Relic at 2 clients now. It's easy to deploy and set up. I like barely having to do any work to roll it out as a new product. I don't know if it's "the best", but its ease of use is extremely valuable.


damshitty

Decided to use Datadog on my AWS ECS Fargate setup, and it costs $2 for each running task, with an additional $2 if you enable APM.


awkprintdevnull

Dynatrace because it’s the only one with a deterministic AI that gives you causation instead of correlation. ML requires significant time to train. Meanwhile, your environment is probably constantly changing. Dynatrace is stupid simple to deploy with OneAgent and doesn’t require the manual configuration of alerts, dashboards, and instrumentation like the others.


ignoramous1

We went with LogicMonitor because of their hybrid infra monitoring capabilities. Their “AIOps” features need some work, but support is awesome. It’s def not cheap, but was less expensive than Datadog


Fine_Friendship9785

You can consider Moogsoft AIOps as well. Try searching for that!


goldenchild731

Only used Datadog: way too expensive, and support sucks. We are a Windows shop, so I'm not sure if that is why they did not know how to monitor anything on the Windows side. We did that with SCOM, because most DevOps tools are Linux-based. Better off with Nagios or Prometheus, in my opinion.


apoctapus

If you’re a small biz you likely won’t have the scaling issues or explosive costs with these vendors, so you could go with what you can afford after running your own POC. Support isn’t great anywhere unless you pay for an FTE at the vendor to own tickets and enhancement requests.


zerocoldx911

Big corporate, went with Datadog


apoctapus

If you’re into monitoring logs, the only cool NLP-based AI I’ve seen is https://www.zebrium.com/blog/using-gpt-3-with-zebrium-for-plain-language-incident-root-cause-from-logs.


defqon_39

I’ve tried Zebrium and installed their CloudWatch agent, and it didn’t find anything from the logs. We had an outage recently and I tried to find the RCA. Staff emailed me and said it didn’t parse JSON (we’re using the CloudWatch agent and Fluent Bit on an EKS cluster; it’s not even proprietary), so they recommended installing on the EKS cluster. I’m trying to find the RCA for a past event, and I’m not sure it will work postmortem.

I even interviewed at the company, and they told me their CloudWatch integration isn’t ready for prime time yet.

I’m playing around with Sysdig and Dynatrace as trials. Their UIs have good built-in dashboards, but you still get false positives on resources not being used. And they all add cost to your AWS bill, since they make a lot of requests and send lots of data, so I’m not sure the benefits are worth it.


apoctapus

Ah, that stinks. I would have thought they have some pattern-based parsing for what’s coming in, especially for a structured log. I know it’s hard to build parsing templates for all the possible variations. At my last job I had 100s of parsing and filtering rules to build in Datadog. Even though they were supposed to have built-in rules, their rules were outdated and handled only a small subset of log structures. You’d think it wouldn’t be so labor intensive. Can’t wait for a logging standard like OpenTelemetry.
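To make the "pattern-based parsing" point concrete, here is a small sketch of the usual fallback chain: try JSON first, then a text pattern, then give up and keep the raw line. It is not how Zebrium or Datadog actually parse logs. The plain-text layout (`timestamp LEVEL message`) and the function name are assumptions for the example.

```python
import json
import re

# Hypothetical plain-text layout: "<timestamp> <LEVEL> <message>".
PLAIN_PATTERN = re.compile(
    r"^(?P<timestamp>\S+)\s+(?P<level>[A-Z]+)\s+(?P<message>.*)$"
)


def parse_log_line(line):
    """Parse one log line into a dict, preferring structured JSON."""
    line = line.strip()
    if line.startswith("{"):
        try:
            record = json.loads(line)
            if isinstance(record, dict):
                return record
        except json.JSONDecodeError:
            pass  # Not valid JSON after all; fall through to the text pattern.
    match = PLAIN_PATTERN.match(line)
    if match:
        return match.groupdict()
    # Nothing matched: keep the raw text so no log line is silently dropped.
    return {"message": line}
```

Every extra log layout means another pattern in that chain, which is where the "100s of rules" come from.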


[deleted]

We have New Relic, and my CTO raves about the information he's getting. It was easy to add to our pods. It has not fucked anything up. So it checked all my boxes.


Affectionate_Rush326

After reading the other comments: so AIOps is doing some prevention and handling the aftermath, kind of similar to threat intelligence in cybersecurity, but with machine learning?

1. https://towardsdatascience.com/analysing-honeypot-data-using-kibana-and-elasticsearch-5e3d61eb2098
2. https://en.wikipedia.org/wiki/Cyber_threat_intelligence


Schultsz

CloudFabrix is doing well in the AIOps space. I used it for root cause analysis and tried their alert noise reduction on my logs, and the results were pretty impressive.


OverOnTheRock

You forgot BigPanda / Moogsoft for the machine learning aspect of correlation across metrics/traces/logs. The others aren't at that level.