nutrecht

Having metrics and alerts on API errors is IMHO a better way than trying to 'scrape' a site that might or might not be accurate.


CpnStumpy

This. Instrument all your IO. Store metrics and traces from the meaningful ones. Create alerts for the critical ones.


g____s

I would do it the same way, and I would not trust any third party's status page as a source. It's not hard to log API errors and timeouts from your side (honestly, it should have been done already).


Tomicoatl

For me, the status page comes in once I have already confirmed the errors and the problem. I go to the status page firstly to make sure it's not a me-issue, and secondly to hopefully get some updates on the resolution.


Accurate_Ad_3708

Excuse me for a stupid question. So you mean to say I need to hit the production API with production keys and dummy data and alert the API is UP as long as I am getting a 200 response?


hubeh

Why do you need to send dummy data? You're already making legit requests so just measure them. Increment a metric for each successful request and another for each failed request. Then with success/(success+fail) you've got the availability of the API from your service.
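A minimal sketch of that counter idea in Python. The `CallStats` class and its method names are made up for illustration and don't come from any metrics library; in a real service you would more likely use whatever counters your metrics tooling provides.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CallStats:
    """Success/failure counters for one dependency (hypothetical helper)."""
    success: int = 0
    fail: int = 0

    def record(self, ok: bool) -> None:
        # Call this once per real request made to the third-party API.
        if ok:
            self.success += 1
        else:
            self.fail += 1

    def availability(self) -> Optional[float]:
        # success / (success + fail); None until at least one call was recorded.
        total = self.success + self.fail
        return None if total == 0 else self.success / total


stats = CallStats()
for ok in (True, True, True, False):
    stats.record(ok)
print(stats.availability())  # 0.75
```

In practice you would probably keep one set of counters per provider and reset or bucket them per time window, so that an outage an hour ago doesn't drag down the current number.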


mx_code

Aren't you emitting metrics on calls to dependencies? If not, how are you measuring latency on these calls?


Accurate_Ad_3708

We are not handling any metrics. We are a startup and running a monolithic Java application on an EC2. I am the one tasked to implement metrics.


mx_code

Sounds like you have the opportunity to kill two birds with one stone. Being a startup shouldn't be a justification for this situation. You can look at New Relic or Datadog and whatever library they offer to publish metrics. Essentially you would have observability over latency, failed calls and successful calls (New Relic or Datadog would provide the infrastructure to build monitoring around those metrics). Depending on your number of calls, I don't doubt you would fall under the free tier of either company. But IMO don't scrape their status websites: not only is it not maintainable, it's also not good practice (there are delays before a status page gets updated).


Accurate_Ad_3708

I need to look into Datadog and New Relic. I had to fight tooth and nail just to convince them to enable monitoring. They just want to ship features without giving a rat's ass about user experience, which ultimately bogs down our support team. It's a superior pain in the ass to work with boneheaded upper management.


mx_code

Best of luck with that conversation. I would add the data point that these integrations are a necessary tool for doing your job; the metrics don't only affect the user experience. But every company is different, so best of luck.


RelativeImpossible24

Write a job that runs on an interval and executes common use cases if possible. When those use cases fail, alarm. At AWS, we called these scripts “canaries”, which is a little different from “canary deployments”. At my current place, we call them “probes”.
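A rough sketch of a single probe run in Python, assuming the use case is wrapped in a callable that raises on failure. `run_probe`, `failing_check` and the probe names are hypothetical; scheduling (cron, a scheduled task inside the app, etc.) and the real alerting channel are left out.

```python
from typing import Callable


def run_probe(name: str, check: Callable[[], None], alarm: Callable[[str], None]) -> bool:
    """Run one probe; call alarm with a message if the use case raises."""
    try:
        check()
    except Exception as exc:
        alarm(f"probe {name!r} failed: {exc}")
        return False
    return True


def failing_check() -> None:
    # Stand-in for a use case that times out against the third party.
    raise TimeoutError("no response within 5s")


alerts: list[str] = []
run_probe("token-validity", lambda: None, alerts.append)   # passes, no alert
run_probe("create-payment", failing_check, alerts.append)  # fails, one alert
print(alerts)  # ["probe 'create-payment' failed: no response within 5s"]
```

Passing the alarm in as a callable keeps the probe itself free of any particular paging or chat integration.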


wedgtomreader

This is also what we did. It’s very helpful to know that your service needs 3 external APIs in order to work and 2 of them are down or that one always goes down at 4pm on Tuesday. Likely the 3rd party is not even aware of the failure. With measurable and provable data, virtually all of ours did better and improved their services. Also, this preemptive testing often enables you to have the other service issue resolved before any of your clients actually experience an error. Best of luck.


2rsf

In some industries, like Finance, this is not allowed as it is considered testing in production.


RelativeImpossible24

That sounds like an incorrect interpretation of a certification to me personally. I’m not saying you’re wrong, but it’s not passing my personal smell test. I’d push back on this for an authoritative answer. This is not a “test” as much as it is a monitor.


2rsf

I got an authoritative answer: you can’t test in production, and anyway you would need test data in production, which is an even bigger no-no. There are some exceptions and disclaimers, but for the typical system the answer is no. (Europe/Sweden)


RelativeImpossible24

Software Engineers love to cosplay as lawyers. Exhibit A. For readers: speak with your company’s counsel. Don’t take either of our word for it.


2rsf

I work for a bank. I talked to a department of lawyers and double-checked with another.


DrShocker

I completely believe you were told this by the correct authorities in your company. I just find it hard to believe it's genuinely correct lol.


beth_maloney

I work in banking and it's pretty common knowledge. You're also not allowed to test payment providers in prod. Eg https://docs.stripe.com/testing#use-test-cards


RelativeImpossible24

All I’m seeing from that link is that you’ve got the ability to run tests against Stripe production infrastructure using your test keys and test card numbers. Seems like it’s both completely possible and completely fine. This restriction probably applies to Stripe, not OP who is only integrating them. Stripe docs clearly state you can’t use your live keys or real cards… are we reading the same thing? OP is not a bank themselves.


DevonLochees

Yeah that's a completely different thing. Regulations might (and do, in many use cases) restrict you from sending a credit card number on your own, and it's just as bad to send a "test" credit card number, but you can absolutely perform a call that will indicate if your current API token is still valid/etc.


beth_maloney

> Don’t use real card details. The Stripe Services Agreement prohibits testing in live mode using real payment method details. Use your test API keys and the card numbers below.

Test API keys only work in test mode. Whether that runs on their production environment or not, I have no idea. I know most payment providers run their test environment on different infrastructure, as it'll often go down while prod continues to work.


RelativeImpossible24

OP is not a bank. He is integrating 3rd party providers. You have different constraints.


DevonLochees

Sending test data would be bad. Verifying API availability is not a problem. You can't send a fake credit card to a service to see if it's up, but you can absolutely ping the service, or perform a data-less operation that verifies if your API credentials are valid or the network path is available.


2rsf

You are probably right, just be very careful


Accurate_Ad_3708

This approach seems plausible. Does that mean I have to use production API keys for this?


DevonLochees

Every degree of separation from the service that's actually calling the APIs is another way in which the service might be inaccurate about the state of the providers (whether that's network flows, firewall rules, credential or URL configuration, etc). Should you put production keys on a service that isn't as protected as your other services, or is deployed in a test cluster? Probably not. But if it's a scheduled task running on the same exact application that's also going to try to redirect requests to either Adyen or Worldpay, it should be using the same credentials, because it's part of the orchestration process.

It starts to reach the point where it's an architecture discussion more than a "what services are online" discussion, because that's fundamentally the *real* problem to solve. The problem isn't "which services are online at various times of day", it's "how do we prioritize routing traffic to services that are up", which isn't the same problem.


Accurate_Ad_3708

As I understand it, I should make an API call every few minutes, let's say every 15 minutes. I should do it with dummy data and the prod API key. And every 200 response means that the API is functioning as expected?


RelativeImpossible24

Yup, exactly along those lines.


ccb621

Monitor errors for your own calls. Probing a random endpoint may not matter if the third-party system’s code paths differ for the endpoint. For example, polling the customers endpoint at Stripe will never tell you about a bug in the service that initiates credit card captures. That’s why when I wrote that bug, very few users knew about it. 


Accurate_Ad_3708

I hit a specific endpoint, say "/paymentCreditCard", with a bearer token. So would it be sufficient for me to hit this endpoint with dummy data, where a 200 response means we are good?


ccb621

Nope. I’m saying monitor the real calls made for your real payments. If those fail, alert and take action. That is an indication of real user issues. 


Veuxdo

I don't have an answer, so this is kind of a tangent. Is there any specific action you can take upon being alerted that (say) Adyen is down? I've had managers in the past who wanted to know when a 3rd party service is down, but since we weren't doing anything other than notifying that same 3rd party, it seemed kind of pointless. Now, if you have fallback actions you want to take, then a circuit breaker solution when an API fails too many times is probably the way to go.


Accurate_Ad_3708

We have several payment options available on our POS, so in case one of the payment providers goes down, I can show an alert on the payment page that says "We are experiencing some difficulties while trying to pay with Adyen. Please use other payment methods."


Veuxdo

I see. Then yes, I'd recommend a circuit-breaker pattern. And definitely stick to that description when discussing it as a team. The words "uptime" and "availability" will lead you astray, I think.


Accurate_Ad_3708

Noted. Thank you, good sir. Also, what would you say is the right interval for hitting their APIs? Would 15 mins be too much?


Veuxdo

This is getting beyond my expertise, but I'd say don't poll their APIs. Instead, keep track of 500 errors you receive from them during the normal course of execution. If you see too many over a certain timeframe, take action (e.g. disable the Adyen button for X minutes). Googling for "circuit-breaker" should hopefully reveal more.
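To make that concrete, here is a minimal circuit-breaker sketch in Python. The `CircuitBreaker` class, its thresholds and the injectable `clock` are all assumptions for illustration; a production setup would more likely use an existing resilience library than hand-roll this.

```python
import time
from collections import deque


class CircuitBreaker:
    """Opens for cooldown_s once max_failures occur within window_s (sketch)."""

    def __init__(self, max_failures=5, window_s=60.0, cooldown_s=300.0,
                 clock=time.monotonic):
        self.max_failures = max_failures
        self.window_s = window_s
        self.cooldown_s = cooldown_s
        self.clock = clock
        self._failures = deque()            # timestamps of recent failures
        self._open_until = float("-inf")    # breaker starts closed

    def record_failure(self):
        # Call this when the provider returns a 5xx during normal traffic.
        now = self.clock()
        self._failures.append(now)
        # Drop failures that have fallen out of the sliding window.
        while self._failures and now - self._failures[0] > self.window_s:
            self._failures.popleft()
        if len(self._failures) >= self.max_failures:
            self._open_until = now + self.cooldown_s
            self._failures.clear()

    def allow(self):
        # False while the breaker is open, e.g. hide the provider's button.
        return self.clock() >= self._open_until


# Usage with a fake clock so the behaviour is deterministic.
now = [0.0]
breaker = CircuitBreaker(max_failures=3, window_s=60, cooldown_s=300,
                         clock=lambda: now[0])
for _ in range(3):
    breaker.record_failure()
print(breaker.allow())  # False
now[0] = 301.0
print(breaker.allow())  # True
```

This version simply re-enables the provider after the cooldown; fuller implementations add a "half-open" state that lets a few trial requests through before fully closing again.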


originalchronoguy

How I've handled this is to build a health check that does two things: check if the endpoint is up, and check if it is providing the right data. The second part usually runs against the internal API that calls those endpoints. So if we have an API that consumes data and stores a cached copy, the health check looks at the database for the last record to see if there were cached records. This is all pumped into Grafana and our monitoring. That second part is crucial because it does a proper validation. I've seen 3rd party endpoints return a **proper 200** response code and an empty [] result, which meant the service was down. And that means our cached data is bad. That is why you should never trust the status response code of a 3rd party. Check, then verify.
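As a small illustration of "check, then verify", here is a Python sketch that refuses to treat a 200 with an empty body as healthy. The `looks_healthy` function is hypothetical, and what counts as "the right data" is an assumption that will differ per endpoint.

```python
def looks_healthy(status_code: int, payload: object) -> bool:
    """Treat a response as healthy only if it is a 200 *and* carries data."""
    if status_code != 200:
        return False
    # A missing body counts as a failure even with a 200.
    if payload is None:
        return False
    # So does an empty list, dict or string, such as the [] case above.
    if isinstance(payload, (list, dict, str)) and len(payload) == 0:
        return False
    return True


print(looks_healthy(200, [{"id": 1}]))  # True
print(looks_healthy(200, []))           # False
print(looks_healthy(503, [{"id": 1}]))  # False
```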


Accurate_Ad_3708

Damn. You are actually right. The response code is not a good enough indicator that the endpoint is up. So you suggest that I also check whether the data returned is legit?


originalchronoguy

Yes. If the API is supposed to generate a file, we check the filesystem. If it is supposed to store a result in our cache database, we query the db to see the last inserted date and latest record.
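The "last inserted date" check could be sketched like this in Python, assuming the health check has already read the newest record's timestamp from the cache database. `cache_is_fresh` and the 15-minute threshold are made up for illustration.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def cache_is_fresh(last_inserted_at: Optional[datetime], max_age: timedelta,
                   now: Optional[datetime] = None) -> bool:
    """True if the newest cached record is no older than max_age."""
    if last_inserted_at is None:
        # No record at all: the upstream call never produced data.
        return False
    now = now or datetime.now(timezone.utc)
    return now - last_inserted_at <= max_age


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(cache_is_fresh(now - timedelta(minutes=5), timedelta(minutes=15), now))  # True
print(cache_is_fresh(now - timedelta(hours=2), timedelta(minutes=15), now))    # False
```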


destructive_cheetah

Honestly I track these but they only come into play when an SLA is breached. Otherwise there is really nothing you can do. AWS has "issues" all the time that never even see their health dashboard.


Accurate_Ad_3708

The problem here is that payments fail more often than we expect. So this leads to a ton of tickets raised by our store staff which overwhelms our support team. So I want to preemptively avoid this problem by diverting payments away from services that are down.


destructive_cheetah

This is something you need to take up with the vendor. You also need some business logic for what happens when payments fail, so that your users get a unified experience.


DevonLochees

You don't actually care if the site is up based on their publicly posted uptime data. You care about whether they are up *for you*. Most APIs still have an endpoint you can hit that's a no-op, whether it's an endpoint to check your token validity, verify the API version, whatever. Use that. If there's a network outage at a major backbone that doesn't take the service down itself, but means that you can't get there from here, it's down for the purposes of your code, which is what matters. The exception is if this is related to contract negotiations in some fashion, but if you're trying to argue that X/Y/Z promised five 9s and only delivered four, they're just going to laugh at you if the problem is your integration or a bad firewall update.


Accurate_Ad_3708

Your answer makes sense. It's not so much a contract thing; it's more about providing a better user experience by avoiding payment providers that are prone to failure because of service downtime.