
My paid auth service is down 503.

Last active 21 days ago

102 replies

2 views

  • AR

    Please help https://dqfvtxcupelpaxvlysfr.auth.us-east-1.nhost.run/v1/signin/provider/google

  • NU

    your auth service is being hammered with requests to /token and is crashing because of memory

  • AR

    what changed? we've not been having these issues before after upgrading, we had even more users in the past

  • AR

    I can't change the app not to hammer since it will take a day for release (it's a chrome extension that goes through verification process).

  • AR

    Can we get some monitoring, some visibility into what is going on?

  • NU

    DM

  • AR

    This issue is not resolved

  • PK

    When you say "before after upgrading" < did you upgrade something? Newer CLI version? Newer SDK version?

    Or did something else upgrade you?

    If it's something from your repo that's in a new deployment, you could try rolling it back.

    As mentioned in the other channel, Nhost help comes online here when Europe wakes up. I'm sure you'll have help shortly.

  • AR

    We cannot do anything on our side to change this behavior. There is no way "to roll back things" even if we knew what it was. We ask for help on your side until the issue can be found and until we move our services to another place that does not go down.

  • AR

    Our app has been down for more than 5 hours.

  • AR

    We did have an outage similar to this about three weeks ago. We upgraded to premium and things got resolved.

  • AR

    Since then no auth code has been changed on our side

  • AR

    Please kindly scale out the service to serve our traffic until we move our app onto another more reliable service.

  • PK

    I'm just another user trying to help you. Sorry that I can't. There will be someone from NHost along shortly to help I'm sure.

  • AR

    Ah man sorry

  • PK

    NP. Sun's coming up in Europe soon :-), help's on its way then…

  • AR

    I'm going to sleep. It's 1am in Florida right now. Please fix the service. My partner Alex may join discord to continue the conversation. He is in Poland

  • DA

    morning, you are using 4 times the allowed resources for a pro app since last night. I am allowing you now to use up to 8 times the allowed resources. You should consider two things at this point:

    1. Why does your application need 4x the CPU and RAM that it did 24h ago
    2. A custom plan with custom resources
  • DA

    also, do not issue any update to your nhost project (i.e. do not change any setting or environment variable) unless you want to revert the increase in resources. To make your custom resources permanent you need a custom plan

  • DA

    this feature is coming soon

  • DA

    looks like your resource consumption is going down to normal levels now. If your chrome extension is trying to perform some scheduled action and retry until it succeeds, I'd suggest you implement both an exponential backoff mechanism + a random timer so not all extensions hammer the service at the same time, especially when there is trouble
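
For readers following along, the "exponential backoff + random timer" suggestion could be sketched roughly as below. This is a minimal illustration and not nhost SDK code: the function names, the 1-second base and the 60-second cap are assumptions picked for the example.

```typescript
// Sketch only: names and default values are illustrative assumptions,
// not part of the nhost SDK.

/** Upper bound of the wait before retry number `attempt` (0-based). */
function backoffCeilingMs(attempt: number, baseMs = 1000, capMs = 60000): number {
  // Double the ceiling on every failed attempt, but never exceed capMs.
  return Math.min(capMs, baseMs * Math.pow(2, attempt));
}

/**
 * Random wait in [0, ceiling), so that many clients that failed at the
 * same moment do not all retry at the same moment.
 * `random` is injectable so the function can be made deterministic in tests.
 */
function jitteredDelayMs(
  attempt: number,
  random: () => number = Math.random,
  baseMs = 1000,
  capMs = 60000,
): number {
  return Math.floor(random() * backoffCeilingMs(attempt, baseMs, capMs));
}
```

The exponential part makes each client back off further the longer the outage lasts; the random part spreads the retries of different extension installs over time instead of letting them arrive in synchronized waves.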

  • DA

    disregard, you are still using more resources than usual

  • DA

    as things seem to be settling down I am restoring your resource limits to the normal levels. If you wish to make your extra resources permanent let us know

  • ED

    The authorization is not working for us, I just tested it

  • ED

    Any idea on what's still going on?

  • DA

    looking

  • ED

    Thanks

  • DA

    massive spike of resource consumption, around 9:45 UTC

  • DA

    top graph is CPU and bottom graph is RAM. Green line is hasura and yellow lines auth

  • DA

    increasing your resources temporarily

  • ED

    When should this help us get back live?

  • ED

    https://dqfvtxcupelpaxvlysfr.auth.us-east-1.nhost.run/v1/signin/provider/google - the auth keeps showing the 503 message

  • ED

    I'm not a hacker so not familiar with all the terminology, apologies

  • DA

    ok, just created 3 replicas of hasura and hasura-auth and doubled the RAM for each (meaning you have 6x the resources you had 24h ago)

  • ED

    Now it shows 502 message

  • ED

    Let me rephrase this - what should we do ASAP to get back live?

  • ED

    Please let me know if you can help!

  • DA

    I am adding even more resources. We are now at 12x the amount of resources from 24h ago

  • DA

    let's see if it holds

  • ED

    Is the issue in just not having enough resources?

  • ED

    What's the turn-around time for me to see if changes help?

  • ED

    Nothing yet, same 502 message

  • DA

    looking

  • ED

    Finally worked for me

  • DA

    do you have something that is triggered based on time?

  • ED

    Thanks. Any way of monitoring that so that the issue of down-time doesn't persist?

  • DA

    ok, something must have changed though, you are using 12x the amount of resources you were using yesterday

  • ED

    That's super weird. And when you refer to resources, what kind of resources are we talking about?

  • DA

    CPU and RAM

  • ED

    Yeah I got that but I mean more specifically what can trigger the increase?

  • ED

    The most common ones

  • DA

    to give you an idea, that's the CPU graph of your app for the last 48 hours

  • ED

    Alright, so are there any common reasons for such spikes? Among various clients of yours

  • ED

    What could have potentially caused this?

  • DA

    I am afraid I don't know your business, users or application to know that

  • ED

    Certainly not users who have installed the app

  • ED

    That's why I am asking - I have no fucking clue what could have caused this

  • DA

    that is requests per second to your service in the last 48 hours

  • DA

    your application has gone from a steady ~70rps to ~2600rps an hour ago

  • DA

    when I have seen this sort of spike it's usually been a bug in the code. Most likely some retry operation without any exponential backoff. Something like:

    // Anti-pattern: retry immediately and forever, with no wait between attempts.
    while (true) {
        const resp = query_backend()
        if (resp.status_code === 200) {
            break
        }
    }
    

    Basically just keep trying something until it succeeds without any waiting period or something
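
As a contrast to the loop above, here is a hedged sketch of a retry that waits longer after each failure and eventually gives up. `retryWithBackoff` and its options are hypothetical names made up for this example, not an nhost SDK API, and `operation` stands in for the `query_backend()` call.

```typescript
// Sketch only: hypothetical helper, not part of any real SDK.

type RetryOptions = {
  maxAttempts: number; // total tries before giving up
  baseMs: number; // wait before the first retry
  capMs: number; // upper bound for any single wait
  sleep?: (ms: number) => Promise<void>; // injectable for tests
};

const defaultSleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

async function retryWithBackoff<T>(
  operation: () => Promise<T>,
  options: RetryOptions,
): Promise<T> {
  const sleep = options.sleep ?? defaultSleep;
  let lastError: unknown = new Error("maxAttempts must be at least 1");
  for (let attempt = 0; attempt < options.maxAttempts; attempt++) {
    try {
      return await operation(); // success: stop retrying
    } catch (error) {
      lastError = error; // remember the failure and fall through to the wait
    }
    if (attempt < options.maxAttempts - 1) {
      // Wait baseMs, 2*baseMs, 4*baseMs, ... but never more than capMs.
      await sleep(Math.min(options.capMs, options.baseMs * Math.pow(2, attempt)));
    }
  }
  // Give up instead of looping forever; the caller decides what happens next.
  throw lastError;
}
```

A random offset on each wait would additionally keep many clients from retrying in lockstep. The bounded attempt count is what prevents a struggling backend from being hit indefinitely.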

  • ED

    Thanks! I guess @Artem Vysotsky will take a look at that

  • AR

    Thanks for helping us @david.barroso . By looking at the client code (written by nhost) I think we are getting confused about the root cause.

    What I'm seeing:

    1. For a while now, we have had a piece of code that calls nhost.auth.refreshSession on each page load. That should be OK logic, and it hasn't changed for a while (we will make sure this happens less frequently on our end)
    2. Looking at the code, that method retries a couple of times with a one-second interval on failure (this code is written by you folks)
    3. So every time the service goes down, all our clients see an error from the /token endpoint and retry multiple times, creating a cascade of requests
    4. This keeps happening until the service can handle all the load and return 200 to each /token call

    Based on this, here are a couple of recommendations for your code:

    1. Do exponential backoff inside the auth client on the refreshSession call
    2. After a certain number of retries, stop trying and LOG OUT the user, so the client stops retrying.

    That being said, the root cause is not us hammering the /token endpoint. It is the service going down, triggering retries. Why the service went down is another question that you guys have to answer.
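
On the point about calling `nhost.auth.refreshSession` less often than on every page load, one possible shape is a small gate that allows a refresh at most once per interval. `RefreshGate` is a hypothetical helper invented for this sketch, not part of the nhost SDK, and the interval is an arbitrary assumption.

```typescript
// Sketch only: hypothetical helper, not part of the nhost SDK.
// It decides whether a given page load should trigger a session refresh.

class RefreshGate {
  private lastAttemptMs: number | null = null;
  private readonly minIntervalMs: number;

  constructor(minIntervalMs: number) {
    this.minIntervalMs = minIntervalMs;
  }

  /** Returns true at most once per `minIntervalMs` and records the attempt. */
  shouldRefresh(nowMs: number = Date.now()): boolean {
    if (this.lastAttemptMs !== null && nowMs - this.lastAttemptMs < this.minIntervalMs) {
      return false; // refreshed recently: skip this page load
    }
    this.lastAttemptMs = nowMs;
    return true;
  }
}
```

A page-load handler would call `shouldRefresh()` first and only then call `nhost.auth.refreshSession`. In a real extension the timestamp would also need to be kept somewhere that survives page loads, which this in-memory sketch does not do.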

  • DA

    @Artem Vysotsky would you mind updating the SDK and releasing a new version of your chrome extension? https://github.com/nhost/nhost/releases/tag/%40nhost%2Fhasura-auth-js%401.12.2

  • AR

    I will, thanks

  • DA

    when do you think the extensions will be updated? mostly to monitor it and wind down the extra resources

  • AR

    We just published the app to the store. It will take 24 hours to get verified and released

  • AR

    Also, can you help with something else? We just migrated to our own nhost-auth and now your postgresql database is not handling connections

  • AR

    We will eventually migrate the database too, but could you please scale the database (you can scale down the auth service though)

  • AR

    Another option:
    We have a database ready, but need to point your services temporarily to our database. Is this doable?

  • DA

    I don't understand, from what I am reading you plan to self host, don't you?

  • AR

    Yes, but we only started with self hosting hasura-auth, not the entire stack.

    1. We deployed hasura-auth to GCP, which still points to nhost's postgres. All other services (hasura itself) are on nhost for now.

    2. Now we are serving requests to auth fine, but your database is not big enough to serve the volume of requests.

    It means that we need to migrate everything to GCP, including hasura itself. But we cannot do this right now as the chrome extension app is still pointing to nhost hasura.

  • AR

    So the current plan is (if you can help)

    Phase 1: mitigate the PostgreSQL bottleneck

    Option 1:
    1. We will roll out a backup onto our own postgresql
    2. We will ask you to point YOUR hasura to our postgresql
    3. We will point OUR hasura-auth to our postgresql

    Option 2:
    1. You scale out the postgresql database temporarily

    Phase 2:
    Once the postgresql issue is mitigated we will roll out our own hasura and will move to our own stack

  • DA

    given your plans my suggestion is to just have our hasura point to your own database

  • DA

    you can do it in the console

  • DA

    click on edit and set your own parameters

  • AR

    ah in hasura itself !

  • AR

    Gotcha

  • AC

    Hi @david.barroso I'm facing the same issue, could you please help me out?
    Subdomain: qxyuvkekgwgkjstpyzge
    Region: ap-south-1

  • AC

    Found the issue

  • AC

    User permissions issue

  • AC

    Thanks @david.barroso and @Nuno Pato

  • NU

    @Achu what was the issue you were getting?

  • NU

    were calls to /token returning 500?

  • AC

    My bad, that's a different issue, which I resolved.

  • AC

    My actual issue is that when reset password or verify email link is sent to mail I'm getting

  • AC

    @Nuno Pato

  • NU

    can you replicate that now?

  • NU

    can't find anything in the logs

  • AC

    let me try

  • AC

    Done

  • AC

    It was 502

  • NU

    I don't see any 502 in the logs

  • AC

    Now getting 500 error when clicking on the reset password link

  • AC

    @Nuno Pato

  • NU

    can you send me the verification url via PM?

  • AC

    Sure

  • AC

    sent
