- SH
I'm doing a test run of my migration to nhost of my non-nhost project (from Cloud Storage). While trying to upload files to the storage API I get 503 Service Unavailable. I don't see anything useful in the Storage logs (i actually don't even see any of the successful requests from the last hour, which is strange)
- SH
i see some lines like this:
"operation":{"error":{"code":"already-tracked","error":"view/table already tracked : \"storage.buckets\""
so maybe it's having trouble applying metadata? i'm not sure why it's even trying to apply metadata though, the whole graphql console was up and running before i started running additional scripts
- SH
and i can still access the hasura console. is anyone from nhost around to help debug what might be going on? all my upload requests are still resulting in 503 errors
- EL
This usually happens during startup of Storage when it tries to apply its own metadata. As the table is already tracked Hasura returns an error message. That's OK.
- SH
but i'm getting 503s and i can't upload any files
- EL
What's the subdomain?
- SH
noiipockmjzsqmoskfts
- EL
Maybe @david.barroso can jump in and check?
- EL
in US region?
- DA
looking
- DA
mmm, strange, everything is up
- DA
I can see the service up and the healthchecks returning 200
- DA
at least for the last 13min
- DA
could you try to do what you were doing again so we can see if something happens?
- DA
oh, wait, something just happened
- DA
disregard, false alarm
- SH
ok, i'm trying -- the upload is hanging
- DA
I see
- DA
I can see hasura-storage is being terminated due to a spike in RAM consumption
- DA
which is very strange, what is your script doing?
- SH
i'm uploading my files into nhost from my laptop (downloaded from cloud storage from my existing project first)
- DA
how many files are you trying to upload in parallel?
- SH
due to this bug https://github.com/nhost/nhost/issues/1103 i had to monkey patch the upload function so it's using fetch instead of axios
- SH
unfortunately it's completely serial 😦
- SH
i'm not trying to do anything in parallel at all
- SH
it's awaiting and completing each upload before going to the next one
- SH
i successfully uploaded a fair number of other files (also serial, not parallel) before it crashed and stopped responding
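For reference, the serial upload loop described here might look roughly like the sketch below. It assumes Node 18+ (global fetch, FormData and Blob, which avoids the axios maxBodyLength code path entirely); the endpoint path, the file[] form field, and the x-hasura-admin-secret header are assumptions to check against the hasura-storage docs, and the subdomain/region values are placeholders.
```ts
// Sketch of a strictly serial upload loop (Node 18+: global fetch/FormData/Blob).
// The endpoint path, form field name and auth header below are assumptions,
// not confirmed values from this thread.
import { readFile, readdir } from "node:fs/promises";
import { basename, join } from "node:path";

const STORAGE_URL = "https://<subdomain>.storage.<region>.nhost.run/v1/files"; // placeholders
const ADMIN_SECRET = process.env.NHOST_ADMIN_SECRET ?? ""; // assumed auth mechanism

async function uploadFile(path: string): Promise<void> {
  const form = new FormData();
  // Wrap the bytes in a Blob since the browser File class is not available in Node.
  form.append("file[]", new Blob([await readFile(path)]), basename(path));

  const res = await fetch(STORAGE_URL, {
    method: "POST",
    headers: { "x-hasura-admin-secret": ADMIN_SECRET }, // assumed header name
    body: form,
  });
  if (!res.ok) throw new Error(`upload of ${path} failed with ${res.status}`);
}

// Each upload is awaited before the next one starts, so there is never more
// than one request in flight.
async function migrate(dir: string): Promise<void> {
  for (const name of await readdir(dir)) {
    await uploadFile(join(dir, name));
    console.log("uploaded", name);
  }
}
```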
- SH
now i'm getting 502 bad gateway -- the first responses were 503 Service Unavailable. i guess it's still terminated?
- DA
https://noiipockmjzsqmoskfts.storage.us-east-1.nhost.run/v1/version
- DA
I don't see any 502
- SH
502 Bad Gateway <h1>502 Bad Gateway</h1> <hr />nginx
- SH
that's the response from my POST to the storage/files
- SH
the script worked for a number of files before failing though
- SH
i have also tried removing the monkey patch but the uploads fail before getting any further due to the maxBodyLength bug: Request body larger than maxBodyLength limit
- DA
how large are those files?
- SH
so i can't test if there is something about fetch that would somehow be causing this issue (though i'm not sure how)
- SH
a variety of sizes
- DA
what's the largest? I just want to see if I can reproduce locally
- SH
anywhere from 1MB to 100MBish
- SH
i think 115MB succeeded
- SH
before the whole service went down
- DA
ok, let me see if I can reproduce somehow locally first
- DA
I increased your memory limit so you are not blocked with the migration while I figure this out, so feel free to proceed
- SH
thank you
- SH
is there a way to see issues like this in the logs or is it something only visible internally?
- SH
i couldn't really see anything or figure out how to restart if that was a possibility. does a normal deploy restart hasura storage and would it free up RAM?
- SH
hopefully this particular RAM issue is something that can be fixed but just for understanding where to debug
- DA
for now it is just internal, we have plans to expose this information though. Our internal orchestration restarts services if they exceed certain RAM thresholds, which usually frees memory
- SH
locally, i did notice ECONNRESET frequently on the storage container but i thought it was possibly a cli-only issue
- DA
that's usually a good indication of a networking issue or the service being restarted
- DA
in the logs you should see hasura-storage restarting around that time if it was due to that
- DA
something like
time="2022-11-14T08:57:15Z" level=info msg="starting server"
- SH
i started the script again and it's working for now in case it helps you see the kinds of requests
- DA
no worries, it was easy to reproduce locally. The fix is implemented here already: https://github.com/nhost/hasura-storage/pull/133
- DA
I was basically allocating an unnecessarily large buffer for uploads
- SH
were you able to repro the maxBodyLength error also by any chance? it seems to be a form-data issue but affects node usage of the library because File is not available outside the browser
- DA
you mean this? https://discord.com/channels/552499021260914688/1041623658990411787/1041634073224941579
- DA
no, I didn't try either. If it's an SDK thing @elitan can probably direct that to a more suitable person, I am afraid I feel a bit handicapped in that respect :P
- EL
Yes I’ve planned to check this week.
- SH
i'm having some trouble again and getting storage timeouts (again i can't see anything in the logs). this part of the script does a Promise.all await for a lot of requests, but they are all very small files (like 40 KB or so each)
- SH
nm, seems to have resolved for now
- SH
i know it's on your roadmap but would be awesome to have some of those internal tools/logs exposed someday since the dashboard logs often continue to show 200s when i was actually getting 500s or timeouts
- EL
Maybe the 500 is happening earlier than Storage, on the ingress Nginx on your side. If you find a way to reproduce please let us know here or on GitHub.
- EL
I'll also just cc in @Nuno Pato so he's aware of this conversation too
- SH
hi @david.barroso was just wondering if this fix is fully live?
- SH
i am about to run migrations on my production projects now so want to make sure the memory won't cause issues again
- SH
i see it's merged but not sure it's deployed
- DA
no, it isn't yet
- SH
(or if possible could you add some RAM to my subdomains at least temporarily?)
- SH
if i give you the projects?
- DA
give me your projects and I can update them
- SH
DMing you
- DA
perfect
- DA
btw, you can verify yourself the version you are running with
https://$subdomain.storage.$region.nhost.run/v1/version
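For a quick scripted check (rather than opening the URL in a browser), the same endpoint can be hit with fetch; a minimal sketch assuming Node 18+, with placeholder subdomain and region:
```ts
// Print the status and body of the storage service's version endpoint (Node 18+).
const url = "https://<subdomain>.storage.<region>.nhost.run/v1/version"; // placeholders

const res = await fetch(url);
console.log(res.status, await res.text());
```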
- SH
just to follow up on this, the RAM fix seems to be working. didn't get any timeout or other 500 errors from the storage service during the process
- SH
spoke too soon, i am seeing EPIPE and ETIMEDOUT on the last part of my migration. 😦 this part of my script attempts to upload lots of small files using Promise.all
- SH
guessing it can't handle high numbers of simultaneous requests yet…i'll start to throttle on my end but would be good to know what the limits are or get a clearer error before hitting 500s and ETIMEDOUT at the nginx layer
- EL
@david.barroso & @Nuno Pato Let's look at this together on Monday. Maybe together with @sheena if we want to live debug this.
- DA
I don't think there is much to debug here. Uploading a file requires a buffer of 16MB at most, so if you are uploading files uncontrollably you may start seeing issues when reaching 25-30 concurrent uploads, depending on other uses of the service and the file sizes (25-30 is the worst case with no other use of the service). The solution is either to increase RAM for the service or to add replicas of it. I just confirmed this by easily uploading 25 x 100 MB files in parallel without exceeding 400MB of total RAM usage in the service
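Given that figure (up to ~16 MB of buffer per in-flight upload), the usual client-side workaround for the unbounded Promise.all pattern is to cap concurrency. A minimal worker-pool sketch, reusing the hypothetical uploadFile helper from the earlier sketch; the limit of 10 is just an illustrative value below the 25-30 worst case quoted above:
```ts
// Upload many files with at most `limit` requests in flight, instead of
// launching every upload at once with a single unbounded Promise.all.
async function uploadAll(paths: string[], limit = 10): Promise<void> {
  let next = 0;
  // Start `limit` workers; each worker keeps pulling the next path until none remain.
  const workers = Array.from({ length: limit }, async () => {
    while (next < paths.length) {
      const path = paths[next++]; // next++ runs synchronously, so no two workers share a path
      await uploadFile(path);     // hypothetical helper from the earlier sketch
    }
  });
  await Promise.all(workers);
}

// usage: await uploadAll(allPaths, 10);
```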
- DA
unless anyone is seeing something different, of course
- SH
if that's the case, it would still be good to get better errors from this instead of 500s before the storage layer, or at least to document the expectations and limitations of nhost infra/storage so people know how to alter their code to work with nhost defaults. (e.g. i am coming from google cloud storage where the same upload code worked without any timeouts…understandable that google infra can handle much higher numbers of simultaneous uploads than nhost can at the moment, but still useful for users to know what the limits are)
- DA
re errors, you are getting a 503, right? That's the correct error code (the service is unavailable as it's being restarted). In any case, I agree you need better visibility; we are working on giving you access to memory/CPU usage so you can see for yourself what's happening here. Also, it is not about google infra being able to handle more concurrent requests (which it can), it is about the resources we are giving you. If you need more we can give you more. Right now we only have the free and pro plans on our website, but we can give you a custom plan with more RAM, CPU or disk if needed
- DA
another solution to the 503s would be to be able to "stall" or "reject" requests based on current usage of the service and the expected requirements of the current request. This is a bit more complex than just a regular rate-limit as not all requests have the same memory requirements
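Purely as an illustration of that idea (this is not how hasura-storage is implemented), a memory-aware admission check could reserve an estimated buffer per upload and turn away requests that would exceed a budget, rather than letting the process be killed:
```ts
// Illustrative memory-aware admission control, not actual hasura-storage code.
const BUFFER_PER_UPLOAD = 16 * 1024 * 1024; // 16 MB, the per-upload figure above
const MEMORY_BUDGET = 400 * 1024 * 1024;    // illustrative overall budget

let reserved = 0; // bytes currently reserved by in-flight uploads

async function withAdmission<T>(handler: () => Promise<T>): Promise<T> {
  if (reserved + BUFFER_PER_UPLOAD > MEMORY_BUDGET) {
    // A real service would answer 503 (or 429) with a Retry-After header here.
    throw new Error("over capacity, try again later");
  }
  reserved += BUFFER_PER_UPLOAD;
  try {
    return await handler();
  } finally {
    reserved -= BUFFER_PER_UPLOAD;
  }
}
```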
- DA
btw, you shouldn't see this type of problem getting files, just uploading them (or doing image manipulation stuff if you are). Not sure if you have seen these blog posts but this may be of interest to you:
https://nhost.io/blog/hasura-storage-in-go-5x-performance-increase-and-40-percent-less-ram
https://nhost.io/blog/launching-nhost-cdn-nhost-storage-is-now-blazing-fast
- EL
@david.barroso If this was a RAM usage issue, would that mean Sheena should be able to see Hasura Storage restarting in the logs after she gets 500s?
- DA
yes
- DA
@sheena let me play a bit with the internal cache size tomorrow, we may be able to tweak it. I just need to find a value that doesn't compromise upload performance too much while allowing more (worst case scenario) concurrent uploads
- EL
Sheena, do you see Storage restarting in the logs in the Nhost dashboard after you get 500s?
- SH
i'm trying to find the logs for yesterday but even when i set a time range of a few hours it only shows a handful of logs
- SH
is there a way to search or at least filter by non-200s? the logs are a bit hard to use unless you're watching them live
- SH
is there a way to send them to papertrail or something like that for additional filtering and alerts?
- EL
We don't have any filtering or alerts yet, sorry. Maybe the quickest way to check is to re-upload images and trigger the 500s again