QQ on `kubeflow` as metadata store :thread:

Last active 5 days ago

9 replies

1 views

  • AM

    QQ on kubeflow as metadata store :thread:

  • AM

    I have a bunch of pipelines running on Kubeflow, and after a while I get this error:
    ```
    ╭───────────────────── Traceback (most recent call last) ──────────────────────╮
    │ /usr/local/lib/python3.9/site-packages/ml_metadata/metadata_store/metadata_s │
    │ tore.py:213 in _call_method                                                  │
    │                                                                              │
    │   210 │   else:                                                              │
    │   211 │     grpc_method = getattr(self._metadata_store_stub, method_name)    │
    │   212 │     try:                                                             │
    │ ❱ 213 │   │   response.CopyFrom(grpc_method(request, timeout=self._grpc_tim  │
    │   214 │     except grpc.RpcError as e:                                       │
    │   215 │   │   # RpcError code uses a tuple to specify error code and short   │
    │   216 │   │   # description.                                                 │
    │                                                                              │
    │ /usr/local/lib/python3.9/site-packages/grpc/_channel.py:946 in __call__      │
    │                                                                              │
    │   943 │   │   │   │    compression=None):                                    │
    │   944 │   │   state, call, = self._blocking(request, timeout, metadata, cre  │
    │   945 │   │   │   │   │   │   │   │   │     wait_for_ready, compression)     │
    │ ❱ 946 │   │   return _end_unary_response_blocking(state, call, False, None)  │
    │   947 │                                                                      │
    │   948 │   def with_call(self,                                                │
    │   949 │   │   │   │     request,                                             │
    │                                                                              │
    │ /usr/local/lib/python3.9/site-packages/grpc/_channel.py:849 in               │
    │ _end_unary_response_blocking                                                 │
    │                                                                              │
    │   846 │   │   else:                                                          │
    │   847 │   │   │   return state.response                                      │
    │   848 │   else:                                                              │
    │ ❱ 849 │   │   raise _InactiveRpcError(state)                                 │
    │   850                                                                        │
    │   851                                                                        │
    │   852 def _stream_unary_invocation_operationses(metadata, initial_metadata   │
    ╰──────────────────────────────────────────────────────────────────────────────╯
    _InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
        status = StatusCode.RESOURCE_EXHAUSTED
        details = "Received message larger than max (5282925 vs. 4194304)"
        debug_error_string = "UNKNOWN:Error received from peer metadata-grpc-service.kubeflow:8080 {grpc_message:"Received message larger than max (5282925 vs. 4194304)", grpc_status:8, created_time:"2023-03-17T18:00:48.942206603+00:00"}"
    >
    │                                                                              │
    │   215 │   │   # RpcError code uses a tuple to specify error code and short   │
    │   216 │   │   # description.                                                 │
    │   217 │   │   #                                                              │
    │ ❱ 218 │   │   raise _make_exception(e.details(), e.code().value[0])  # pyty  │
    │   219                                                                        │
    │   220   def _pywrap_cc_call(self, method, request, response) -> None:        │
    │   221 │   """Calls method, serializing and deserializing inputs and outputs  │
    ╰──────────────────────────────────────────────────────────────────────────────╯
    ResourceExhaustedError: Received message larger than max (5282925 vs. 4194304)
    ```

  • AM

    there are a lot of similar issues I can see on the net; here are some hunches:
    • the MySQL DB somehow becomes unresponsive due to duplicate UUIDs in the run details
    • the gRPC message size limit defaults to 4 MB (the 4194304 in the error above) and should be increased
    • the mysqlGroupConcatMaxLen parameter of the MySQL DB should be increased
    • …
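
On the gRPC-limit hunch: the 4194304 in the error is 4 × 1024 × 1024 bytes, and the rejected response was 5282925 bytes. Below is a minimal sketch of how a Python gRPC client can raise that limit through channel options, assuming you control how the channel is created. `channel_options`, `open_channel` and the 16 MiB value are hypothetical names and numbers chosen for illustration, not part of Kubeflow or ml-metadata.

```python
# Default client-side receive limit, as reported in the error message.
DEFAULT_MAX_BYTES = 4 * 1024 * 1024  # 4194304


def channel_options(max_message_mib: int) -> list:
    """Build gRPC channel options that raise the per-message size limits."""
    max_bytes = max_message_mib * 1024 * 1024
    return [
        ("grpc.max_receive_message_length", max_bytes),
        ("grpc.max_send_message_length", max_bytes),
    ]


def open_channel(target: str, max_message_mib: int = 16):
    """Open an insecure channel with larger limits (requires grpcio)."""
    import grpc  # imported lazily so channel_options works without grpcio

    return grpc.insecure_channel(target, options=channel_options(max_message_mib))


# The failing response (5282925 bytes) is just over the default limit.
assert 5282925 > DEFAULT_MAX_BYTES
```

With ml-metadata's own client, the equivalent knob appears to be `channel_arguments.max_receive_message_length` on `MetadataStoreClientConfig`; worth checking against the installed version. Raising the limit only buys headroom, since the response keeps growing with the number of runs.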

  • AM

    I am out of ideas; I was wondering if anyone has seen similar things before; so I am all ears :ear:

  • AM
  • AM

    in the :point_up: issue, someone said:
    "@chensun I was not facing this issue initially, but started to face once Kubeflow has more than 5k pipeline runs. We are not logging any metadata outside metadata-writer. Looks like we have to implement pagination on how metadata-writer queries metadata-server. Please see the issue here google/ml-metadata#74 and google/ml-metadata#42"
    which is exactly my case.
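
The pagination idea in that quote can be sketched generically: instead of one call that returns every record (and eventually exceeds the message limit), the client asks for fixed-size pages and follows a continuation token. This is only an illustration under assumptions. `iter_pages`, `fake_fetch` and the in-memory `RUNS` list are hypothetical stand-ins, not the metadata-writer code or the ml-metadata API.

```python
from typing import Callable, Iterator, List, Optional, Tuple

# A page is (items, next_token); next_token is None on the last page.
Page = Tuple[List[dict], Optional[str]]


def iter_pages(
    fetch_page: Callable[[int, Optional[str]], Page],
    page_size: int = 100,
) -> Iterator[dict]:
    """Yield all items by requesting bounded pages until no token remains."""
    token: Optional[str] = None
    while True:
        items, token = fetch_page(page_size, token)
        yield from items
        if not token:
            break


# Hypothetical backend standing in for a metadata server with 5k runs.
RUNS = [{"id": i} for i in range(5000)]


def fake_fetch(page_size: int, token: Optional[str]) -> Page:
    """Return one slice of RUNS plus the token for the next slice."""
    start = int(token) if token else 0
    end = start + page_size
    next_token = str(end) if end < len(RUNS) else None
    return RUNS[start:end], next_token


all_runs = list(iter_pages(fake_fetch, page_size=500))
```

Each response then carries at most `page_size` records, so its size stays bounded no matter how many runs accumulate.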

  • HA

    oh interesting. It seems more like an underlying Kubeflow issue though @amir.benny :disappointed: Not sure how to solve it other than deleting the old pipelines?

  • HA

    I suppose one advantage with ZenML would be that you would have the old pipelines at least on the ZenML side

  • AM

    yeah; I gotta see if we can get around this with some pagination or by updating the Kubeflow version. Thanks Hamza
