submitted algorithm failed  

  By: Flute on Jan. 15, 2022, 10:41 a.m.

Hi, I tried my algorithm with the local try-out and it works fine, but after submitting it to the leaderboard it fails. Could I get the log of the failure so I can debug it? Thanks!

Flute

Re: submitted algorithm failed  

  By: ecemsogancioglu on Jan. 15, 2022, 8:11 p.m.

Hi,

The error is below:

2022-01-15T09:39:51+00:00   0%|          | 0/115 [00:00<?, ?it/s]  65%|██████▌   | 75/115 [00:00<00:00, 744.91it/s] 100%|██████████| 115/115 [00:00<00:00, 745.72it/s]
2022-01-15T09:39:52+00:00 ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
2022-01-15T09:39:55+00:00 Traceback (most recent call last):
2022-01-15T09:39:55+00:00   File "/opt/conda/lib/python3.7/multiprocessing/queues.py", line 236, in feed
2022-01-15T09:39:55+00:00     obj = _ForkingPickler.dumps(obj)
2022-01-15T09:39:55+00:00   File "/opt/conda/lib/python3.7/multiprocessing/reduction.py", line 51, in dumps
2022-01-15T09:39:55+00:00     cls(buf, protocol).dump(obj)
2022-01-15T09:39:55+00:00   File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/reductions.py", line 328, in reduce_storage
2022-01-15T09:39:55+00:00     fd, size = storage._share_fd()
2022-01-15T09:39:55+00:00 RuntimeError: falseINTERNAL ASSERT FAILED at "/opt/conda/conda-bld/pytorch_1634272168290/work/aten/src/ATen/MapAllocator.cpp":300, please report a bug to PyTorch. unable to write to file
2022-01-15T09:39:55+00:00 Traceback (most recent call last):
2022-01-15T09:39:55+00:00   File "process.py", line 228, in <module>
2022-01-15T09:39:55+00:00     Maskrcnnnodecontainer(args.input_dir, args.output_dir, args=args).process()
2022-01-15T09:39:55+00:00   File "/home/algorithm/.local/lib/python3.7/site-packages/evalutils/evalutils.py", line 183, in process
2022-01-15T09:39:55+00:00     self.process_cases()
2022-01-15T09:39:55+00:00   File "/home/algorithm/.local/lib/python3.7/site-packages/evalutils/evalutils.py", line 191, in process_cases
2022-01-15T09:39:55+00:00     self._case_results.append(self.process_case(idx=idx, case=case))
2022-01-15T09:39:55+00:00   File "process.py", line 215, in process_case
2022-01-15T09:39:55+00:00     scored_candidates = self.predict(input_image=input_image)
2022-01-15T09:39:55+00:00   File "process.py", line 120, in predict
2022-01-15T09:39:55+00:00     args=(args,),
2022-01-15T09:39:55+00:00   File "/opt/algorithm/detectron2/engine/launch.py", line 82, in launch
2022-01-15T09:39:55+00:00     main_func(*args)
2022-01-15T09:39:55+00:00   File "process.py", line 68, in test_main
2022-01-15T09:39:55+00:00     res = Trainer.test(cfg, model)
2022-01-15T09:39:55+00:00   File "/opt/algorithm/detectron2/engine/defaults.py", line 624, in test
2022-01-15T09:39:55+00:00     results_i = inference_on_dataset(model, data_loader, evaluator)
2022-01-15T09:39:55+00:00   File "/opt/algorithm/detectron2/evaluation/evaluator.py", line 158, in inference_on_dataset
2022-01-15T09:39:55+00:00     outputs = model(inputs)
2022-01-15T09:39:55+00:00   File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
2022-01-15T09:39:55+00:00     return forward_call(*input, **kwargs)
2022-01-15T09:39:55+00:00   File "/opt/algorithm/detectron2/modeling/meta_arch/rcnn.py", line 146, in forward
2022-01-15T09:39:55+00:00     return self.inference(batched_inputs)
2022-01-15T09:39:55+00:00   File "/opt/algorithm/detectron2/modeling/meta_arch/rcnn.py", line 199, in inference
2022-01-15T09:39:55+00:00     features = self.backbone(images.tensor)
2022-01-15T09:39:55+00:00   File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
2022-01-15T09:39:55+00:00     return forward_call(*input, **kwargs)
2022-01-15T09:39:55+00:00   File "/opt/algorithm/detectron2/modeling/backbone/fpn.py", line 126, in forward
2022-01-15T09:39:55+00:00     bottom_up_features = self.bottom_up(x)
2022-01-15T09:39:55+00:00   File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
2022-01-15T09:39:55+00:00     return forward_call(*input, **kwargs)
2022-01-15T09:39:55+00:00   File "/opt/algorithm/detectron2/modeling/backbone/resnet.py", line 445, in forward
2022-01-15T09:39:55+00:00     x = self.stem(x)
2022-01-15T09:39:55+00:00   File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
2022-01-15T09:39:55+00:00     return forward_call(*input, **kwargs)
2022-01-15T09:39:55+00:00   File "/opt/algorithm/detectron2/modeling/backbone/resnet.py", line 356, in forward
2022-01-15T09:39:55+00:00     x = self.conv1(x)
2022-01-15T09:39:55+00:00   File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
2022-01-15T09:39:55+00:00     return forward_call(*input, **kwargs)
2022-01-15T09:39:55+00:00   File "/opt/algorithm/detectron2/layers/wrappers.py", line 107, in forward
2022-01-15T09:39:55+00:00     x, self.weight, self.bias, self.stride, self.padding, self.dilation, self.groups
2022-01-15T09:39:55+00:00   File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
2022-01-15T09:39:55+00:00     _error_if_any_worker_fails()
2022-01-15T09:39:55+00:00 RuntimeError: DataLoader worker (pid 30) is killed by signal: Bus error. It is possible that dataloader's workers are out of shared memory. Please try to raise your shared memory limit.

Re: submitted algorithm failed  

  By: Flute on Jan. 16, 2022, 9:09 p.m.

Hi Ecem, I submitted a new one but it still fails. Could I bug you again to help me check the error? Thanks!

Flute

Re: submitted algorithm failed  

  By: ecemsogancioglu on Jan. 17, 2022, 10:28 a.m.

Hi,

The error from your last job is below:

2022-01-17T09:32:27+00:00   0%|          | 0/115 [00:00<?, ?it/s]  67%|██████▋   | 77/115 [00:00<00:00, 765.01it/s] 100%|██████████| 115/115 [00:00<00:00, 767.53it/s]
2022-01-17T09:32:28+00:00 ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
2022-01-17T09:32:29+00:00 ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
2022-01-17T09:32:35+00:00 /opt/conda/lib/python3.7/site-packages/torch/functional.py:445: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at /opt/conda/conda-bld/pytorch_1634272168290/work/aten/src/ATen/native/TensorShape.cpp:2157.)
2022-01-17T09:32:35+00:00   return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
2022-01-17T09:32:44+00:00 Traceback (most recent call last):
2022-01-17T09:32:44+00:00   File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 990, in _try_get_data
2022-01-17T09:32:44+00:00     data = self._data_queue.get(timeout=timeout)
2022-01-17T09:32:44+00:00   File "/opt/conda/lib/python3.7/multiprocessing/queues.py", line 113, in get
2022-01-17T09:32:44+00:00     return _ForkingPickler.loads(res)
2022-01-17T09:32:44+00:00   File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/reductions.py", line 289, in rebuild_storage_fd
2022-01-17T09:32:44+00:00     fd = df.detach()
2022-01-17T09:32:44+00:00   File "/opt/conda/lib/python3.7/multiprocessing/resource_sharer.py", line 57, in detach
2022-01-17T09:32:44+00:00     with _resource_sharer.get_connection(self._id) as conn:
2022-01-17T09:32:44+00:00   File "/opt/conda/lib/python3.7/multiprocessing/resource_sharer.py", line 87, in get_connection
2022-01-17T09:32:44+00:00     c = Client(address, authkey=process.current_process().authkey)
2022-01-17T09:32:44+00:00   File "/opt/conda/lib/python3.7/multiprocessing/connection.py", line 498, in Client
2022-01-17T09:32:44+00:00     answer_challenge(c, authkey)
2022-01-17T09:32:44+00:00   File "/opt/conda/lib/python3.7/multiprocessing/connection.py", line 742, in answer_challenge
2022-01-17T09:32:44+00:00     message = connection.recv_bytes(256)         # reject large message
2022-01-17T09:32:44+00:00   File "/opt/conda/lib/python3.7/multiprocessing/connection.py", line 216, in recv_bytes
2022-01-17T09:32:44+00:00     buf = self._recv_bytes(maxlength)
2022-01-17T09:32:44+00:00   File "/opt/conda/lib/python3.7/multiprocessing/connection.py", line 407, in _recv_bytes
2022-01-17T09:32:44+00:00     buf = self._recv(4)
2022-01-17T09:32:44+00:00   File "/opt/conda/lib/python3.7/multiprocessing/connection.py", line 379, in _recv
2022-01-17T09:32:44+00:00     chunk = read(handle, remaining)
2022-01-17T09:32:44+00:00 ConnectionResetError: [Errno 104] Connection reset by peer
2022-01-17T09:32:44+00:00
2022-01-17T09:32:44+00:00 During handling of the above exception, another exception occurred:
2022-01-17T09:32:44+00:00
2022-01-17T09:32:44+00:00 Traceback (most recent call last):
2022-01-17T09:32:44+00:00   File "process.py", line 228, in <module>
2022-01-17T09:32:44+00:00     Maskrcnnnodecontainer(args.input_dir, args.output_dir, args=args).process()
2022-01-17T09:32:44+00:00   File "/home/algorithm/.local/lib/python3.7/site-packages/evalutils/evalutils.py", line 183, in process
2022-01-17T09:32:44+00:00     self.process_cases()
2022-01-17T09:32:44+00:00   File "/home/algorithm/.local/lib/python3.7/site-packages/evalutils/evalutils.py", line 191, in process_cases
2022-01-17T09:32:44+00:00     self._case_results.append(self.process_case(idx=idx, case=case))
2022-01-17T09:32:44+00:00   File "process.py", line 215, in process_case
2022-01-17T09:32:44+00:00     scored_candidates = self.predict(input_image=input_image)
2022-01-17T09:32:44+00:00   File "process.py", line 120, in predict
2022-01-17T09:32:44+00:00     args=(args,),
2022-01-17T09:32:44+00:00   File "/opt/algorithm/detectron2/engine/launch.py", line 82, in launch
2022-01-17T09:32:44+00:00     main_func(*args)
2022-01-17T09:32:44+00:00   File "process.py", line 68, in test_main
2022-01-17T09:32:44+00:00     res = Trainer.test(cfg, model)
2022-01-17T09:32:44+00:00   File "/opt/algorithm/detectron2/engine/defaults.py", line 624, in test
2022-01-17T09:32:44+00:00     results_i = inference_on_dataset(model, data_loader, evaluator)
2022-01-17T09:32:44+00:00   File "/opt/algorithm/detectron2/evaluation/evaluator.py", line 149, in inference_on_dataset
2022-01-17T09:32:44+00:00     for idx, inputs in enumerate(data_loader):
2022-01-17T09:32:44+00:00   File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 521, in __next__
2022-01-17T09:32:44+00:00     data = self._next_data()
2022-01-17T09:32:44+00:00   File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1186, in _next_data
2022-01-17T09:32:44+00:00     idx, data = self._get_data()
2022-01-17T09:32:44+00:00   File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1152, in _get_data
2022-01-17T09:32:44+00:00     success, data = self._try_get_data()
2022-01-17T09:32:44+00:00   File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1014, in _try_get_data
2022-01-17T09:32:44+00:00     fs = [tempfile.NamedTemporaryFile() for i in range(fds_limit_margin)]
2022-01-17T09:32:44+00:00   File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1014, in <listcomp>
2022-01-17T09:32:44+00:00     fs = [tempfile.NamedTemporaryFile() for i in range(fds_limit_margin)]
2022-01-17T09:32:44+00:00   File "/opt/conda/lib/python3.7/tempfile.py", line 547, in NamedTemporaryFile
2022-01-17T09:32:44+00:00     (fd, name) = _mkstemp_inner(dir, prefix, suffix, flags, output_type)
2022-01-17T09:32:44+00:00   File "/opt/conda/lib/python3.7/tempfile.py", line 258, in _mkstemp_inner
2022-01-17T09:32:44+00:00     fd = _os.open(file, flags, 0o600)
2022-01-17T09:32:44+00:00   File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
2022-01-17T09:32:44+00:00     _error_if_any_worker_fails()
2022-01-17T09:32:44+00:00 RuntimeError: DataLoader worker (pid 38) is killed by signal: Bus error. It is possible that dataloader's workers are out of shared memory. Please try to raise your shared memory limit.
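This is the same shared-memory exhaustion as in your previous job, just surfacing at a different point in the DataLoader. If reducing the number of workers to zero is not an option for you, another possible workaround is to switch PyTorch's tensor-sharing strategy away from shm-backed file descriptors before any DataLoader is created; whether this fits your pipeline is an assumption on my side:

    # Sketch: share worker tensors through temporary files instead of /dev/shm.
    # Must run once, before the first DataLoader is constructed.
    import torch.multiprocessing as mp

    mp.set_sharing_strategy("file_system")

It also helps to reproduce the failure locally before resubmitting, for example by limiting --shm-size when you run your container (Docker's default is 64 MB), so the local try-out is closer to the constrained environment.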