… since other time-sensitive tasks depend on them.
Note: we need to be careful with task priorities,
especially in worker pools with limited capacity.
Priorities are absolute, so higher-priority tasks can starve lower-priority ones:
https://docs.taskcluster.net/docs/manual/tasks/priority
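For context, a minimal sketch of where that priority lives, assuming the standard Taskcluster task-definition schema; the pool names and payload below are hypothetical:

```python
# Minimal sketch of a Taskcluster task definition (hypothetical worker
# pool and payload). The "priority" field is absolute: while any
# higher-priority task is pending in a pool, lower-priority tasks in
# that pool do not start.
task = {
    "provisionerId": "proj-servo",   # hypothetical worker pool
    "workerType": "macos",           # hypothetical worker type
    "priority": "low",               # absolute, e.g. "highest" ... "lowest"
    "payload": {
        "command": ["./mach", "test-wpt"],
        "maxRunTime": 2 * 60 * 60,   # seconds
    },
}
```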
## Before this
Before this PR, we had roughly as many chunks as available workers.
Because the number of test files is a poor estimate of the time
needed to run them, we see significant variation in completion time
between chunks when testing a given PR.
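As an illustration of why file count is a poor proxy (this is not Servo's actual chunking code; the file names and runtimes are made up), chunks with equal file counts can still have very unequal durations:

```python
# Illustrative only (not Servo's chunking code): split test files into
# chunks of roughly equal *count*. Wall time per chunk still varies
# because per-file runtimes differ widely.
def chunk_by_count(files, n_chunks):
    return [files[i::n_chunks] for i in range(n_chunks)]

# Made-up runtimes in seconds: equal-count chunks, unequal durations.
runtimes = {"a.html": 1, "b.html": 120, "c.html": 2, "d.html": 90}
chunks = chunk_by_count(sorted(runtimes), 2)
print([sum(runtimes[f] for f in chunk) for chunk in chunks])  # [3, 210]
```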
https://github.com/servo/taskcluster-config/pull/9 adds a tool to collect
this data. Here are two full runs of `test_wpt` before this PR:
https://community-tc.services.mozilla.com/tasks/groups/DBt9ki9gTdWmwAk-VDorzw
```
count 1, total 0:00:32, max: 0:00:32 docker 0:00:32
count 1, total 0:59:14, max: 0:59:14 macos-disabled-mac1 0:59:14
count 6, total 4:12:16, max: 1:01:14 macos-disabled-mac1 WPT 0:40:29 0:18:55 0:46:50 0:44:38 1:01:14 0:40:10
count 1, total 0:55:19, max: 0:55:19 macos-disabled-mac9 0:55:19
count 6, total 4:25:09, max: 1:01:40 macos-disabled-mac9 WPT 0:37:58 0:37:24 0:27:18 1:01:40 0:46:17 0:54:31
```
Times for a given chunk vary between 19 minutes and 61 minutes.
Assuming no `try` testing, with Homu’s serial scheduling of `r+` testing
this means that the worker that finishes the 19-minute chunk sits idle
for 61 − 19 = 42 minutes while the slowest chunk completes,
and our limited CPU resources are under-utilized.
When there *are* `try` PRs being tested, however, they compete with
each other and with any `r+` PR for the same workers. If we get unlucky,
a 61-minute task could only *start* after some other tasks have finished,
increasing the overall time-to-merge a lot.
## This
This PR changes the number of chunks to be significantly larger
than the number of available workers. When a chunk finishes,
its worker can pick up another one instead of sitting idle.
Now the exact ratio of tasks to workers doesn’t matter:
the differences in run time between tasks become somewhat of an advantage,
and the distribution of work across workers evens out on average.
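A small simulation of that effect, with made-up per-chunk runtimes (not measured data): each chunk is greedily assigned to the earliest-free worker, comparing few large chunks against many small ones covering the same total work.

```python
import heapq
import random

def makespan(chunk_times, n_workers):
    """Greedy list scheduling: a finished worker picks up the next chunk."""
    free_at = [0.0] * n_workers          # time at which each worker is free
    heapq.heapify(free_at)
    for t in chunk_times:
        earliest = heapq.heappop(free_at)
        heapq.heappush(free_at, earliest + t)
    return max(free_at)

random.seed(0)
TOTAL_WORK = 4 * 60 * 60                 # ~4 CPU-hours, as in the runs above
WORKERS = 6
for n_chunks in (6, 30):
    # Made-up per-chunk times: same total work, random imbalance.
    weights = [random.uniform(0.5, 1.5) for _ in range(n_chunks)]
    times = [TOTAL_WORK * w / sum(weights) for w in weights]
    print(n_chunks, "chunks:", round(makespan(times, WORKERS) / 60), "min")
```

With 6 chunks the makespan is dominated by the slowest chunk; with 30 it approaches the total work divided by the worker count.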
The number 30 is a bit arbitrary. A higher number reduces resource
under-utilization, but increases the effect of per-task overhead.
The git cache added in https://github.com/servo/servo/pull/24753
reduced that overhead, though.
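A back-of-the-envelope version of that trade-off; the per-task overhead here is an assumption for illustration, not a measurement:

```python
# Assumed numbers: 2 minutes of per-task overhead (fetch, setup) on top
# of the ~4 CPU-hours of real test work seen in the runs above.
OVERHEAD = 2 * 60                # seconds per task; an assumption
TOTAL_WORK = 4 * 60 * 60
WORKERS = 6

for n_chunks in (6, 30):
    # Ideal makespan: work plus total overhead, spread evenly over workers.
    ideal = (TOTAL_WORK + n_chunks * OVERHEAD) / WORKERS
    print(n_chunks, "chunks:", round(ideal / 60), "min ideal makespan")
```

The extra overhead of 30 chunks only pays off because, in practice, the better load balancing buys back more time than the overhead costs.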
Another worry I had was whether this would worsen the similar problem
of unequal scheduling between processes within a task,
where some CPU cores sit idle while the remaining processes finish their
assigned work.
This turned out not to be enough of a problem to negatively affect
the total machine time:
https://community-tc.services.mozilla.com/tasks/groups/VnDac92HQU6QmrpzWPCR2w
```
count 1, total 0:00:48, max: 0:00:48 docker 0:00:48
count 1, total 0:39:04, max: 0:39:04 macos-disabled-mac9 0:39:04
count 31, total 4:03:29, max: 0:15:29 macos-disabled-mac9 WPT
0:07:26 0:08:39 0:04:21 0:07:13 0:12:47 0:10:11 0:04:01 0:03:36
0:10:43 0:12:57 0:04:47 0:04:06 0:10:09 0:12:00 0:12:42 0:04:40
0:04:24 0:12:20 0:12:15 0:03:03 0:07:35 0:11:35 0:07:01 0:04:16
0:09:40 0:05:08 0:05:01 0:06:29 0:15:29 0:02:28 0:06:27
```
(4h03min is even lower than above, but seems within normal run-to-run variation.)
## After this
https://github.com/servo/servo/issues/23655 proposes automatically
restarting failed WPT tasks, in case the failure is intermittent.
With the test suite split into more chunks, each chunk has fewer tests,
and therefore a lower probability that a given chunk fails.
Restarting one also repeats less work.
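Spelling out the probability argument with assumed numbers (both the per-test flake rate and the total test count below are made up):

```python
# Assume each test fails intermittently with independent probability p
# (made-up value). A chunk of n tests then fails at least once with
# probability 1 - (1 - p)**n, so smaller chunks flake less often, and a
# retry repeats only that chunk's work.
p = 1e-4                                  # assumed per-test flake rate
total_tests = 30_000                      # assumed rough order of magnitude

for n_chunks in (6, 30):
    n = total_tests // n_chunks
    p_chunk = 1 - (1 - p) ** n
    print(f"{n_chunks} chunks: {n} tests/chunk, "
          f"P(chunk flakes) ≈ {p_chunk:.0%}")
```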
Extend the nightly WPT update timeout by an hour.
Jobs have recently been timing out more than usual, even on machines without any obvious resource hogging going on.