Requires servo/servo#37045 for deps and config.
Testing: No need for tests to test tests.
Fixes: servo/servo#37041
---------
Signed-off-by: zefr0x <zer0-x.7ty50@aleeas.com>
This is less user friendly than GitHub checks, but it has larger content limits.
Signed-off-by: sagudev <16504129+sagudev@users.noreply.github.com>
Co-authored-by: Martin Robinson <mrobinson@igalia.com>
This change adds an alternate method for triggering try runs. Instead
of comments, runs are triggered by applying labels to pull requests.
The action will remove the label from the pull request and start the
requested jobs.
This will require creating at least a few labels:
- T-full
- T-linux-wpt-2013
- T-linux-wpt-2020
- T-macos
- T-windows
More labels can be added as we support more configurations.
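As a rough sketch of how the mapping might look (the configuration names below are illustrative, not the workflow's actual data structure):

```python
# Hypothetical mapping from try labels to the build configurations
# they request; "T-full" expands to every supported configuration.
TRY_LABELS = {
    "T-full": ["linux-wpt-2013", "linux-wpt-2020", "macos", "windows"],
    "T-linux-wpt-2013": ["linux-wpt-2013"],
    "T-linux-wpt-2020": ["linux-wpt-2020"],
    "T-macos": ["macos"],
    "T-windows": ["windows"],
}
```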
The good thing about this change is that try jobs run against the
actual branch in the pull request instead of the master branch. This
means that changes to CI can be tested (unlike with comment processing).
One big caveat with this change is that when adding multiple labels, a
CI job is triggered for each one. Only one real build will run for each
label, but whether one or more try jobs are triggered is a race
condition. The first CI job to successfully remove a label will
actually trigger the corresponding jobs. If the same CI job removes two
compatible labels, they can share a build (for instance, two types of
Linux WPT jobs); if not, there will be two builds. Note that this is at
least as efficient as the current behavior.
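A minimal sketch of that label-removal handshake, assuming the workflow talks to the GitHub REST API directly (the helper below is hypothetical; only the endpoint is real):

```python
import requests

def try_remove_label(repo: str, pr_number: int, label: str, token: str) -> bool:
    # DELETE returns 200 when this call actually removed the label and
    # 404 when another CI job already removed it, so a 200 means this
    # job won the race and should start the requested build.
    response = requests.delete(
        f"https://api.github.com/repos/{repo}/issues/{pr_number}/labels/{label}",
        headers={"Authorization": f"token {token}"},
    )
    return response.status_code == 200
```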
There are currently two ways to run try. One is to push to the `try` or
`try-*` branches and the other is to trigger a workflow via GitHub
comment. This change combines these methods into one workflow. In
addition, WPT results are reported together rather than separately and
filtered results for all WPT tests are bundled together in the same
artifact.
This is the last piece of the puzzle for turning off bors. It turns the
functionality bors provides to understand "@bors-servo try" into a
GitHub Action. For now the syntax is more or less the same, but we can
modify it in the future and even add support for custom configuration
options (more specific build combinations or even passing compiler flags).
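For illustration, a hedged sketch of what parsing that comment syntax could look like (the regex and the defaults are assumptions, not the Action's actual implementation):

```python
import re

# Matches comments like "@bors-servo try" or "@bors-servo try=wpt".
TRY_COMMENT = re.compile(r"@bors-servo\s+try(?:=(?P<config>\S+))?")

def parse_try_comment(body: str) -> str | None:
    match = TRY_COMMENT.search(body)
    if match is None:
        return None  # not a try request
    return match.group("config") or "full"  # bare "try" means a full run
```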
The big difference between this and what bors does is that there is no
merge commit. GitHub simply runs tests on the version of the branch that
is on a pull request. There is always the risk that tests might start
failing when a branch is rebased, but this offers a bit more control
because you can easily rebase from the PR and the merge queue will check
this as well.
Also fix report_aggregated_expected_results.py, which was reporting an
error when there were no failing tests. This is more commonly an issue
with Layout 2020 because it runs fewer tests, and it was causing builds
to show up as failing even when they were not.
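The fix amounts to handling the empty case explicitly; a rough sketch (the function and names here are illustrative, not the script's actual code):

```python
def aggregate(unexpected_results: list) -> str:
    # An empty list previously fell through to the error path; treat it
    # as a successful run instead of a failure.
    if not unexpected_results:
        return "No unexpected results."
    return "\n".join(str(result) for result in unexpected_results)
```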
When doing a try run, bors will often push the last closed merge onto
the branch before pushing the change to try. This means that test
results get reported on closed PRs. There are two issues with this:
1. Doing too much work on the bots.
2. Extra results on closed PRs.
This change fixes the second issue.
Fixes #29583.
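One plausible way to implement that check, sketched against the real GitHub pulls endpoint (the helper itself is hypothetical):

```python
import requests

def pr_is_open(repo: str, pr_number: int, token: str) -> bool:
    # Skip reporting results for pull requests that are already closed.
    response = requests.get(
        f"https://api.github.com/repos/{repo}/pulls/{pr_number}",
        headers={"Authorization": f"token {token}"},
    )
    return response.ok and response.json().get("state") == "open"
```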
Make WPT results output more useful
Before when a subtest failed, the text of the failed assertion was not printed. This change makes sure that it is printed in both the console and the aggregated test output.
Also fix a couple of typing errors.
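Conceptually the change is just to carry the assertion message along with the failure status; a sketch with illustrative names:

```python
def format_subtest_result(name: str, status: str, message: str | None) -> str:
    # Append the assertion text, when present, so failures are
    # actionable from the console and the aggregated output alike.
    line = f"{status} {name}"
    if message:
        line += f": {message}"
    return line
```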
---
- [x] `./mach build -d` does not report any errors
- [x] `./mach test-tidy` does not report any errors
- [x] These changes do not require tests because these are improvements to build tools.
There are two kinds of flaky/intermittent tests in Servo. The
traditional kind is the test that fails on the CI, but has an associated
bug indicating that the test is an intermittent failure. Many of these
tests have completely unstable results, for instance those where an
unpredictable set of subtests fail. It's impossible to generate stable
results for these, so we have traditionally simply discarded these
unexpected results.
Another kind of intermittent test is one that will produce an expected
result when rerun (i.e. it will flake). Some of these are also labeled
with bugs, while some are not. In some cases, there is flakiness in some
core Servo functionality that can lead to *any* test flaking, such as a
race condition that can lead to an early screenshot for reftests. When
these kinds of tests do not have associated bugs, they cause the CI to
fail. In this case, it is impossible to label the tests as intermittent
ahead of time, because literally any test can be affected.
This change reruns failed tests in order to detect unlabeled tests in
the second category. Instead of blocking the CI when the second run
leads to expected results, the CI will now pass, but the flake will be
reported to the new flakiness dashboard. This prevents unrelated flakes
from slowing down the merge queue.
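In outline, the rerun logic looks something like this sketch (function names are illustrative; `run_tests` is assumed to return the list of tests with unexpected results):

```python
def detect_flakes(tests, run_tests, report_flake):
    failed = run_tests(tests)                # first pass over the full suite
    still_failing = set(run_tests(failed))   # second pass, failures only
    for test in failed:
        if test not in still_failing:
            # Passed on the rerun: a flake. Report it to the dashboard
            # instead of failing the CI.
            report_flake(test)
    return sorted(still_failing)  # only persistent failures block the CI
```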
Use the new intermittent dashboard to report intermittents and get
information about open bugs. This is now used to filter out
known intermittents from results. In addition, this allows the scripts
to report bug information to GitHub and to display it in all output.
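A sketch of the filtering step, with an assumed dashboard endpoint and payload shape (the real dashboard API may differ):

```python
import requests

def filter_known_intermittents(results, dashboard_url, token):
    # Ask the dashboard which of these tests have open intermittent
    # bugs; the endpoint and response shape here are assumptions.
    response = requests.post(
        f"{dashboard_url}/query",
        json={"names": [result["test"] for result in results]},
        headers={"Authorization": f"Bearer {token}"},
    )
    known = {entry["test"]: entry.get("bug") for entry in response.json()}
    unexpected, intermittent = [], []
    for result in results:
        if result["test"] in known:
            result["bug"] = known[result["test"]]  # surfaced in all output
            intermittent.append(result)
        else:
            unexpected.append(result)
    return unexpected, intermittent
```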
After filtering intermittents, output the results as JSON. Update the
GitHub workflow to aggregate this JSON data into an artifact and use the
aggregated data to generate a GitHub comment with details about the try
run. The idea here is that this comment will make it easier to track
intermittent tests and notice when a change affects a test marked as
intermittent -- either causing it to permanently fail or fixing it.
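The per-job output can then be as simple as one JSON file per build that the workflow concatenates into a single artifact; the field names below are illustrative:

```python
import json

def write_filtered_results(path, unexpected, known_intermittents):
    # One JSON file per build; the workflow aggregates these into a
    # single artifact and renders the try-run comment from them.
    with open(path, "w") as f:
        json.dump(
            {"unexpected": unexpected, "known_intermittents": known_intermittents},
            f,
            indent=2,
        )
```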