mirror of
https://github.com/servo/servo.git
synced 2025-06-23 16:44:33 +01:00
209 lines
6.5 KiB
ReStructuredText
209 lines
6.5 KiB
ReStructuredText
The moving parts
|
|
================
|
|
|
|
html5lib consists of a number of components, which are responsible for
|
|
handling its features.
|
|
|
|
|
|
Tree builders
|
|
-------------
|
|
|
|
The parser reads HTML by tokenizing the content and building a tree that
|
|
the user can later access. There are three main types of trees that
|
|
html5lib can build:
|
|
|
|
* ``etree`` - this is the default; builds a tree based on ``xml.etree``,
|
|
which can be found in the standard library. Whenever possible, the
|
|
accelerated ``ElementTree`` implementation (i.e.
|
|
``xml.etree.cElementTree`` on Python 2.x) is used.
|
|
|
|
* ``dom`` - builds a tree based on ``xml.dom.minidom``.
|
|
|
|
* ``lxml.etree`` - uses lxml's implementation of the ``ElementTree``
|
|
API. The performance gains are relatively small compared to using the
|
|
accelerated ``ElementTree`` module.
|
|
|
|
You can specify the builder by name when using the shorthand API:
|
|
|
|
.. code-block:: python
|
|
|
|
import html5lib
|
|
with open("mydocument.html", "rb") as f:
|
|
lxml_etree_document = html5lib.parse(f, treebuilder="lxml")
|
|
|
|
When instantiating a parser object, you have to pass a tree builder
|
|
class in the ``tree`` keyword attribute:
|
|
|
|
.. code-block:: python
|
|
|
|
import html5lib
|
|
parser = html5lib.HTMLParser(tree=SomeTreeBuilder)
|
|
document = parser.parse("<p>Hello World!")
|
|
|
|
To get a builder class by name, use the ``getTreeBuilder`` function:
|
|
|
|
.. code-block:: python
|
|
|
|
import html5lib
|
|
parser = html5lib.HTMLParser(tree=html5lib.getTreeBuilder("dom"))
|
|
minidom_document = parser.parse("<p>Hello World!")
|
|
|
|
The implementation of builders can be found in `html5lib/treebuilders/
|
|
<https://github.com/html5lib/html5lib-python/tree/master/html5lib/treebuilders>`_.
|
|
|
|
|
|
Tree walkers
|
|
------------
|
|
|
|
Once a tree is ready, you can work on it either manually, or using
|
|
a tree walker, which provides a streaming view of the tree. html5lib
|
|
provides walkers for all three supported types of trees (``etree``,
|
|
``dom`` and ``lxml``).
|
|
|
|
The implementation of walkers can be found in `html5lib/treewalkers/
|
|
<https://github.com/html5lib/html5lib-python/tree/master/html5lib/treewalkers>`_.
|
|
|
|
Walkers make consuming HTML easier. html5lib uses them to provide you
|
|
with has a couple of handy tools.
|
|
|
|
|
|
HTMLSerializer
|
|
~~~~~~~~~~~~~~
|
|
|
|
The serializer lets you write HTML back as a stream of bytes.
|
|
|
|
.. code-block:: pycon
|
|
|
|
>>> import html5lib
|
|
>>> element = html5lib.parse('<p xml:lang="pl">Witam wszystkich')
|
|
>>> walker = html5lib.getTreeWalker("etree")
|
|
>>> stream = walker(element)
|
|
>>> s = html5lib.serializer.HTMLSerializer()
|
|
>>> output = s.serialize(stream)
|
|
>>> for item in output:
|
|
... print("%r" % item)
|
|
'<p'
|
|
' '
|
|
'xml:lang'
|
|
'='
|
|
'pl'
|
|
'>'
|
|
'Witam wszystkich'
|
|
|
|
You can customize the serializer behaviour in a variety of ways, consult
|
|
the :class:`~html5lib.serializer.htmlserializer.HTMLSerializer`
|
|
documentation.
|
|
|
|
|
|
Filters
|
|
~~~~~~~
|
|
|
|
You can alter the stream content with filters provided by html5lib:
|
|
|
|
* :class:`alphabeticalattributes.Filter
|
|
<html5lib.filters.alphabeticalattributes.Filter>` sorts attributes on
|
|
tags to be in alphabetical order
|
|
|
|
* :class:`inject_meta_charset.Filter
|
|
<html5lib.filters.inject_meta_charset.Filter>` sets a user-specified
|
|
encoding in the correct ``<meta>`` tag in the ``<head>`` section of
|
|
the document
|
|
|
|
* :class:`lint.Filter <html5lib.filters.lint.Filter>` raises
|
|
``LintError`` exceptions on invalid tag and attribute names, invalid
|
|
PCDATA, etc.
|
|
|
|
* :class:`optionaltags.Filter <html5lib.filters.optionaltags.Filter>`
|
|
removes tags from the stream which are not necessary to produce valid
|
|
HTML
|
|
|
|
* :class:`sanitizer.Filter <html5lib.filters.sanitizer.Filter>` removes
|
|
unsafe markup and CSS. Elements that are known to be safe are passed
|
|
through and the rest is converted to visible text. The default
|
|
configuration of the sanitizer follows the `WHATWG Sanitization Rules
|
|
<http://wiki.whatwg.org/wiki/Sanitization_rules>`_.
|
|
|
|
* :class:`whitespace.Filter <html5lib.filters.whitespace.Filter>`
|
|
collapses all whitespace characters to single spaces unless they're in
|
|
``<pre/>`` or ``textarea`` tags.
|
|
|
|
To use a filter, simply wrap it around a stream:
|
|
|
|
.. code-block:: python
|
|
|
|
>>> import html5lib
|
|
>>> from html5lib.filters import sanitizer
|
|
>>> dom = html5lib.parse("<p><script>alert('Boo!')", treebuilder="dom")
|
|
>>> walker = html5lib.getTreeWalker("dom")
|
|
>>> stream = walker(dom)
|
|
>>> sane_stream = sanitizer.Filter(stream) clean_stream = sanitizer.Filter(stream)
|
|
|
|
|
|
Tree adapters
|
|
-------------
|
|
|
|
Used to translate one type of tree to another. More documentation
|
|
pending, sorry.
|
|
|
|
|
|
Encoding discovery
|
|
------------------
|
|
|
|
Parsed trees are always Unicode. However a large variety of input
|
|
encodings are supported. The encoding of the document is determined in
|
|
the following way:
|
|
|
|
* The encoding may be explicitly specified by passing the name of the
|
|
encoding as the encoding parameter to the
|
|
:meth:`~html5lib.html5parser.HTMLParser.parse` method on
|
|
``HTMLParser`` objects.
|
|
|
|
* If no encoding is specified, the parser will attempt to detect the
|
|
encoding from a ``<meta>`` element in the first 512 bytes of the
|
|
document (this is only a partial implementation of the current HTML
|
|
5 specification).
|
|
|
|
* If no encoding can be found and the chardet library is available, an
|
|
attempt will be made to sniff the encoding from the byte pattern.
|
|
|
|
* If all else fails, the default encoding will be used. This is usually
|
|
`Windows-1252 <http://en.wikipedia.org/wiki/Windows-1252>`_, which is
|
|
a common fallback used by Web browsers.
|
|
|
|
|
|
Tokenizers
|
|
----------
|
|
|
|
The part of the parser responsible for translating a raw input stream
|
|
into meaningful tokens is the tokenizer. Currently html5lib provides
|
|
two.
|
|
|
|
To set up a tokenizer, simply pass it when instantiating
|
|
a :class:`~html5lib.html5parser.HTMLParser`:
|
|
|
|
.. code-block:: python
|
|
|
|
import html5lib
|
|
from html5lib import sanitizer
|
|
|
|
p = html5lib.HTMLParser(tokenizer=sanitizer.HTMLSanitizer)
|
|
p.parse("<p>Surprise!<script>alert('Boo!');</script>")
|
|
|
|
HTMLTokenizer
|
|
~~~~~~~~~~~~~
|
|
|
|
This is the default tokenizer, the heart of html5lib. The implementation
|
|
can be found in `html5lib/tokenizer.py
|
|
<https://github.com/html5lib/html5lib-python/blob/master/html5lib/tokenizer.py>`_.
|
|
|
|
HTMLSanitizer
|
|
~~~~~~~~~~~~~
|
|
|
|
This is a tokenizer that removes unsafe markup and CSS styles from the
|
|
input. Elements that are known to be safe are passed through and the
|
|
rest is converted to visible text. The default configuration of the
|
|
sanitizer follows the `WHATWG Sanitization Rules
|
|
<http://wiki.whatwg.org/wiki/Sanitization_rules>`_.
|
|
|
|
The implementation can be found in `html5lib/sanitizer.py
|
|
<https://github.com/html5lib/html5lib-python/blob/master/html5lib/sanitizer.py>`_.
|