Update html5lib-1.1

2025-07-07 05:31:15 -07:00 · 2021-10-14 22:49:47 -07:00
commit 586fd15464
parent 3a116486e7
142 changed files with 90234 additions and 2393 deletions
--- a/lib/html5lib/tests/testdata/tokenizer/README.md
+++ b/lib/html5lib/tests/testdata/tokenizer/README.md
@@ -0,0 +1,107 @@
+Tokenizer tests
+===============
+
+The test format is [JSON](http://www.json.org/). This has the advantage
+that the syntax allows backward-compatible extensions to the tests and
+the disadvantage that it is relatively verbose.
+
+Basic Structure
+---------------
+
+    {"tests": [
+        {"description": "Test description",
+        "input": "input_string",
+        "output": [expected_output_tokens],
+        "initialStates": [initial_states],
+        "lastStartTag": last_start_tag,
+        "errors": [parse_errors]
+        }
+    ]}
+
+Multiple tests per file are allowed simply by adding more objects to the
+"tests" list.
+
+Each parse error is an object that contains an error `code` and one-based
+error location indices: `line` and `col`.
+
+`description`, `input` and `output` are always present. The other values
+are optional.
+
+### Test set-up
+
+`test.input` is a string containing the characters to pass to the
+tokenizer. Specifically, it represents the characters of the **input
+stream**, and so implementations are expected to perform the processing
+described in the spec's **Preprocessing the input stream** section
+before feeding the result to the tokenizer.
+
+If `test.doubleEscaped` is present and `true`, then `test.input` is not
+quite as described above. Instead, it must first be subjected to another
+round of unescaping (i.e., in addition to any unescaping involved in the
+JSON import), and the result of *that* represents the characters of the
+input stream. Currently, the only unescaping required by this option is
+to convert each sequence of the form \\uHHHH (where H is a hex digit)
+into the corresponding Unicode code point. (Note that this option also
+affects the interpretation of `test.output`.)
+
+`test.initialStates` is a list of strings, each being the name of a
+tokenizer state which can be one of the following:
+
+-   `Data state`
+-   `PLAINTEXT state`
+-   `RCDATA state`
+-   `RAWTEXT state`
+-   `Script data state`
+-   `CDATA section state`
+
+The test should be run once for each string, using it
+to set the tokenizer's initial state for that run. If
+`test.initialStates` is omitted, it defaults to `["Data state"]`.
+
+`test.lastStartTag` is a lowercase string that should be used as "the
+tag name of the last start tag to have been emitted from this
+tokenizer", referenced in the spec's definition of **appropriate end tag
+token**. If it is omitted, it is treated as if "no start tag has been
+emitted from this tokenizer".
+
+### Test results
+
+`test.output` is a list of tokens in the order the tokenizer produces
+them, with the first token produced leftmost in the list. The list must
+match the **complete** list of tokens that the tokenizer should produce.
+Valid tokens are:
+
+    ["DOCTYPE", name, public_id, system_id, correctness]
+    ["StartTag", name, {attributes}*, true*]
+    ["StartTag", name, {attributes}]
+    ["EndTag", name]
+    ["Comment", data]
+    ["Character", data]
+
+`public_id` and `system_id` are either strings or `null`. `correctness`
+is either `true` or `false`; `true` corresponds to the force-quirks flag
+being false, and vice-versa.
+
+When the self-closing flag is set, the `StartTag` array has `true` as
+its fourth entry. When the flag is not set, the array has only three
+entries for backwards compatibility.
+
+All adjacent character tokens are coalesced into a single
+`["Character", data]` token.
+
+If `test.doubleEscaped` is present and `true`, then every string within
+`test.output` must be further unescaped (as described above) before
+comparing with the tokenizer's output.
+
+xmlViolation tests
+------------------
+
+`tokenizer/xmlViolation.test` differs from the above in a couple of
+ways:
+
+-   The name of the single member of the top-level JSON object is
+    "xmlViolationTests" instead of "tests".
+-   Each test's expected output assumes that the implementation is applying
+    the tweaks given in the spec's "Coercing an HTML DOM into an
+    infoset" section.
+
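As an illustration of the set-up and result rules in the README above, a harness might normalise each test before running it. The following is a minimal Python sketch, not html5lib's actual test runner: the helper names `unescape`, `unescape_token`, `coalesce` and `expand` are hypothetical, and the sketch assumes that escaped surrogates do not need to be combined into pairs.

```python
import re

# Matches a literal backslash, "u", and exactly four hex digits.
_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})")


def unescape(text):
    """Apply the extra unescaping round used by doubleEscaped tests."""
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)


def unescape_token(value):
    """Recursively unescape every string inside an expected token."""
    if isinstance(value, str):
        return unescape(value)
    if isinstance(value, list):
        return [unescape_token(v) for v in value]
    if isinstance(value, dict):
        return {unescape(k): unescape_token(v) for k, v in value.items()}
    return value  # true, false and null pass through unchanged


def coalesce(tokens):
    """Merge adjacent ["Character", data] tokens into a single token."""
    merged = []
    for token in tokens:
        if token[0] == "Character" and merged and merged[-1][0] == "Character":
            merged[-1] = ["Character", merged[-1][1] + token[1]]
        else:
            merged.append(token)
    return merged


def expand(test):
    """Yield (initial_state, input, expected_output) for each required run."""
    text, output = test["input"], test["output"]
    if test.get("doubleEscaped"):
        text, output = unescape(text), unescape_token(output)
    # A missing initialStates member defaults to the data state.
    for state in test.get("initialStates", ["Data state"]):
        yield state, text, output
```

A runner would feed each `(state, input)` pair to its tokenizer, pass the emitted tokens through `coalesce`, and compare the result with the expected output.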
--- a/lib/html5lib/tests/testdata/tokenizer/contentModelFlags.test
+++ b/lib/html5lib/tests/testdata/tokenizer/contentModelFlags.test
@@ -0,0 +1,93 @@
+{"tests": [
+
+{"description":"PLAINTEXT content model flag",
+"initialStates":["PLAINTEXT state"],
+"lastStartTag":"plaintext",
+"input":"<head>&body;",
+"output":[["Character", "<head>&body;"]]},
+
+{"description":"PLAINTEXT with seeming close tag",
+"initialStates":["PLAINTEXT state"],
+"lastStartTag":"plaintext",
+"input":"</plaintext>&body;",
+"output":[["Character", "</plaintext>&body;"]]},
+
+{"description":"End tag closing RCDATA or RAWTEXT",
+"initialStates":["RCDATA state", "RAWTEXT state"],
+"lastStartTag":"xmp",
+"input":"foo</xmp>",
+"output":[["Character", "foo"], ["EndTag", "xmp"]]},
+
+{"description":"End tag closing RCDATA or RAWTEXT (case-insensitivity)",
+"initialStates":["RCDATA state", "RAWTEXT state"],
+"lastStartTag":"xmp",
+"input":"foo</xMp>",
+"output":[["Character", "foo"], ["EndTag", "xmp"]]},
+
+{"description":"End tag closing RCDATA or RAWTEXT (ending with space)",
+"initialStates":["RCDATA state", "RAWTEXT state"],
+"lastStartTag":"xmp",
+"input":"foo</xmp ",
+"output":[["Character", "foo"]],
+"errors":[
+    { "code": "eof-in-tag", "line": 1, "col": 10 }
+]},
+
+{"description":"End tag closing RCDATA or RAWTEXT (ending with EOF)",
+"initialStates":["RCDATA state", "RAWTEXT state"],
+"lastStartTag":"xmp",
+"input":"foo</xmp",
+"output":[["Character", "foo</xmp"]]},
+
+{"description":"End tag closing RCDATA or RAWTEXT (ending with slash)",
+"initialStates":["RCDATA state", "RAWTEXT state"],
+"lastStartTag":"xmp",
+"input":"foo</xmp/",
+"output":[["Character", "foo"]],
+"errors":[
+    { "code": "eof-in-tag", "line": 1, "col": 10 }
+]},
+
+{"description":"End tag not closing RCDATA or RAWTEXT (ending with left-angle-bracket)",
+"initialStates":["RCDATA state", "RAWTEXT state"],
+"lastStartTag":"xmp",
+"input":"foo</xmp<",
+"output":[["Character", "foo</xmp<"]]},
+
+{"description":"End tag with incorrect name in RCDATA or RAWTEXT",
+"initialStates":["RCDATA state", "RAWTEXT state"],
+"lastStartTag":"xmp",
+"input":"</foo>bar</xmp>",
+"output":[["Character", "</foo>bar"], ["EndTag", "xmp"]]},
+
+{"description":"Partial end tags leading straight into partial end tags",
+"initialStates":["RCDATA state", "RAWTEXT state"],
+"lastStartTag":"xmp",
+"input":"</xmp</xmp</xmp>",
+"output":[["Character", "</xmp</xmp"], ["EndTag", "xmp"]]},
+
+{"description":"End tag with incorrect name in RCDATA or RAWTEXT (starting like correct name)",
+"initialStates":["RCDATA state", "RAWTEXT state"],
+"lastStartTag":"xmp",
+"input":"</foo>bar</xmpaar>",
+"output":[["Character", "</foo>bar</xmpaar>"]]},
+
+{"description":"End tag closing RCDATA or RAWTEXT, switching back to PCDATA",
+"initialStates":["RCDATA state", "RAWTEXT state"],
+"lastStartTag":"xmp",
+"input":"foo</xmp></baz>",
+"output":[["Character", "foo"], ["EndTag", "xmp"], ["EndTag", "baz"]]},
+
+{"description":"RAWTEXT w/ something looking like an entity",
+"initialStates":["RAWTEXT state"],
+"lastStartTag":"xmp",
+"input":"&foo;",
+"output":[["Character", "&foo;"]]},
+
+{"description":"RCDATA w/ an entity",
+"initialStates":["RCDATA state"],
+"lastStartTag":"textarea",
+"input":"&lt;",
+"output":[["Character", "<"]]}
+
+]}
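The tests in this file depend on the "appropriate end tag token" check that the README ties to `lastStartTag`: `</xMp>` closes the element when the last start tag was `xmp`, while `</xmpaar>` stays character data. A minimal Python sketch of that comparison follows. The function name `is_appropriate_end_tag` is hypothetical, and the sketch assumes the candidate name has already been read from the input and only needs ASCII lowercasing.

```python
import string

# Only ASCII upper-case letters are folded, mirroring how tag names are built.
_ASCII_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)


def is_appropriate_end_tag(tag_name, last_start_tag=None):
    """Return True if an end tag with this name matches the last start tag.

    last_start_tag is None when no start tag has been emitted, in which
    case no end tag is appropriate.
    """
    if last_start_tag is None:
        return False
    return tag_name.translate(_ASCII_LOWER) == last_start_tag
```

With `last_start_tag="xmp"`, the names `xmp` and `xMp` are accepted and `xmpaar` and `foo` are rejected, which is consistent with the expected outputs above.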
--- a/lib/html5lib/tests/testdata/tokenizer/domjs.test
+++ b/lib/html5lib/tests/testdata/tokenizer/domjs.test
@@ -0,0 +1,330 @@
+{
+    "tests": [
+        {
+            "description":"CR in bogus comment state",
+            "input":"<?\u000d",
+            "output":[["Comment", "?\u000a"]],
+            "errors":[
+                { "code": "unexpected-question-mark-instead-of-tag-name", "line": 1, "col": 2 }
+            ]
+        },
+        {
+            "description":"CRLF in bogus comment state",
+            "input":"<?\u000d\u000a",
+            "output":[["Comment", "?\u000a"]],
+            "errors":[
+                { "code": "unexpected-question-mark-instead-of-tag-name", "line": 1, "col": 2 }
+            ]
+        },
+        {
+            "description":"CRLFLF in bogus comment state",
+            "input":"<?\u000d\u000a\u000a",
+            "output":[["Comment", "?\u000a\u000a"]],
+            "errors":[
+                { "code": "unexpected-question-mark-instead-of-tag-name", "line": 1, "col": 2 }
+            ]
+        },
+        {
+            "description":"Raw NUL replacement",
+            "doubleEscaped":true,
+            "initialStates":["RCDATA state", "RAWTEXT state", "PLAINTEXT state", "Script data state"],
+            "input":"\\u0000",
+            "output":[["Character", "\\uFFFD"]],
+            "errors":[
+                { "code": "unexpected-null-character", "line": 1, "col": 1 }
+            ]
+        },
+        {
+            "description":"NUL in CDATA section",
+            "doubleEscaped":true,
+            "initialStates":["CDATA section state"],
+            "input":"\\u0000]]>",
+            "output":[["Character", "\\u0000"]]
+        },
+        {
+           "description":"NUL in script HTML comment",
+           "doubleEscaped":true,
+           "initialStates":["Script data state"],
+           "input":"<!--test\\u0000--><!--test-\\u0000--><!--test--\\u0000-->",
+           "output":[["Character", "<!--test\\uFFFD--><!--test-\\uFFFD--><!--test--\\uFFFD-->"]],
+           "errors":[
+               { "code": "unexpected-null-character", "line": 1, "col": 9 },
+               { "code": "unexpected-null-character", "line": 1, "col": 22 },
+               { "code": "unexpected-null-character", "line": 1, "col": 36 }
+           ]
+        },
+        {
+           "description":"NUL in script HTML comment - double escaped",
+           "doubleEscaped":true,
+           "initialStates":["Script data state"],
+           "input":"<!--<script>\\u0000--><!--<script>-\\u0000--><!--<script>--\\u0000-->",
+           "output":[["Character", "<!--<script>\\uFFFD--><!--<script>-\\uFFFD--><!--<script>--\\uFFFD-->"]],
+           "errors":[
+                { "code": "unexpected-null-character", "line": 1, "col": 13 },
+                { "code": "unexpected-null-character", "line": 1, "col": 30 },
+                { "code": "unexpected-null-character", "line": 1, "col": 48 }
+           ]
+        },
+        {
+           "description":"EOF in script HTML comment",
+           "initialStates":["Script data state"],
+           "input":"<!--test",
+           "output":[["Character", "<!--test"]],
+           "errors":[
+               { "code": "eof-in-script-html-comment-like-text", "line": 1, "col": 9 }
+           ]
+        },
+        {
+           "description":"EOF in script HTML comment after dash",
+           "initialStates":["Script data state"],
+           "input":"<!--test-",
+           "output":[["Character", "<!--test-"]],
+           "errors":[
+               { "code": "eof-in-script-html-comment-like-text", "line": 1, "col": 10 }
+           ]
+        },
+        {
+           "description":"EOF in script HTML comment after dash dash",
+           "initialStates":["Script data state"],
+           "input":"<!--test--",
+           "output":[["Character", "<!--test--"]],
+           "errors":[
+               { "code": "eof-in-script-html-comment-like-text", "line": 1, "col": 11 }
+           ]
+        },
+        {
+           "description":"EOF in script HTML comment double escaped after dash",
+           "initialStates":["Script data state"],
+           "input":"<!--<script>-",
+           "output":[["Character", "<!--<script>-"]],
+           "errors":[
+               { "code": "eof-in-script-html-comment-like-text", "line": 1, "col": 14 }
+           ]
+        },
+        {
+           "description":"EOF in script HTML comment double escaped after dash dash",
+           "initialStates":["Script data state"],
+           "input":"<!--<script>--",
+           "output":[["Character", "<!--<script>--"]],
+           "errors":[
+               { "code": "eof-in-script-html-comment-like-text", "line": 1, "col": 15 }
+           ]
+        },
+        {
+           "description":"EOF in script HTML comment - double escaped",
+           "initialStates":["Script data state"],
+           "input":"<!--<script>",
+           "output":[["Character", "<!--<script>"]],
+           "errors":[
+               { "code": "eof-in-script-html-comment-like-text", "line": 1, "col": 13 }
+           ]
+        },
+        {
+            "description":"Dash in script HTML comment",
+            "initialStates":["Script data state"],
+            "input":"<!-- - -->",
+            "output":[["Character", "<!-- - -->"]]
+        },
+        {
+            "description":"Dash less-than in script HTML comment",
+            "initialStates":["Script data state"],
+            "input":"<!-- -< -->",
+            "output":[["Character", "<!-- -< -->"]]
+        },
+        {
+            "description":"Dash at end of script HTML comment",
+            "initialStates":["Script data state"],
+            "input":"<!--test--->",
+            "output":[["Character", "<!--test--->"]]
+        },
+        {
+            "description":"</script> in script HTML comment",
+            "initialStates":["Script data state"],
+            "lastStartTag":"script",
+            "input":"<!-- </script> --></script>",
+            "output":[["Character", "<!-- "], ["EndTag", "script"], ["Character", " -->"], ["EndTag", "script"]]
+        },
+        {
+            "description":"</script> in script HTML comment - double escaped",
+            "initialStates":["Script data state"],
+            "lastStartTag":"script",
+            "input":"<!-- <script></script> --></script>",
+            "output":[["Character", "<!-- <script></script> -->"], ["EndTag", "script"]]
+        },
+        {
+            "description":"</script> in script HTML comment - double escaped with nested <script>",
+            "initialStates":["Script data state"],
+            "lastStartTag":"script",
+            "input":"<!-- <script><script></script></script> --></script>",
+            "output":[["Character", "<!-- <script><script></script>"], ["EndTag", "script"], ["Character", " -->"], ["EndTag", "script"]]
+        },
+        {
+            "description":"</script> in script HTML comment - double escaped with abrupt end",
+            "initialStates":["Script data state"],
+            "lastStartTag":"script",
+            "input":"<!-- <script>--></script> --></script>",
+            "output":[["Character", "<!-- <script>-->"], ["EndTag", "script"], ["Character", " -->"], ["EndTag", "script"]]
+        },
+        {
+            "description":"Incomplete start tag in script HTML comment double escaped",
+            "initialStates":["Script data state"],
+            "lastStartTag":"script",
+            "input":"<!--<scrip></script>-->",
+            "output":[["Character", "<!--<scrip>"], ["EndTag", "script"], ["Character", "-->"]]
+        },
+        {
+            "description":"Unclosed start tag in script HTML comment double escaped",
+            "initialStates":["Script data state"],
+            "lastStartTag":"script",
+            "input":"<!--<script</script>-->",
+            "output":[["Character", "<!--<script"], ["EndTag", "script"], ["Character", "-->"]]
+        },
+        {
+            "description":"Incomplete end tag in script HTML comment double escaped",
+            "initialStates":["Script data state"],
+            "lastStartTag":"script",
+            "input":"<!--<script></scrip>-->",
+            "output":[["Character", "<!--<script></scrip>-->"]]
+        },
+        {
+            "description":"Unclosed end tag in script HTML comment double escaped",
+            "initialStates":["Script data state"],
+            "lastStartTag":"script",
+            "input":"<!--<script></script-->",
+            "output":[["Character", "<!--<script></script-->"]]
+        },
+        {
+            "description":"leading U+FEFF must pass through",
+            "initialStates":["Data state", "RCDATA state", "RAWTEXT state", "Script data state"],
+            "doubleEscaped":true,
+            "input":"\\uFEFFfoo\\uFEFFbar",
+            "output":[["Character", "\\uFEFFfoo\\uFEFFbar"]]
+        },
+        {
+            "description":"Non BMP-charref in RCDATA",
+            "initialStates":["RCDATA state"],
+            "input":"&NotEqualTilde;",
+            "output":[["Character", "\u2242\u0338"]]
+        },
+        {
+            "description":"Bad charref in RCDATA",
+            "initialStates":["RCDATA state"],
+            "input":"&NotEqualTild;",
+            "output":[["Character", "&NotEqualTild;"]],
+            "errors":[
+               { "code": "unknown-named-character-reference", "line": 1, "col": 14 }
+            ]
+        },
+        {
+            "description":"lowercase endtags",
+            "initialStates":["RCDATA state", "RAWTEXT state", "Script data state"],
+            "lastStartTag":"xmp",
+            "input":"</XMP>",
+            "output":[["EndTag","xmp"]]
+        },
+        {
+            "description":"bad endtag (space before name)",
+            "initialStates":["RCDATA state", "RAWTEXT state", "Script data state"],
+            "lastStartTag":"xmp",
+            "input":"</ XMP>",
+            "output":[["Character","</ XMP>"]]
+        },
+        {
+            "description":"bad endtag (not matching last start tag)",
+            "initialStates":["RCDATA state", "RAWTEXT state", "Script data state"],
+            "lastStartTag":"xmp",
+            "input":"</xm>",
+            "output":[["Character","</xm>"]]
+        },
+        {
+            "description":"bad endtag (without close bracket)",
+            "initialStates":["RCDATA state", "RAWTEXT state", "Script data state"],
+            "lastStartTag":"xmp",
+            "input":"</xm ",
+            "output":[["Character","</xm "]]
+        },
+        {
+            "description":"bad endtag (trailing solidus)",
+            "initialStates":["RCDATA state", "RAWTEXT state", "Script data state"],
+            "lastStartTag":"xmp",
+            "input":"</xm/",
+            "output":[["Character","</xm/"]]
+        },
+        {
+            "description":"Non BMP-charref in attribute",
+            "input":"<p id=\"&NotEqualTilde;\">",
+            "output":[["StartTag", "p", {"id":"\u2242\u0338"}]]
+        },
+        {
+            "description":"--!NUL in comment ",
+            "doubleEscaped":true,
+            "input":"<!----!\\u0000-->",
+            "output":[["Comment", "--!\\uFFFD"]],
+            "errors":[
+                { "code": "unexpected-null-character", "line": 1, "col": 8 }
+            ]
+        },
+        {
+            "description":"space EOF after doctype ",
+            "input":"<!DOCTYPE html ",
+            "output":[["DOCTYPE", "html", null, null , false]],
+            "errors":[
+                { "code": "eof-in-doctype", "line": 1, "col": 16 }
+            ]
+        },
+        {
+            "description":"CDATA in HTML content",
+            "input":"<![CDATA[foo]]>",
+            "output":[["Comment", "[CDATA[foo]]"]],
+            "errors":[
+                { "code": "cdata-in-html-content", "line": 1, "col": 9 }
+            ]
+        },
+        {
+            "description":"CDATA content",
+            "input":"foo&#32;]]>",
+            "initialStates":["CDATA section state"],
+            "output":[["Character", "foo&#32;"]]
+        },
+        {
+            "description":"CDATA followed by HTML content",
+            "input":"foo&#32;]]>&#32;",
+            "initialStates":["CDATA section state"],
+            "output":[["Character", "foo&#32; "]]
+        },
+        {
+            "description":"CDATA with extra bracket",
+            "input":"foo]]]>",
+            "initialStates":["CDATA section state"],
+            "output":[["Character", "foo]"]]
+        },
+        {
+            "description":"CDATA without end marker",
+            "input":"foo",
+            "initialStates":["CDATA section state"],
+            "output":[["Character", "foo"]],
+            "errors":[
+                { "code": "eof-in-cdata", "line": 1, "col": 4 }
+            ]
+        },
+        {
+            "description":"CDATA with single bracket ending",
+            "input":"foo]",
+            "initialStates":["CDATA section state"],
+            "output":[["Character", "foo]"]],
+            "errors":[
+                { "code": "eof-in-cdata", "line": 1, "col": 5 }
+            ]
+        },
+        {
+            "description":"CDATA with two brackets ending",
+            "input":"foo]]",
+            "initialStates":["CDATA section state"],
+            "output":[["Character", "foo]]"]],
+            "errors":[
+                { "code": "eof-in-cdata", "line": 1, "col": 6 }
+            ]
+        }
+
+    ]
+}
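The first three tests in this file exercise the newline normalisation that belongs to input stream preprocessing: a CR or CRLF in `input` appears as a single LF in the expected comment data. A minimal Python sketch of that step, using the hypothetical helper name `normalize_newlines`:

```python
def normalize_newlines(text):
    """Replace each CRLF pair, then each remaining CR, with a single LF."""
    return text.replace("\r\n", "\n").replace("\r", "\n")
```

The order matters: replacing the lone CRs first would turn each CRLF into two LFs.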
--- a/lib/html5lib/tests/testdata/tokenizer/entities.test
+++ b/lib/html5lib/tests/testdata/tokenizer/entities.test
@@ -0,0 +1,542 @@
+{"tests": [
+
+{"description": "Undefined named entity in a double-quoted attribute value ending in semicolon and whose name starts with a known entity name.",
+"input":"<h a=\"&noti;\">",
+"output": [["StartTag", "h", {"a": "&noti;"}]]},
+
+{"description": "Entity name requiring semicolon instead followed by the equals sign in a double-quoted attribute value.",
+"input":"<h a=\"&lang=\">",
+"output": [["StartTag", "h", {"a": "&lang="}]]},
+
+{"description": "Valid entity name followed by the equals sign in a double-quoted attribute value.",
+"input":"<h a=\"&not=\">",
+"output": [["StartTag", "h", {"a": "&not="}]]},
+
+{"description": "Undefined named entity in a single-quoted attribute value ending in semicolon and whose name starts with a known entity name.",
+"input":"<h a='&noti;'>",
+"output": [["StartTag", "h", {"a": "&noti;"}]]},
+
+{"description": "Entity name requiring semicolon instead followed by the equals sign in a single-quoted attribute value.",
+"input":"<h a='&lang='>",
+"output": [["StartTag", "h", {"a": "&lang="}]]},
+
+{"description": "Valid entity name followed by the equals sign in a single-quoted attribute value.",
+"input":"<h a='&not='>",
+"output": [["StartTag", "h", {"a": "&not="}]]},
+
+{"description": "Undefined named entity in an unquoted attribute value ending in semicolon and whose name starts with a known entity name.",
+"input":"<h a=&noti;>",
+"output": [["StartTag", "h", {"a": "&noti;"}]]},
+
+{"description": "Entity name requiring semicolon instead followed by the equals sign in an unquoted attribute value.",
+"input":"<h a=&lang=>",
+"output": [["StartTag", "h", {"a": "&lang="}]],
+"errors":[
+    { "code": "unexpected-character-in-unquoted-attribute-value", "line": 1, "col": 11 }
+]},
+
+{"description": "Valid entity name followed by the equals sign in an unquoted attribute value.",
+"input":"<h a=&not=>",
+"output": [["StartTag", "h", {"a": "&not="}]],
+"errors":[
+    { "code": "unexpected-character-in-unquoted-attribute-value", "line": 1, "col": 10 }
+]},
+
+{"description": "Ambiguous ampersand.",
+"input":"&rrrraannddom;",
+"output": [["Character", "&rrrraannddom;"]],
+"errors":[
+    { "code": "unknown-named-character-reference", "line": 1, "col": 14 }
+]},
+
+{"description": "Semicolonless named entity 'not' followed by 'i;' in body",
+"input":"&noti;",
+"output": [["Character", "\u00ACi;"]],
+"errors":[
+    { "code": "missing-semicolon-after-character-reference", "line": 1, "col": 5 }
+]},
+
+{"description": "Very long undefined named entity in body",
+"input":"&ammmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmp;",
+"output": [["Character", "&ammmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmp;"]],
+"errors":[
+    { "code": "unknown-named-character-reference", "line": 1, "col": 950 }
+]},
+
+{"description": "CR as numeric entity",
+"input":"&#013;",
+"output": [["Character", "\r"]],
+"errors":[
+    { "code": "control-character-reference", "line": 1, "col": 7 }
+]},
+
+{"description": "CR as hexadecimal numeric entity",
+"input":"&#x00D;",
+"output": [["Character", "\r"]],
+"errors":[
+    { "code": "control-character-reference", "line": 1, "col": 8 }
+]},
+
+{"description": "Windows-1252 EURO SIGN numeric entity.",
+"input":"&#0128;",
+"output": [["Character", "\u20AC"]],
+"errors":[
+    { "code": "control-character-reference", "line": 1, "col": 8 }
+]},
+
+{"description": "Windows-1252 REPLACEMENT CHAR numeric entity.",
+"input":"&#0129;",
+"output": [["Character", "\u0081"]],
+"errors":[
+    { "code": "control-character-reference", "line": 1, "col": 8 }
+]},
+
+{"description": "Windows-1252 SINGLE LOW-9 QUOTATION MARK numeric entity.",
+"input":"&#0130;",
+"output": [["Character", "\u201A"]],
+"errors":[
+    { "code": "control-character-reference", "line": 1, "col": 8 }
+]},
+
+{"description": "Windows-1252 LATIN SMALL LETTER F WITH HOOK numeric entity.",
+"input":"&#0131;",
+"output": [["Character", "\u0192"]],
+"errors":[
+    { "code": "control-character-reference", "line": 1, "col": 8 }
+]},
+
+{"description": "Windows-1252 DOUBLE LOW-9 QUOTATION MARK numeric entity.",
+"input":"&#0132;",
+"output": [["Character", "\u201E"]],
+"errors":[
+    { "code": "control-character-reference", "line": 1, "col": 8 }
+]},
+
+{"description": "Windows-1252 HORIZONTAL ELLIPSIS numeric entity.",
+"input":"&#0133;",
+"output": [["Character", "\u2026"]],
+"errors":[
+    { "code": "control-character-reference", "line": 1, "col": 8 }
+]},
+
+{"description": "Windows-1252 DAGGER numeric entity.",
+"input":"&#0134;",
+"output": [["Character", "\u2020"]],
+"errors":[
+    { "code": "control-character-reference", "line": 1, "col": 8 }
+]},
+
+{"description": "Windows-1252 DOUBLE DAGGER numeric entity.",
+"input":"&#0135;",
+"output": [["Character", "\u2021"]],
+"errors":[
+    { "code": "control-character-reference", "line": 1, "col": 8 }
+]},
+
+{"description": "Windows-1252 MODIFIER LETTER CIRCUMFLEX ACCENT numeric entity.",
+"input":"&#0136;",
+"output": [["Character", "\u02C6"]],
+"errors":[
+    { "code": "control-character-reference", "line": 1, "col": 8 }
+]},
+
+{"description": "Windows-1252 PER MILLE SIGN numeric entity.",
+"input":"&#0137;",
+"output": [["Character", "\u2030"]],
+"errors":[
+    { "code": "control-character-reference", "line": 1, "col": 8 }
+]},
+
+{"description": "Windows-1252 LATIN CAPITAL LETTER S WITH CARON numeric entity.",
+"input":"&#0138;",
+"output": [["Character", "\u0160"]],
+"errors":[
+    { "code": "control-character-reference", "line": 1, "col": 8 }
+]},
+
+{"description": "Windows-1252 SINGLE LEFT-POINTING ANGLE QUOTATION MARK numeric entity.",
+"input":"&#0139;",
+"output": [["Character", "\u2039"]],
+"errors":[
+    { "code": "control-character-reference", "line": 1, "col": 8 }
+]},
+
+{"description": "Windows-1252 LATIN CAPITAL LIGATURE OE numeric entity.",
+"input":"&#0140;",
+"output": [["Character", "\u0152"]],
+"errors":[
+    { "code": "control-character-reference", "line": 1, "col": 8 }
+]},
+
+{"description": "Windows-1252 REPLACEMENT CHAR numeric entity.",
+"input":"&#0141;",
+"output": [["Character", "\u008D"]],
+"errors":[
+    { "code": "control-character-reference", "line": 1, "col": 8 }
+]},
+
+{"description": "Windows-1252 LATIN CAPITAL LETTER Z WITH CARON numeric entity.",
+"input":"&#0142;",
+"output": [["Character", "\u017D"]],
+"errors":[
+    { "code": "control-character-reference", "line": 1, "col": 8 }
+]},
+
+{"description": "Windows-1252 REPLACEMENT CHAR numeric entity.",
+"input":"&#0143;",
+"output": [["Character", "\u008F"]],
+"errors":[
+    { "code": "control-character-reference", "line": 1, "col": 8 }
+]},
+
+{"description": "Windows-1252 REPLACEMENT CHAR numeric entity.",
+"input":"&#0144;",
+"output": [["Character", "\u0090"]],
+"errors":[
+    { "code": "control-character-reference", "line": 1, "col": 8 }
+]},
+
+{"description": "Windows-1252 LEFT SINGLE QUOTATION MARK numeric entity.",
+"input":"&#0145;",
+"output": [["Character", "\u2018"]],
+"errors":[
+    { "code": "control-character-reference", "line": 1, "col": 8 }
+]},
+
+{"description": "Windows-1252 RIGHT SINGLE QUOTATION MARK numeric entity.",
+"input":"&#0146;",
+"output": [["Character", "\u2019"]],
+"errors":[
+    { "code": "control-character-reference", "line": 1, "col": 8 }
+]},
+
+{"description": "Windows-1252 LEFT DOUBLE QUOTATION MARK numeric entity.",
+"input":"&#0147;",
+"output": [["Character", "\u201C"]],
+"errors":[
+    { "code": "control-character-reference", "line": 1, "col": 8 }
+]},
+
+{"description": "Windows-1252 RIGHT DOUBLE QUOTATION MARK numeric entity.",
+"input":"&#0148;",
+"output": [["Character", "\u201D"]],
+"errors":[
+    { "code": "control-character-reference", "line": 1, "col": 8 }
+]},
+
+{"description": "Windows-1252 BULLET numeric entity.",
+"input":"&#0149;",
+"output": [["Character", "\u2022"]],
+"errors":[
+    { "code": "control-character-reference", "line": 1, "col": 8 }
+]},
+
+{"description": "Windows-1252 EN DASH numeric entity.",
+"input":"&#0150;",
+"output": [["Character", "\u2013"]],
+"errors":[
+    { "code": "control-character-reference", "line": 1, "col": 8 }
+]},
+
+{"description": "Windows-1252 EM DASH numeric entity.",
+"input":"&#0151;",
+"output": [["Character", "\u2014"]],
+"errors":[
+    { "code": "control-character-reference", "line": 1, "col": 8 }
+]},
+
+{"description": "Windows-1252 SMALL TILDE numeric entity.",
+"input":"&#0152;",
+"output": [["Character", "\u02DC"]],
+"errors":[
+    { "code": "control-character-reference", "line": 1, "col": 8 }
+]},
+
+{"description": "Windows-1252 TRADE MARK SIGN numeric entity.",
+"input":"&#0153;",
+"output": [["Character", "\u2122"]],
+"errors":[
+    { "code": "control-character-reference", "line": 1, "col": 8 }
+]},
+
+{"description": "Windows-1252 LATIN SMALL LETTER S WITH CARON numeric entity.",
+"input":"&#0154;",
+"output": [["Character", "\u0161"]],
+"errors":[
+    { "code": "control-character-reference", "line": 1, "col": 8 }
+]},
+
+{"description": "Windows-1252 SINGLE RIGHT-POINTING ANGLE QUOTATION MARK numeric entity.",
+"input":"&#0155;",
+"output": [["Character", "\u203A"]],
+"errors":[
+    { "code": "control-character-reference", "line": 1, "col": 8 }
+]},
+
+{"description": "Windows-1252 LATIN SMALL LIGATURE OE numeric entity.",
+"input":"&#0156;",
+"output": [["Character", "\u0153"]],
+"errors":[
+    { "code": "control-character-reference", "line": 1, "col": 8 }
+]},
+
+{"description": "Windows-1252 REPLACEMENT CHAR numeric entity.",
+"input":"&#0157;",
+"output": [["Character", "\u009D"]],
+"errors":[
+    { "code": "control-character-reference", "line": 1, "col": 8 }
+]},
+
+{"description": "Windows-1252 EURO SIGN hexadecimal numeric entity.",
+"input":"&#x080;",
+"output": [["Character", "\u20AC"]],
+"errors":[
+    { "code": "control-character-reference", "line": 1, "col": 8 }
+]},
+
+{"description": "Windows-1252 REPLACEMENT CHAR hexadecimal numeric entity.",
+"input":"&#x081;",
+"output": [["Character", "\u0081"]],
+"errors":[
+    { "code": "control-character-reference", "line": 1, "col": 8 }
+]},
+
+{"description": "Windows-1252 SINGLE LOW-9 QUOTATION MARK hexadecimal numeric entity.",
+"input":"&#x082;",
+"output": [["Character", "\u201A"]],
+"errors":[
+    { "code": "control-character-reference", "line": 1, "col": 8 }
+]},
+
+{"description": "Windows-1252 LATIN SMALL LETTER F WITH HOOK hexadecimal numeric entity.",
+"input":"&#x083;",
+"output": [["Character", "\u0192"]],
+"errors":[
+    { "code": "control-character-reference", "line": 1, "col": 8 }
+]},
+
+{"description": "Windows-1252 DOUBLE LOW-9 QUOTATION MARK hexadecimal numeric entity.",
+"input":"&#x084;",
+"output": [["Character", "\u201E"]],
+"errors":[
+    { "code": "control-character-reference", "line": 1, "col": 8 }
+]},
+
+{"description": "Windows-1252 HORIZONTAL ELLIPSIS hexadecimal numeric entity.",
+"input":"&#x085;",
+"output": [["Character", "\u2026"]],
+"errors":[
+    { "code": "control-character-reference", "line": 1, "col": 8 }
+]},
+
+{"description": "Windows-1252 DAGGER hexadecimal numeric entity.",
+"input":"&#x086;",
+"output": [["Character", "\u2020"]],
+"errors":[
+    { "code": "control-character-reference", "line": 1, "col": 8 }
+]},
+
+{"description": "Windows-1252 DOUBLE DAGGER hexadecimal numeric entity.",
+"input":"&#x087;",
+"output": [["Character", "\u2021"]],
+"errors":[
+    { "code": "control-character-reference", "line": 1, "col": 8 }
+]},
+
+{"description": "Windows-1252 MODIFIER LETTER CIRCUMFLEX ACCENT hexadecimal numeric entity.",
+"input":"&#x088;",
+"output": [["Character", "\u02C6"]],
+"errors":[
+    { "code": "control-character-reference", "line": 1, "col": 8 }
+]},
+
+{"description": "Windows-1252 PER MILLE SIGN hexadecimal numeric entity.",
+"input":"&#x089;",
+"output": [["Character", "\u2030"]],
+"errors":[
+    { "code": "control-character-reference", "line": 1, "col": 8 }
+]},
+
+{"description": "Windows-1252 LATIN CAPITAL LETTER S WITH CARON hexadecimal numeric entity.",
+"input":"&#x08A;",
+"output": [["Character", "\u0160"]],
+"errors":[
+    { "code": "control-character-reference", "line": 1, "col": 8 }
+]},
+
+{"description": "Windows-1252 SINGLE LEFT-POINTING ANGLE QUOTATION MARK hexadecimal numeric entity.",
+"input":"&#x08B;",
+"output": [["Character", "\u2039"]],
+"errors":[
+    { "code": "control-character-reference", "line": 1, "col": 8 }
+]},
+
+{"description": "Windows-1252 LATIN CAPITAL LIGATURE OE hexadecimal numeric entity.",
+"input":"&#x08C;",
+"output": [["Character", "\u0152"]],
+"errors":[
+    { "code": "control-character-reference", "line": 1, "col": 8 }
+]},
+
+{"description": "Windows-1252 REPLACEMENT CHAR hexadecimal numeric entity.",
+"input":"&#x08D;",
+"output": [["Character", "\u008D"]],
+"errors":[
+    { "code": "control-character-reference", "line": 1, "col": 8 }
+]},
+
+{"description": "Windows-1252 LATIN CAPITAL LETTER Z WITH CARON hexadecimal numeric entity.",
+"input":"&#x08E;",
+"output": [["Character", "\u017D"]],
+"errors":[
+    { "code": "control-character-reference", "line": 1, "col": 8 }
+]},
+
+{"description": "Windows-1252 REPLACEMENT CHAR hexadecimal numeric entity.",
+"input":"&#x08F;",
+"output": [["Character", "\u008F"]],
+"errors":[
+    { "code": "control-character-reference", "line": 1, "col": 8 }
+]},
+
+{"description": "Windows-1252 REPLACEMENT CHAR hexadecimal numeric entity.",
+"input":"&#x090;",
+"output": [["Character", "\u0090"]],
+"errors":[
+    { "code": "control-character-reference", "line": 1, "col": 8 }
+]},
+
+{"description": "Windows-1252 LEFT SINGLE QUOTATION MARK hexadecimal numeric entity.",
+"input":"&#x091;",
+"output": [["Character", "\u2018"]],
+"errors":[
+    { "code": "control-character-reference", "line": 1, "col": 8 }
+]},
+
+{"description": "Windows-1252 RIGHT SINGLE QUOTATION MARK hexadecimal numeric entity.",
+"input":"&#x092;",
+"output": [["Character", "\u2019"]],
+"errors":[
+    { "code": "control-character-reference", "line": 1, "col": 8 }
+]},
+
+{"description": "Windows-1252 LEFT DOUBLE QUOTATION MARK hexadecimal numeric entity.",
+"input":"&#x093;",
+"output": [["Character", "\u201C"]],
+"errors":[
+    { "code": "control-character-reference", "line": 1, "col": 8 }
+]},
+
+{"description": "Windows-1252 RIGHT DOUBLE QUOTATION MARK hexadecimal numeric entity.",
+"input":"&#x094;",
+"output": [["Character", "\u201D"]],
+"errors":[
+    { "code": "control-character-reference", "line": 1, "col": 8 }
+]},
+
+{"description": "Windows-1252 BULLET hexadecimal numeric entity.",
+"input":"&#x095;",
+"output": [["Character", "\u2022"]],
+"errors":[
+    { "code": "control-character-reference", "line": 1, "col": 8 }
+]},
+
+{"description": "Windows-1252 EN DASH hexadecimal numeric entity.",
+"input":"&#x096;",
+"output": [["Character", "\u2013"]],
+"errors":[
+    { "code": "control-character-reference", "line": 1, "col": 8 }
+]},
+
+{"description": "Windows-1252 EM DASH hexadecimal numeric entity.",
+"input":"&#x097;",
+"output": [["Character", "\u2014"]],
+"errors":[
+    { "code": "control-character-reference", "line": 1, "col": 8 }
+]},
+
+{"description": "Windows-1252 SMALL TILDE hexadecimal numeric entity.",
+"input":"&#x098;",
+"output": [["Character", "\u02DC"]],
+"errors":[
+    { "code": "control-character-reference", "line": 1, "col": 8 }
+]},
+
+{"description": "Windows-1252 TRADE MARK SIGN hexadecimal numeric entity.",
+"input":"&#x099;",
+"output": [["Character", "\u2122"]],
+"errors":[
+    { "code": "control-character-reference", "line": 1, "col": 8 }
+]},
+
+{"description": "Windows-1252 LATIN SMALL LETTER S WITH CARON hexadecimal numeric entity.",
+"input":"&#x09A;",
+"output": [["Character", "\u0161"]],
+"errors":[
+    { "code": "control-character-reference", "line": 1, "col": 8 }
+]},
+
+{"description": "Windows-1252 SINGLE RIGHT-POINTING ANGLE QUOTATION MARK hexadecimal numeric entity.",
+"input":"&#x09B;",
+"output": [["Character", "\u203A"]],
+"errors":[
+    { "code": "control-character-reference", "line": 1, "col": 8 }
+]},
+
+{"description": "Windows-1252 LATIN SMALL LIGATURE OE hexadecimal numeric entity.",
+"input":"&#x09C;",
+"output": [["Character", "\u0153"]],
+"errors":[
+    { "code": "control-character-reference", "line": 1, "col": 8 }
+]},
+
+{"description": "Windows-1252 REPLACEMENT CHAR hexadecimal numeric entity.",
+"input":"&#x09D;",
+"output": [["Character", "\u009D"]],
+"errors":[
+    { "code": "control-character-reference", "line": 1, "col": 8 }
+]},
+
+{"description": "Windows-1252 LATIN SMALL LETTER Z WITH CARON hexadecimal numeric entity.",
+"input":"&#x09E;",
+"output": [["Character", "\u017E"]],
+"errors":[
+    { "code": "control-character-reference", "line": 1, "col": 8 }
+]},
+
+{"description": "Windows-1252 LATIN CAPITAL LETTER Y WITH DIAERESIS hexadecimal numeric entity.",
+"input":"&#x09F;",
+"output": [["Character", "\u0178"]],
+"errors":[
+    { "code": "control-character-reference", "line": 1, "col": 8 }
+]},
+
+{"description": "Decimal numeric entity followed by hex character a.",
+"input":"&#97a",
+"output": [["Character", "aa"]],
+"errors":[
+    { "code": "missing-semicolon-after-character-reference", "line": 1, "col": 5 }
+]},
+
+{"description": "Decimal numeric entity followed by hex character A.",
+"input":"&#97A",
+"output": [["Character", "aA"]],
+"errors":[
+    { "code": "missing-semicolon-after-character-reference", "line": 1, "col": 5 }
+]},
+
+{"description": "Decimal numeric entity followed by hex character f.",
+"input":"&#97f",
+"output": [["Character", "af"]],
+"errors":[
+    { "code": "missing-semicolon-after-character-reference", "line": 1, "col": 5 }
+]},
+
+{"description": "Decimal numeric entity followed by hex character F.",
+"input":"&#97F",
+"output": [["Character", "aF"]],
+"errors":[
+    { "code": "missing-semicolon-after-character-reference", "line": 1, "col": 5 }
+]}
+
+]}
--- a/lib/html5lib/tests/testdata/tokenizer/escapeFlag.test
+++ b/lib/html5lib/tests/testdata/tokenizer/escapeFlag.test
@ -0,0 +1,36 @@
+{"tests": [
+
+{"description":"Commented close tag in RCDATA or RAWTEXT",
+"initialStates":["RCDATA state", "RAWTEXT state"],
+"lastStartTag":"xmp",
+"input":"foo<!--</xmp>--></xmp>",
+"output":[["Character", "foo<!--"], ["EndTag", "xmp"], ["Character", "-->"], ["EndTag", "xmp"]]},
+
+{"description":"Bogus comment in RCDATA or RAWTEXT",
+"initialStates":["RCDATA state", "RAWTEXT state"],
+"lastStartTag":"xmp",
+"input":"foo<!-->baz</xmp>",
+"output":[["Character", "foo<!-->baz"], ["EndTag", "xmp"]]},
+
+{"description":"End tag surrounded by bogus comment in RCDATA or RAWTEXT",
+"initialStates":["RCDATA state", "RAWTEXT state"],
+"lastStartTag":"xmp",
+"input":"foo<!--></xmp><!-->baz</xmp>",
+"output":[["Character", "foo<!-->"], ["EndTag", "xmp"], ["Comment", ""], ["Character", "baz"], ["EndTag", "xmp"]],
+"errors":[
+    { "code": "abrupt-closing-of-empty-comment", "line": 1, "col": 19 }
+]},
+
+{"description":"Commented entities in RCDATA",
+"initialStates":["RCDATA state"],
+"lastStartTag":"xmp",
+"input":" &amp; <!-- &amp; --> &amp; </xmp>",
+"output":[["Character", " & <!-- & --> & "], ["EndTag", "xmp"]]},
+
+{"description":"Incorrect comment ending sequences in RCDATA or RAWTEXT",
+"initialStates":["RCDATA state", "RAWTEXT state"],
+"lastStartTag":"xmp",
+"input":"foo<!-- x --x>x-- >x--!>x--<></xmp>",
+"output":[["Character", "foo<!-- x --x>x-- >x--!>x--<>"], ["EndTag", "xmp"]]}
+
+]}
--- a/lib/html5lib/tests/testdata/tokenizer/namedEntities.test
+++ b/lib/html5lib/tests/testdata/tokenizer/namedEntities.test
--- a/lib/html5lib/tests/testdata/tokenizer/numericEntities.test
+++ b/lib/html5lib/tests/testdata/tokenizer/numericEntities.test
--- a/lib/html5lib/tests/testdata/tokenizer/pendingSpecChanges.test
+++ b/lib/html5lib/tests/testdata/tokenizer/pendingSpecChanges.test
@ -0,0 +1,9 @@
+{"tests": [
+
+{"description":"<!---- >",
+"input":"<!---- >",
+"output":[["Comment","-- >"]],
+"errors":[
+    { "code": "eof-in-comment", "line": 1, "col": 9 }
+]}
+]}
--- a/lib/html5lib/tests/testdata/tokenizer/test1.test
+++ b/lib/html5lib/tests/testdata/tokenizer/test1.test
@ -0,0 +1,349 @@
+{"tests": [
+
+{"description":"Correct Doctype lowercase",
+"input":"<!DOCTYPE html>",
+"output":[["DOCTYPE", "html", null, null, true]]},
+
+
+{"description":"Correct Doctype uppercase",
+"input":"<!DOCTYPE HTML>",
+"output":[["DOCTYPE", "html", null, null, true]]},
+
+{"description":"Correct Doctype mixed case",
+"input":"<!DOCTYPE HtMl>",
+"output":[["DOCTYPE", "html", null, null, true]]},
+
+{"description":"Correct Doctype case with EOF",
+"input":"<!DOCTYPE HtMl",
+"output":[["DOCTYPE", "html", null, null, false]],
+"errors":[
+    { "code": "eof-in-doctype", "line": 1, "col": 15 }
+]},
+
+{"description":"Truncated doctype start",
+"input":"<!DOC>",
+"output":[["Comment", "DOC"]],
+"errors":[
+    { "code": "incorrectly-opened-comment", "line": 1, "col": 3 }
+]},
+
+{"description":"Doctype in error",
+"input":"<!DOCTYPE foo>",
+"output":[["DOCTYPE", "foo", null, null, true]]},
+
+{"description":"Single Start Tag",
+"input":"<h>",
+"output":[["StartTag", "h", {}]]},
+
+{"description":"Empty end tag",
+"input":"</>",
+"output":[],
+"errors":[
+    { "code": "missing-end-tag-name", "line": 1, "col": 3 }
+]},
+
+{"description":"Empty start tag",
+"input":"<>",
+"output":[["Character", "<>"]],
+"errors":[
+    { "code": "invalid-first-character-of-tag-name", "line": 1, "col": 2 }
+]},
+
+{"description":"Start Tag w/attribute",
+"input":"<h a='b'>",
+"output":[["StartTag", "h", {"a":"b"}]]},
+
+{"description":"Start Tag w/attribute no quotes",
+"input":"<h a=b>",
+"output":[["StartTag", "h", {"a":"b"}]]},
+
+{"description":"Start/End Tag",
+"input":"<h></h>",
+"output":[["StartTag", "h", {}], ["EndTag", "h"]]},
+
+{"description":"Two unclosed start tags",
+"input":"<p>One<p>Two",
+"output":[["StartTag", "p", {}], ["Character", "One"], ["StartTag", "p", {}], ["Character", "Two"]]},
+
+{"description":"End Tag w/attribute",
+"input":"<h></h a='b'>",
+"output":[["StartTag", "h", {}], ["EndTag", "h"]],
+"errors":[
+    { "code": "end-tag-with-attributes", "line": 1, "col": 13 }
+]},
+
+{"description":"Multiple atts",
+"input":"<h a='b' c='d'>",
+"output":[["StartTag", "h", {"a":"b", "c":"d"}]]},
+
+{"description":"Multiple atts no space",
+"input":"<h a='b'c='d'>",
+"output":[["StartTag", "h", {"a":"b", "c":"d"}]],
+"errors":[
+    { "code": "missing-whitespace-between-attributes", "line": 1, "col": 9 }
+]},
+
+{"description":"Repeated attr",
+ "input":"<h a='b' a='d'>",
+ "output":[["StartTag", "h", {"a":"b"}]],
+ "errors":[
+    { "code": "duplicate-attribute", "line": 1, "col": 11 }
+]},
+
+{"description":"Simple comment",
+ "input":"<!--comment-->",
+ "output":[["Comment", "comment"]]},
+
+{"description":"Comment, Central dash no space",
+ "input":"<!----->",
+ "output":[["Comment", "-"]]},
+
+{"description":"Comment, two central dashes",
+"input":"<!-- --comment -->",
+"output":[["Comment", " --comment "]]},
+
+{"description":"Comment, central less-than bang",
+"input":"<!--<!-->",
+"output":[["Comment", "<!"]]},
+
+{"description":"Unfinished comment",
+"input":"<!--comment",
+"output":[["Comment", "comment"]],
+"errors":[
+    { "code": "eof-in-comment", "line": 1, "col": 12 }
+]},
+
+{"description":"Unfinished comment after start of nested comment",
+"input":"<!-- <!--",
+"output":[["Comment", " <!"]],
+"errors":[
+    { "code": "eof-in-comment", "line": 1, "col": 10 }
+]},
+
+{"description":"Start of a comment",
+"input":"<!-",
+"output":[["Comment", "-"]],
+"errors":[
+    { "code": "incorrectly-opened-comment", "line": 1, "col": 3 }
+]},
+
+{"description":"Short comment",
+"input":"<!-->",
+"output":[["Comment", ""]],
+"errors":[
+    { "code": "abrupt-closing-of-empty-comment", "line": 1, "col": 5 }
+]},
+
+{"description":"Short comment two",
+"input":"<!--->",
+"output":[["Comment", ""]],
+"errors":[
+    { "code": "abrupt-closing-of-empty-comment", "line": 1, "col": 6 }
+]},
+
+{"description":"Short comment three",
+ "input":"<!---->",
+ "output":[["Comment", ""]]},
+
+{"description":"< in comment",
+"input":"<!-- <test-->",
+"output":[["Comment", " <test"]]},
+
+{"description":"<! in comment",
+"input":"<!-- <!test-->",
+"output":[["Comment", " <!test"]]},
+
+{"description":"<!- in comment",
+"input":"<!-- <!-test-->",
+"output":[["Comment", " <!-test"]]},
+
+{"description":"Nested comment",
+"input":"<!-- <!--test-->",
+"output":[["Comment", " <!--test"]],
+"errors":[
+    { "code": "nested-comment", "line": 1, "col": 10 }
+]},
+
+{"description":"Nested comment with extra <",
+"input":"<!-- <<!--test-->",
+"output":[["Comment", " <<!--test"]],
+"errors":[
+    { "code": "nested-comment", "line": 1, "col": 11 }
+]},
+
+{"description":"< in script data",
+"initialStates":["Script data state"],
+"input":"<test-->",
+"output":[["Character", "<test-->"]]},
+
+{"description":"<! in script data",
+"initialStates":["Script data state"],
+"input":"<!test-->",
+"output":[["Character", "<!test-->"]]},
+
+{"description":"<!- in script data",
+"initialStates":["Script data state"],
+"input":"<!-test-->",
+"output":[["Character", "<!-test-->"]]},
+
+{"description":"Escaped script data",
+"initialStates":["Script data state"],
+"input":"<!--test-->",
+"output":[["Character", "<!--test-->"]]},
+
+{"description":"< in script HTML comment",
+"initialStates":["Script data state"],
+"input":"<!-- < test -->",
+"output":[["Character", "<!-- < test -->"]]},
+
+{"description":"</ in script HTML comment",
+"initialStates":["Script data state"],
+"input":"<!-- </ test -->",
+"output":[["Character", "<!-- </ test -->"]]},
+
+{"description":"Start tag in script HTML comment",
+"initialStates":["Script data state"],
+"input":"<!-- <test> -->",
+"output":[["Character", "<!-- <test> -->"]]},
+
+{"description":"End tag in script HTML comment",
+"initialStates":["Script data state"],
+"input":"<!-- </test> -->",
+"output":[["Character", "<!-- </test> -->"]]},
+
+{"description":"- in script HTML comment double escaped",
+"initialStates":["Script data state"],
+"input":"<!--<script>-</script>-->",
+"output":[["Character", "<!--<script>-</script>-->"]]},
+
+{"description":"-- in script HTML comment double escaped",
+"initialStates":["Script data state"],
+"input":"<!--<script>--</script>-->",
+"output":[["Character", "<!--<script>--</script>-->"]]},
+
+{"description":"--- in script HTML comment double escaped",
+"initialStates":["Script data state"],
+"input":"<!--<script>---</script>-->",
+"output":[["Character", "<!--<script>---</script>-->"]]},
+
+{"description":"- spaced in script HTML comment double escaped",
+"initialStates":["Script data state"],
+"input":"<!--<script> - </script>-->",
+"output":[["Character", "<!--<script> - </script>-->"]]},
+
+{"description":"-- spaced in script HTML comment double escaped",
+"initialStates":["Script data state"],
+"input":"<!--<script> -- </script>-->",
+"output":[["Character", "<!--<script> -- </script>-->"]]},
+
+{"description":"Ampersand EOF",
+"input":"&",
+"output":[["Character", "&"]]},
+
+{"description":"Ampersand ampersand EOF",
+"input":"&&",
+"output":[["Character", "&&"]]},
+
+{"description":"Ampersand space EOF",
+"input":"& ",
+"output":[["Character", "& "]]},
+
+{"description":"Unfinished entity",
+"input":"&f",
+"output":[["Character", "&f"]]},
+
+{"description":"Ampersand, number sign",
+"input":"&#",
+"output":[["Character", "&#"]],
+"errors":[
+    { "code": "absence-of-digits-in-numeric-character-reference", "line": 1, "col": 3 }
+]},
+
+{"description":"Unfinished numeric entity",
+"input":"&#x",
+"output":[["Character", "&#x"]],
+"errors":[
+    { "code": "absence-of-digits-in-numeric-character-reference", "line": 1, "col": 4 }
+]},
+
+{"description":"Entity with trailing semicolon (1)",
+"input":"I'm &not;it",
+"output":[["Character","I'm \u00ACit"]]},
+
+{"description":"Entity with trailing semicolon (2)",
+"input":"I'm &notin;",
+"output":[["Character","I'm \u2209"]]},
+
+{"description":"Entity without trailing semicolon (1)",
+"input":"I'm &notit",
+"output":[["Character","I'm \u00ACit"]],
+"errors": [
+    {"code" : "missing-semicolon-after-character-reference", "line": 1, "col": 9 }
+]},
+
+{"description":"Entity without trailing semicolon (2)",
+"input":"I'm &notin",
+"output":[["Character","I'm \u00ACin"]],
+"errors": [
+    {"code" : "missing-semicolon-after-character-reference", "line": 1, "col": 9 }
+]},
+
+{"description":"Partial entity match at end of file",
+"input":"I'm &no",
+"output":[["Character","I'm &no"]]},
+
+{"description":"Non-ASCII character reference name",
+"input":"&\u00AC;",
+"output":[["Character", "&\u00AC;"]]},
+
+{"description":"ASCII decimal entity",
+"input":"&#0036;",
+"output":[["Character","$"]]},
+
+{"description":"ASCII hexadecimal entity",
+"input":"&#x3f;",
+"output":[["Character","?"]]},
+
+{"description":"Hexadecimal entity in attribute",
+"input":"<h a='&#x3f;'></h>",
+"output":[["StartTag", "h", {"a":"?"}], ["EndTag", "h"]]},
+
+{"description":"Entity in attribute without semicolon ending in x",
+"input":"<h a='&notx'>",
+"output":[["StartTag", "h", {"a":"&notx"}]]},
+
+{"description":"Entity in attribute without semicolon ending in 1",
+"input":"<h a='&not1'>",
+"output":[["StartTag", "h", {"a":"&not1"}]]},
+
+{"description":"Entity in attribute without semicolon ending in i",
+"input":"<h a='&noti'>",
+"output":[["StartTag", "h", {"a":"&noti"}]]},
+
+{"description":"Entity in attribute without semicolon",
+"input":"<h a='&COPY'>",
+"output":[["StartTag", "h", {"a":"\u00A9"}]],
+"errors": [
+    {"code" : "missing-semicolon-after-character-reference", "line": 1, "col": 12 }
+]},
+
+{"description":"Unquoted attribute ending in ampersand",
+"input":"<s o=& t>",
+"output":[["StartTag","s",{"o":"&","t":""}]]},
+
+{"description":"Unquoted attribute at end of tag with final character of &, with tag followed by characters",
+"input":"<a a=a&>foo",
+"output":[["StartTag", "a", {"a":"a&"}], ["Character", "foo"]]},
+
+{"description":"plaintext element",
+ "input":"<plaintext>foobar",
+ "output":[["StartTag","plaintext",{}], ["Character","foobar"]]},
+
+{"description":"Open angled bracket in unquoted attribute value state",
+ "input":"<a a=f<>",
+ "output":[["StartTag", "a", {"a":"f<"}]],
+ "errors":[
+    { "code": "unexpected-character-in-unquoted-attribute-value", "line": 1, "col": 7 }
+]}
+
+]}
--- a/lib/html5lib/tests/testdata/tokenizer/test2.test
+++ b/lib/html5lib/tests/testdata/tokenizer/test2.test
@ -0,0 +1,275 @@
+{"tests": [
+
+{"description":"DOCTYPE without name",
+"input":"<!DOCTYPE>",
+"output":[["DOCTYPE", null, null, null, false]],
+"errors":[
+    { "code": "missing-doctype-name", "line": 1, "col": 10 }
+]},
+
+{"description":"DOCTYPE without space before name",
+"input":"<!DOCTYPEhtml>",
+"output":[["DOCTYPE", "html", null, null, true]],
+"errors":[
+    { "code": "missing-whitespace-before-doctype-name", "line": 1, "col": 10 }
+]},
+
+{"description":"Incorrect DOCTYPE without a space before name",
+"input":"<!DOCTYPEfoo>",
+"output":[["DOCTYPE", "foo", null, null, true]],
+"errors":[
+    { "code": "missing-whitespace-before-doctype-name", "line": 1, "col": 10 }
+]},
+
+{"description":"DOCTYPE with publicId",
+"input":"<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML Transitional 4.01//EN\">",
+"output":[["DOCTYPE", "html", "-//W3C//DTD HTML Transitional 4.01//EN", null, true]]},
+
+{"description":"DOCTYPE with EOF after PUBLIC",
+"input":"<!DOCTYPE html PUBLIC",
+"output":[["DOCTYPE", "html", null, null, false]],
+"errors": [
+    { "code": "eof-in-doctype", "col": 22, "line": 1 }
+]},
+
+{"description":"DOCTYPE with EOF after PUBLIC '",
+"input":"<!DOCTYPE html PUBLIC '",
+"output":[["DOCTYPE", "html", "", null, false]],
+"errors": [
+    { "code": "eof-in-doctype", "col": 24, "line": 1 }
+]},
+
+{"description":"DOCTYPE with EOF after PUBLIC 'x",
+"input":"<!DOCTYPE html PUBLIC 'x",
+"output":[["DOCTYPE", "html", "x", null, false]],
+"errors": [
+    { "code": "eof-in-doctype", "col": 25, "line": 1 }
+]},
+
+{"description":"DOCTYPE with systemId",
+"input":"<!DOCTYPE html SYSTEM \"-//W3C//DTD HTML Transitional 4.01//EN\">",
+"output":[["DOCTYPE", "html", null, "-//W3C//DTD HTML Transitional 4.01//EN", true]]},
+
+{"description":"DOCTYPE with single-quoted systemId",
+"input":"<!DOCTYPE html SYSTEM '-//W3C//DTD HTML Transitional 4.01//EN'>",
+"output":[["DOCTYPE", "html", null, "-//W3C//DTD HTML Transitional 4.01//EN", true]]},
+
+{"description":"DOCTYPE with publicId and systemId",
+"input":"<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML Transitional 4.01//EN\" \"-//W3C//DTD HTML Transitional 4.01//EN\">",
+"output":[["DOCTYPE", "html", "-//W3C//DTD HTML Transitional 4.01//EN", "-//W3C//DTD HTML Transitional 4.01//EN", true]]},
+
+{"description":"DOCTYPE with > in double-quoted publicId",
+"input":"<!DOCTYPE html PUBLIC \">x",
+"output":[["DOCTYPE", "html", "", null, false], ["Character", "x"]],
+"errors": [
+    { "code": "abrupt-doctype-public-identifier", "col": 24, "line": 1 }
+]},
+
+{"description":"DOCTYPE with > in single-quoted publicId",
+"input":"<!DOCTYPE html PUBLIC '>x",
+"output":[["DOCTYPE", "html", "", null, false], ["Character", "x"]],
+"errors": [
+    { "code": "abrupt-doctype-public-identifier", "col": 24, "line": 1 }
+]},
+
+{"description":"DOCTYPE with > in double-quoted systemId",
+"input":"<!DOCTYPE html PUBLIC \"foo\" \">x",
+"output":[["DOCTYPE", "html", "foo", "", false], ["Character", "x"]],
+"errors": [
+    { "code": "abrupt-doctype-system-identifier", "col": 30, "line": 1 }
+]},
+
+{"description":"DOCTYPE with > in single-quoted systemId",
+"input":"<!DOCTYPE html PUBLIC 'foo' '>x",
+"output":[["DOCTYPE", "html", "foo", "", false], ["Character", "x"]],
+"errors": [
+    { "code": "abrupt-doctype-system-identifier", "col": 30, "line": 1 }
+]},
+
+{"description":"Incomplete doctype",
+"input":"<!DOCTYPE html ",
+"output":[["DOCTYPE", "html", null, null, false]],
+"errors":[
+    { "code": "eof-in-doctype", "line": 1, "col": 16 }
+]},
+
+{"description":"Numeric entity representing the NUL character",
+"input":"&#0000;",
+"output":[["Character", "\uFFFD"]],
+"errors":[
+    { "code": "null-character-reference", "line": 1, "col": 8 }
+]},
+
+{"description":"Hexadecimal entity representing the NUL character",
+"input":"&#x0000;",
+"output":[["Character", "\uFFFD"]],
+"errors":[
+    { "code": "null-character-reference", "line": 1, "col": 9 }
+]},
+
+{"description":"Numeric entity representing a codepoint after 1114111 (U+10FFFF)",
+"input":"&#2225222;",
+"output":[["Character", "\uFFFD"]],
+"errors":[
+    { "code": "character-reference-outside-unicode-range", "line": 1, "col": 11 }
+]},
+
+{"description":"Hexadecimal entity representing a codepoint after 1114111 (U+10FFFF)",
+"input":"&#x1010FFFF;",
+"output":[["Character", "\uFFFD"]],
+"errors":[
+    { "code": "character-reference-outside-unicode-range", "line": 1, "col": 13 }
+]},
+
+{"description":"Hexadecimal entity pair representing a surrogate pair",
+"input":"&#xD869;&#xDED6;",
+"output":[["Character", "\uFFFD\uFFFD"]],
+"errors":[
+    { "code": "surrogate-character-reference", "line": 1, "col": 9 },
+    { "code": "surrogate-character-reference", "line": 1, "col": 17 }
+]},
+
+{"description":"Hexadecimal entity with mixed uppercase and lowercase",
+"input":"&#xaBcD;",
+"output":[["Character", "\uABCD"]]},
+
+{"description":"Entity without a name",
+"input":"&;",
+"output":[["Character", "&;"]]},
+
+{"description":"Unescaped ampersand in attribute value",
+"input":"<h a='&'>",
+"output":[["StartTag", "h", { "a":"&" }]]},
+
+
+{"description":"StartTag containing <",
+"input":"<a<b>",
+"output":[["StartTag", "a<b", { }]]},
+
+{"description":"Non-void element containing trailing /",
+"input":"<h/>",
+"output":[["StartTag","h",{},true]]},
+
+{"description":"Void element with permitted slash",
+"input":"<br/>",
+"output":[["StartTag","br",{},true]]},
+
+{"description":"Void element with permitted slash (with attribute)",
+"input":"<br foo='bar'/>",
+"output":[["StartTag","br",{"foo":"bar"},true]]},
+
+{"description":"StartTag containing /",
+"input":"<h/a='b'>",
+"output":[["StartTag", "h", { "a":"b" }]],
+"errors":[
+    { "code": "unexpected-solidus-in-tag", "line": 1, "col": 4 }
+]},
+
+{"description":"Double-quoted attribute value",
+"input":"<h a=\"b\">",
+"output":[["StartTag", "h", { "a":"b" }]]},
+
+{"description":"Unescaped </",
+"input":"</",
+"output":[["Character", "</"]],
+"errors":[
+    { "code": "eof-before-tag-name", "line": 1, "col": 3 }
+]},
+
+{"description":"Illegal end tag name",
+"input":"</1>",
+"output":[["Comment", "1"]],
+"errors":[
+    { "code": "invalid-first-character-of-tag-name", "line": 1, "col": 3 }
+]},
+
+{"description":"Pseudo processing instruction",
+"input":"<?namespace>",
+"output":[["Comment", "?namespace"]],
+"errors":[
+    { "code": "unexpected-question-mark-instead-of-tag-name", "line": 1, "col": 2 }
+]},
+
+{"description":"A bogus comment stops at >, even if preceded by two dashes",
+"input":"<?foo-->",
+"output":[["Comment", "?foo--"]],
+"errors":[
+    { "code": "unexpected-question-mark-instead-of-tag-name", "line": 1, "col": 2 }
+]},
+
+{"description":"Unescaped <",
+"input":"foo < bar",
+"output":[["Character", "foo < bar"]],
+"errors":[
+    { "code": "invalid-first-character-of-tag-name", "line": 1, "col": 6 }
+]},
+
+{"description":"Null Byte Replacement",
+"input":"\u0000",
+"output":[["Character", "\u0000"]],
+"errors":[
+    { "code": "unexpected-null-character", "line": 1, "col": 1 }
+]},
+
+{"description":"Comment with dash",
+"input":"<!---x",
+"output":[["Comment", "-x"]],
+"errors":[
+    { "code": "eof-in-comment", "line": 1, "col": 7 }
+]},
+
+{"description":"Entity + newline",
+"input":"\nx\n&gt;\n",
+"output":[["Character","\nx\n>\n"]]},
+
+{"description":"Start tag with no attributes but space before the greater-than sign",
+"input":"<h >",
+"output":[["StartTag", "h", {}]]},
+
+{"description":"Empty attribute followed by uppercase attribute",
+"input":"<h a B=''>",
+"output":[["StartTag", "h", {"a":"", "b":""}]]},
+
+{"description":"Double-quote after attribute name",
+"input":"<h a \">",
+"output":[["StartTag", "h", {"a":"", "\"":""}]],
+"errors":[
+    { "code": "unexpected-character-in-attribute-name", "line": 1, "col": 6 }
+]},
+
+{"description":"Single-quote after attribute name",
+"input":"<h a '>",
+"output":[["StartTag", "h", {"a":"", "'":""}]],
+"errors":[
+    { "code": "unexpected-character-in-attribute-name", "line": 1, "col": 6 }
+]},
+
+{"description":"Empty end tag with following characters",
+"input":"a</>bc",
+"output":[["Character", "abc"]],
+"errors":[
+    { "code": "missing-end-tag-name", "line": 1, "col": 4 }
+]},
+
+{"description":"Empty end tag with following tag",
+"input":"a</><b>c",
+"output":[["Character", "a"], ["StartTag", "b", {}], ["Character", "c"]],
+"errors":[
+    { "code": "missing-end-tag-name", "line": 1, "col": 4 }
+]},
+
+{"description":"Empty end tag with following comment",
+"input":"a</><!--b-->c",
+"output":[["Character", "a"], ["Comment", "b"], ["Character", "c"]],
+"errors":[
+    { "code": "missing-end-tag-name", "line": 1, "col": 4 }
+]},
+
+{"description":"Empty end tag with following end tag",
+"input":"a</></b>c",
+"output":[["Character", "a"], ["EndTag", "b"], ["Character", "c"]],
+"errors":[
+    { "code": "missing-end-tag-name", "line": 1, "col": 4 }
+]}
+
+]}
--- a/lib/html5lib/tests/testdata/tokenizer/test3.test
+++ b/lib/html5lib/tests/testdata/tokenizer/test3.test
--- a/lib/html5lib/tests/testdata/tokenizer/test4.test
+++ b/lib/html5lib/tests/testdata/tokenizer/test4.test
@ -0,0 +1,532 @@
+{"tests": [
+
+{"description":"< in attribute name",
+"input":"<z/0  <>",
+"output":[["StartTag", "z", {"0": "", "<": ""}]],
+"errors":[
+    { "code": "unexpected-solidus-in-tag", "line": 1, "col": 4 },
+    { "code": "unexpected-character-in-attribute-name", "line": 1, "col": 7 }
+]},
+
+{"description":"< in unquoted attribute value",
+"input":"<z x=<>",
+"output":[["StartTag", "z", {"x": "<"}]],
+"errors":[
+    { "code": "unexpected-character-in-unquoted-attribute-value", "line": 1, "col": 6 }
+]},
+
+{"description":"= in unquoted attribute value",
+"input":"<z z=z=z>",
+"output":[["StartTag", "z", {"z": "z=z"}]],
+"errors":[
+    { "code": "unexpected-character-in-unquoted-attribute-value", "line": 1, "col": 7 }
+]},
+
+{"description":"= attribute",
+"input":"<z =>",
+"output":[["StartTag", "z", {"=": ""}]],
+"errors":[
+    { "code": "unexpected-equals-sign-before-attribute-name", "line": 1, "col": 4 }
+]},
+
+{"description":"== attribute",
+"input":"<z ==>",
+"output":[["StartTag", "z", {"=": ""}]],
+"errors":[
+    { "code": "unexpected-equals-sign-before-attribute-name", "line": 1, "col": 4 },
+    { "code": "missing-attribute-value", "line": 1, "col": 6 }
+]},
+
+{"description":"=== attribute",
+"input":"<z ===>",
+"output":[["StartTag", "z", {"=": "="}]],
+"errors":[
+    { "code": "unexpected-equals-sign-before-attribute-name", "line": 1, "col": 4 },
+    { "code": "unexpected-character-in-unquoted-attribute-value", "line": 1, "col": 6 }
+]},
+
+{"description":"==== attribute",
+"input":"<z ====>",
+"output":[["StartTag", "z", {"=": "=="}]],
+"errors":[
+    { "code": "unexpected-equals-sign-before-attribute-name", "line": 1, "col": 4 },
+    { "code": "unexpected-character-in-unquoted-attribute-value", "line": 1, "col": 6 },
+    { "code": "unexpected-character-in-unquoted-attribute-value", "line": 1, "col": 7 }
+]},
+
+{"description":"\" after ampersand in double-quoted attribute value",
+"input":"<z z=\"&\">",
+"output":[["StartTag", "z", {"z": "&"}]]},
+
+{"description":"' after ampersand in double-quoted attribute value",
+"input":"<z z=\"&'\">",
+"output":[["StartTag", "z", {"z": "&'"}]]},
+
+{"description":"' after ampersand in single-quoted attribute value",
+"input":"<z z='&'>",
+"output":[["StartTag", "z", {"z": "&"}]]},
+
+{"description":"\" after ampersand in single-quoted attribute value",
+"input":"<z z='&\"'>",
+"output":[["StartTag", "z", {"z": "&\""}]]},
+
+{"description":"Text after bogus character reference",
+"input":"<z z='&xlink_xmlns;'>bar<z>",
+"output":[["StartTag","z",{"z":"&xlink_xmlns;"}],["Character","bar"],["StartTag","z",{}]]},
+
+{"description":"Text after hex character reference",
+"input":"<z z='&#x0020; foo'>bar<z>",
+"output":[["StartTag","z",{"z":"  foo"}],["Character","bar"],["StartTag","z",{}]]},
+
+{"description":"Attribute name starting with \"",
+"input":"<foo \"='bar'>",
+"output":[["StartTag", "foo", {"\"": "bar"}]],
+"errors":[
+    { "code": "unexpected-character-in-attribute-name", "line": 1, "col": 6 }
+]},
+
+{"description":"Attribute name starting with '",
+"input":"<foo '='bar'>",
+"output":[["StartTag", "foo", {"'": "bar"}]],
+"errors":[
+    { "code": "unexpected-character-in-attribute-name", "line": 1, "col": 6 }
+]},
+
+{"description":"Attribute name containing \"",
+"input":"<foo a\"b='bar'>",
+"output":[["StartTag", "foo", {"a\"b": "bar"}]],
+"errors":[
+    { "code": "unexpected-character-in-attribute-name", "line": 1, "col": 7 }
+]},
+
+{"description":"Attribute name containing '",
+"input":"<foo a'b='bar'>",
+"output":[["StartTag", "foo", {"a'b": "bar"}]],
+"errors":[
+    { "code": "unexpected-character-in-attribute-name", "line": 1, "col": 7 }
+]},
+
+{"description":"Unquoted attribute value containing '",
+"input":"<foo a=b'c>",
+"output":[["StartTag", "foo", {"a": "b'c"}]],
+"errors":[
+    { "code": "unexpected-character-in-unquoted-attribute-value", "line": 1, "col": 9 }
+]},
+
+
+{"description":"Unquoted attribute value containing \"",
+"input":"<foo a=b\"c>",
+"output":[["StartTag", "foo", {"a": "b\"c"}]],
+"errors":[
+    { "code": "unexpected-character-in-unquoted-attribute-value", "line": 1, "col": 9 }
+]},
+
+{"description":"Double-quoted attribute value not followed by whitespace",
+"input":"<foo a=\"b\"c>",
+"output":[["StartTag", "foo", {"a": "b", "c": ""}]],
+"errors":[
+    { "code": "missing-whitespace-between-attributes", "line": 1, "col": 11 }
+]},
+
+{"description":"Single-quoted attribute value not followed by whitespace",
+"input":"<foo a='b'c>",
+"output":[["StartTag", "foo", {"a": "b", "c": ""}]],
+"errors":[
+    { "code": "missing-whitespace-between-attributes", "line": 1, "col": 11 }
+]},
+
+{"description":"Quoted attribute followed by permitted /",
+"input":"<br a='b'/>",
+"output":[["StartTag","br",{"a":"b"},true]]},
+
+{"description":"Quoted attribute followed by non-permitted /",
+"input":"<bar a='b'/>",
+"output":[["StartTag","bar",{"a":"b"},true]]},
+
+{"description":"CR EOF after doctype name",
+"input":"<!doctype html \r",
+"output":[["DOCTYPE", "html", null, null, false]],
+"errors":[
+    { "code": "eof-in-doctype", "line": 2, "col": 1 }
+]},
+
+{"description":"CR EOF in tag name",
+"input":"<z\r",
+"output":[],
+"errors":[
+    { "code": "eof-in-tag", "line": 2, "col": 1 }
+]},
+
+{"description":"Slash EOF in tag name",
+"input":"<z/",
+"output":[],
+"errors":[
+    { "code": "eof-in-tag", "line": 1, "col": 4 }
+]},
+
+{"description":"Zero hex numeric entity",
+"input":"&#x0",
+"output":[["Character", "\uFFFD"]],
+"errors":[
+    { "code": "missing-semicolon-after-character-reference", "line": 1, "col": 5 },
+    { "code": "null-character-reference", "line": 1, "col": 5 }
+]},
+
+{"description":"Zero decimal numeric entity",
+"input":"&#0",
+"output":[["Character", "\uFFFD"]],
+"errors":[
+    { "code": "missing-semicolon-after-character-reference", "line": 1, "col": 4 },
+    { "code": "null-character-reference", "line": 1, "col": 4 }
+]},
+
+{"description":"Zero-prefixed hex numeric entity",
+"input":"&#x000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000041;",
+"output":[["Character", "A"]]},
+
+{"description":"Zero-prefixed decimal numeric entity",
+"input":"&#000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000065;",
+"output":[["Character", "A"]]},
+
+{"description":"Empty hex numeric entities",
+"input":"&#x &#X ",
+"output":[["Character", "&#x &#X "]],
+"errors":[
+    { "code": "absence-of-digits-in-numeric-character-reference", "line": 1, "col": 4 },
+    { "code": "absence-of-digits-in-numeric-character-reference", "line": 1, "col": 8 }
+]},
+
+{"description":"Invalid digit in hex numeric entity",
+"input":"&#xZ",
+"output":[["Character", "&#xZ"]],
+"errors":[
+    { "code": "absence-of-digits-in-numeric-character-reference", "line": 1, "col": 4 }
+]},
+
+{"description":"Empty decimal numeric entities",
+"input":"&# &#; ",
+"output":[["Character", "&# &#; "]],
+"errors":[
+    { "code": "absence-of-digits-in-numeric-character-reference", "line": 1, "col": 3 },
+    { "code": "absence-of-digits-in-numeric-character-reference", "line": 1, "col": 6 }
+]},
+
+{"description":"Invalid digit in decimal numeric entity",
+"input":"&#A",
+"output":[["Character", "&#A"]],
+"errors":[
+    { "code": "absence-of-digits-in-numeric-character-reference", "line": 1, "col": 3 }
+]},
+
+{"description":"Non-BMP numeric entity",
+"input":"&#x10000;",
+"output":[["Character", "\uD800\uDC00"]]},
+
+{"description":"Maximum non-BMP numeric entity",
+"input":"&#X10FFFF;",
+"output":[["Character", "\uDBFF\uDFFF"]],
+"errors":[
+    { "code": "noncharacter-character-reference", "line": 1, "col": 11 }
+]},
+
+
+{"description":"Above maximum numeric entity",
+"input":"&#x110000;",
+"output":[["Character", "\uFFFD"]],
+"errors":[
+    { "code": "character-reference-outside-unicode-range", "line": 1, "col": 11 }
+]},
+
+{"description":"32-bit hex numeric entity",
+"input":"&#x80000041;",
+"output":[["Character", "\uFFFD"]],
+"errors":[
+    { "code": "character-reference-outside-unicode-range", "line": 1, "col": 13 }
+]},
+
+{"description":"33-bit hex numeric entity",
+"input":"&#x100000041;",
+"output":[["Character", "\uFFFD"]],
+"errors":[
+    { "code": "character-reference-outside-unicode-range", "line": 1, "col": 14 }
+]},
+
+{"description":"33-bit decimal numeric entity",
+"input":"&#4294967361;",
+"output":[["Character", "\uFFFD"]],
+"errors":[
+    { "code": "character-reference-outside-unicode-range", "line": 1, "col": 14 }
+]},
+
+{"description":"65-bit hex numeric entity",
+"input":"&#x10000000000000041;",
+"output":[["Character", "\uFFFD"]],
+"errors":[
+    { "code": "character-reference-outside-unicode-range", "line": 1, "col": 22 }
+]},
+
+{"description":"65-bit decimal numeric entity",
+"input":"&#18446744073709551681;",
+"output":[["Character", "\uFFFD"]],
+"errors":[
+    { "code": "character-reference-outside-unicode-range", "line": 1, "col": 24 }
+]},
+
+{"description":"Surrogate code point edge cases",
+"input":"&#xD7FF;&#xD800;&#xD801;&#xDFFE;&#xDFFF;&#xE000;",
+"output":[["Character", "\uD7FF\uFFFD\uFFFD\uFFFD\uFFFD\uE000"]],
+"errors":[
+    { "code": "surrogate-character-reference", "line": 1, "col": 17 },
+    { "code": "surrogate-character-reference", "line": 1, "col": 25 },
+    { "code": "surrogate-character-reference", "line": 1, "col": 33 },
+    { "code": "surrogate-character-reference", "line": 1, "col": 41 }
+]},
+
+{"description":"Uppercase start tag name",
+"input":"<X>",
+"output":[["StartTag", "x", {}]]},
+
+{"description":"Uppercase end tag name",
+"input":"</X>",
+"output":[["EndTag", "x"]]},
+
+{"description":"Uppercase attribute name",
+"input":"<x X>",
+"output":[["StartTag", "x", { "x":"" }]]},
+
+{"description":"Tag/attribute name case edge values",
+"input":"<x@AZ[`az{ @AZ[`az{>",
+"output":[["StartTag", "x@az[`az{", { "@az[`az{":"" }]]},
+
+{"description":"Duplicate different-case attributes",
+"input":"<x x=1 x=2 X=3>",
+"output":[["StartTag", "x", { "x":"1" }]],
+"errors":[
+    { "code": "duplicate-attribute", "line": 1, "col": 9 },
+    { "code": "duplicate-attribute", "line": 1, "col": 13 }
+]},
+
+{"description":"Uppercase close tag attributes",
+"input":"</x X>",
+"output":[["EndTag", "x"]],
+"errors":[
+    { "code": "end-tag-with-attributes", "line": 1, "col": 6 }
+]},
+
+{"description":"Duplicate close tag attributes",
+"input":"</x x x>",
+"output":[["EndTag", "x"]],
+"errors":[
+    { "code": "duplicate-attribute", "line": 1, "col": 8 },
+    { "code": "end-tag-with-attributes", "line": 1, "col": 8 }
+]},
+
+{"description":"Permitted slash",
+"input":"<br/>",
+"output":[["StartTag","br",{},true]]},
+
+{"description":"Non-permitted slash",
+"input":"<xr/>",
+"output":[["StartTag","xr",{},true]]},
+
+{"description":"Permitted slash but in close tag",
+"input":"</br/>",
+"output":[["EndTag", "br"]],
+"errors":[
+    { "code": "end-tag-with-trailing-solidus", "line": 1, "col": 6 }
+]},
+
+{"description":"Doctype public case-sensitivity (1)",
+"input":"<!DoCtYpE HtMl PuBlIc \"AbC\" \"XyZ\">",
+"output":[["DOCTYPE", "html", "AbC", "XyZ", true]]},
+
+{"description":"Doctype public case-sensitivity (2)",
+"input":"<!dOcTyPe hTmL pUbLiC \"aBc\" \"xYz\">",
+"output":[["DOCTYPE", "html", "aBc", "xYz", true]]},
+
+{"description":"Doctype system case-sensitivity (1)",
+"input":"<!DoCtYpE HtMl SyStEm \"XyZ\">",
+"output":[["DOCTYPE", "html", null, "XyZ", true]]},
+
+{"description":"Doctype system case-sensitivity (2)",
+"input":"<!dOcTyPe hTmL sYsTeM \"xYz\">",
+"output":[["DOCTYPE", "html", null, "xYz", true]]},
+
+{"description":"U+0000 in lookahead region after non-matching character",
+"input":"<!doc>\u0000",
+"output":[["Comment", "doc"], ["Character", "\u0000"]],
+"errors":[
+    { "code": "incorrectly-opened-comment", "line": 1, "col": 3 },
+    { "code": "unexpected-null-character", "line": 1, "col": 7 }
+]},
+
+{"description":"U+0000 in lookahead region",
+"input":"<!doc\u0000",
+"output":[["Comment", "doc\uFFFD"]],
+"errors":[
+    { "code": "incorrectly-opened-comment", "line": 1, "col": 3 },
+    { "code": "unexpected-null-character", "line": 1, "col": 6 }
+]},
+
+{"description":"U+0080 in lookahead region",
+"input":"<!doc\u0080",
+"output":[["Comment", "doc\u0080"]],
+"errors":[
+    { "code": "incorrectly-opened-comment", "line": 1, "col": 3 },
+    { "code": "control-character-in-input-stream", "line": 1, "col": 6 }
+]},
+
+{"description":"U+FDD1 in lookahead region",
+"input":"<!doc\uFDD1",
+"output":[["Comment", "doc\uFDD1"]],
+"errors":[
+    { "code": "incorrectly-opened-comment", "line": 1, "col": 3 },
+    { "code": "noncharacter-in-input-stream", "line": 1, "col": 6 }
+]},
+
+{"description":"U+1FFFF in lookahead region",
+"input":"<!doc\uD83F\uDFFF",
+"output":[["Comment", "doc\uD83F\uDFFF"]],
+"errors":[
+    { "code": "incorrectly-opened-comment", "line": 1, "col": 3 },
+    { "code": "noncharacter-in-input-stream", "line": 1, "col": 6 }
+]},
+
+{"description":"CR followed by non-LF",
+"input":"\r?",
+"output":[["Character", "\n?"]]},
+
+{"description":"CR at EOF",
+"input":"\r",
+"output":[["Character", "\n"]]},
+
+{"description":"LF at EOF",
+"input":"\n",
+"output":[["Character", "\n"]]},
+
+{"description":"CR LF",
+"input":"\r\n",
+"output":[["Character", "\n"]]},
+
+{"description":"CR CR",
+"input":"\r\r",
+"output":[["Character", "\n\n"]]},
+
+{"description":"LF LF",
+"input":"\n\n",
+"output":[["Character", "\n\n"]]},
+
+{"description":"LF CR",
+"input":"\n\r",
+"output":[["Character", "\n\n"]]},
+
+{"description":"text CR CR CR text",
+"input":"text\r\r\rtext",
+"output":[["Character", "text\n\n\ntext"]]},
+
+{"description":"Doctype publik",
+"input":"<!DOCTYPE html PUBLIK \"AbC\" \"XyZ\">",
+"output":[["DOCTYPE", "html", null, null, false]],
+"errors":[
+    { "code": "invalid-character-sequence-after-doctype-name", "line": 1, "col": 16 }
+]},
+
+{"description":"Doctype publi",
+"input":"<!DOCTYPE html PUBLI",
+"output":[["DOCTYPE", "html", null, null, false]],
+"errors":[
+    { "code": "invalid-character-sequence-after-doctype-name", "line": 1, "col": 16 }
+]},
+
+{"description":"Doctype sistem",
+"input":"<!DOCTYPE html SISTEM \"AbC\">",
+"output":[["DOCTYPE", "html", null, null, false]],
+"errors":[
+    { "code": "invalid-character-sequence-after-doctype-name", "line": 1, "col": 16 }
+]},
+
+{"description":"Doctype sys",
+"input":"<!DOCTYPE html SYS",
+"output":[["DOCTYPE", "html", null, null, false]],
+"errors":[
+    { "code": "invalid-character-sequence-after-doctype-name", "line": 1, "col": 16 }
+]},
+
+{"description":"Doctype html x>text",
+"input":"<!DOCTYPE html x>text",
+"output":[["DOCTYPE", "html", null, null, false], ["Character", "text"]],
+"errors":[
+    { "code": "invalid-character-sequence-after-doctype-name", "line": 1, "col": 16 }
+]},
+
+{"description":"Grave accent in unquoted attribute",
+"input":"<a a=aa`>",
+"output":[["StartTag", "a", {"a":"aa`"}]],
+"errors":[
+    { "code": "unexpected-character-in-unquoted-attribute-value", "line": 1, "col": 8 }
+]},
+
+{"description":"EOF in tag name state",
+"input":"<a",
+"output":[],
+"errors":[
+    { "code": "eof-in-tag", "line": 1, "col": 3 }
+]},
+
+{"description":"EOF in before attribute name state",
+"input":"<a ",
+"output":[],
+"errors":[
+    { "code": "eof-in-tag", "line": 1, "col": 4 }
+]},
+
+{"description":"EOF in attribute name state",
+"input":"<a a",
+"output":[],
+"errors":[
+    { "code": "eof-in-tag", "line": 1, "col": 5 }
+]},
+
+{"description":"EOF in after attribute name state",
+"input":"<a a ",
+"output":[],
+"errors":[
+    { "code": "eof-in-tag", "line": 1, "col": 6 }
+]},
+
+{"description":"EOF in before attribute value state",
+"input":"<a a =",
+"output":[],
+"errors":[
+    { "code": "eof-in-tag", "line": 1, "col": 7 }
+]},
+
+{"description":"EOF in attribute value (double quoted) state",
+"input":"<a a =\"a",
+"output":[],
+"errors":[
+    { "code": "eof-in-tag", "line": 1, "col": 9 }
+]},
+
+{"description":"EOF in attribute value (single quoted) state",
+"input":"<a a ='a",
+"output":[],
+"errors":[
+    { "code": "eof-in-tag", "line": 1, "col": 9 }
+]},
+
+{"description":"EOF in attribute value (unquoted) state",
+"input":"<a a =a",
+"output":[],
+"errors":[
+    { "code": "eof-in-tag", "line": 1, "col": 8 }
+]},
+
+{"description":"EOF in after attribute value state",
+"input":"<a a ='a'",
+"output":[],
+"errors":[
+    { "code": "eof-in-tag", "line": 1, "col": 10 }
+]}
+
+]}
--- a/lib/html5lib/tests/testdata/tokenizer/unicodeChars.test
+++ b/lib/html5lib/tests/testdata/tokenizer/unicodeChars.test
--- a/lib/html5lib/tests/testdata/tokenizer/unicodeCharsProblematic.test
+++ b/lib/html5lib/tests/testdata/tokenizer/unicodeCharsProblematic.test
@ -0,0 +1,41 @@
+{"tests" : [
+{"description": "Invalid Unicode character U+DFFF",
+"doubleEscaped":true,
+"input": "\\uDFFF",
+"output":[["Character", "\\uDFFF"]],
+"errors":[
+    { "code": "surrogate-in-input-stream", "line": 1, "col": 1 }
+]},
+
+{"description": "Invalid Unicode character U+D800",
+"doubleEscaped":true,
+"input": "\\uD800",
+"output":[["Character", "\\uD800"]],
+"errors":[
+    { "code": "surrogate-in-input-stream", "line": 1, "col": 1 }
+]},
+
+{"description": "Invalid Unicode character U+DFFF with valid preceding character",
+"doubleEscaped":true,
+"input": "a\\uDFFF",
+"output":[["Character", "a\\uDFFF"]],
+"errors":[
+    { "code": "surrogate-in-input-stream", "line": 1, "col": 2 }
+]},
+
+{"description": "Invalid Unicode character U+D800 with valid following character",
+"doubleEscaped":true,
+"input": "\\uD800a",
+"output":[["Character", "\\uD800a"]],
+"errors":[
+    { "code": "surrogate-in-input-stream", "line": 1, "col": 1 }
+]},
+
+{"description":"CR followed by U+0000",
+"input":"\r\u0000",
+"output":[["Character", "\n\u0000"]],
+"errors":[
+    { "code": "unexpected-null-character", "line": 2, "col": 1 }
+]}
+]
+}
--- a/lib/html5lib/tests/testdata/tokenizer/xmlViolation.test
+++ b/lib/html5lib/tests/testdata/tokenizer/xmlViolation.test
@ -0,0 +1,20 @@
+{"xmlViolationTests": [
+
+{"description":"Non-XML character",
+"input":"a\uFFFFb",
+"output":[["Character","a\uFFFDb"]]},
+
+{"description":"Non-XML space",
+"input":"a\u000Cb",
+"output":[["Character","a b"]]},
+
+{"description":"Double hyphen in comment",
+"input":"<!-- foo -- bar -->",
+"output":[["Comment"," foo - - bar "]]},
+
+{"description":"FF between attributes",
+"input":"<a b=''\u000Cc=''>",
+"output":[["StartTag","a",{"b":"","c":""}]]}
+]}
+
+