CAPEC-80: Using UTF-8 Encoding to Bypass Validation Logic

ID CAPEC-80

Typical Severity High

Likelihood Of Attack High

Status Draft

This attack is a specific variation on leveraging alternate encodings to bypass validation logic. The attacker encodes potentially harmful input in UTF-8 and submits it to an application that does not expect this encoding or does not validate it effectively, which makes input filtering difficult. UTF-8 (8-bit UCS/Unicode Transformation Format) is a variable-length character encoding for Unicode. Legal UTF-8 characters are one to four bytes long. However, early versions of the UTF-8 specification got some entries wrong (in some cases they permitted overlong characters). UTF-8 encoders are supposed to use the "shortest possible" encoding, but naive decoders may accept encodings that are longer than necessary. According to RFC 3629, a particularly subtle form of this attack can be carried out against a parser that performs security-critical validity checks against the UTF-8 encoded form of its input but interprets certain illegal octet sequences as characters. For example, such a parser might prohibit the NUL character encoded as the single octet 00 yet accept the illegal two-octet sequence C0 80 and interpret it as NUL.
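The overlong-encoding problem described above can be illustrated with a minimal Python sketch. It is not taken from the CAPEC entry: the names naive_decode, strict_decode_or_none, and PAYLOAD are hypothetical, and the lax decoder is written for this illustration only to mimic a decoder that skips the shortest-form check. The payload encodes "/" (U+002F) as the overlong two-byte sequence C0 AF, so a byte-level filter that looks for the single byte 2F finds nothing, while the lax decoder still produces a path separator.

```python
from typing import Optional

# Hypothetical payload: "/" (U+002F) written as the overlong sequence C0 AF.
PAYLOAD = b"..\xc0\xafetc"


def naive_decode(data: bytes) -> str:
    """Deliberately lax UTF-8 decoder (illustration only).

    It checks lead and continuation byte patterns but never verifies that
    the shortest possible encoding was used, so overlong forms are accepted.
    """
    out = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            cp, n = lead, 0            # 1-byte sequence (ASCII)
        elif lead & 0xE0 == 0xC0:
            cp, n = lead & 0x1F, 1     # 2-byte sequence
        elif lead & 0xF0 == 0xE0:
            cp, n = lead & 0x0F, 2     # 3-byte sequence
        elif lead & 0xF8 == 0xF0:
            cp, n = lead & 0x07, 3     # 4-byte sequence
        else:
            raise ValueError("invalid lead byte")
        cont = data[i + 1 : i + 1 + n]
        if len(cont) != n or any(c & 0xC0 != 0x80 for c in cont):
            raise ValueError("invalid continuation byte")
        for c in cont:
            cp = (cp << 6) | (c & 0x3F)
        out.append(chr(cp))
        i += 1 + n
    return "".join(out)


def strict_decode_or_none(data: bytes) -> Optional[str]:
    """Decode with Python's built-in strict UTF-8 codec.

    Returns None when the input is not well-formed UTF-8; the built-in
    codec rejects overlong forms such as C0 AF.
    """
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return None
```

In this sketch, a check such as b"/" in PAYLOAD is false, yet naive_decode(PAYLOAD) yields a string containing "/", which is the mismatch between validation and interpretation that the attack relies on. One way to avoid it, under the assumptions above, is to decode with a strict decoder first, reject input that fails to decode, and only then validate the decoded text.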

https://capec.mitre.org/data/definitions/80.html

Weaknesses

# ID	Name	Type
CWE-20	Improper Input Validation	weakness
CWE-73	External Control of File Name or Path	weakness
CWE-74	Improper Neutralization of Special Elements in Output Used by a Downstream Component ('Injection')	weakness
CWE-172	Encoding Error	weakness
CWE-173	Improper Handling of Alternate Encoding	weakness
CWE-180	Incorrect Behavior Order: Validate Before Canonicalize	weakness
CWE-181	Incorrect Behavior Order: Validate Before Filter	weakness
CWE-692	Incomplete Denylist to Cross-Site Scripting	weakness
CWE-697	Incorrect Comparison	weakness