CAPEC-80: Using UTF-8 Encoding to Bypass Validation Logic
This attack is a specific variation on leveraging alternate encodings to bypass validation logic. This attack leverages the possibility to encode potentially harmful input in UTF-8 and submit it to applications not expecting or effective at validating this encoding standard making input filtering difficult. UTF-8 (8-bit UCS/Unicode Transformation Format) is a variable-length character encoding for Unicode. Legal UTF-8 characters are one to four bytes long. However, early version of the UTF-8 specification got some entries wrong (in some cases it permitted overlong characters). UTF-8 encoders are supposed to use the "shortest possible" encoding, but naive decoders may accept encodings that are longer than necessary. According to the RFC 3629, a particularly subtle form of this attack can be carried out against a parser which performs security-critical validity checks against the UTF-8 encoded form of its input, but interprets certain illegal octet sequences as characters.
Weaknesses
# ID | Name | Type |
---|---|---|
CWE-20 | Improper Input Validation | weakness |
CWE-73 | External Control of File Name or Path | weakness |
CWE-74 | Improper Neutralization of Special Elements in Output Used by a Downstream Component ('Injection') | weakness |
CWE-172 | Encoding Error | weakness |
CWE-173 | Improper Handling of Alternate Encoding | weakness |
CWE-180 | Incorrect Behavior Order: Validate Before Canonicalize | weakness |
CWE-181 | Incorrect Behavior Order: Validate Before Filter | weakness |
CWE-692 | Incomplete Denylist to Cross-Site Scripting | weakness |
CWE-697 | Incorrect Comparison | weakness |