Researchers at the University of Cambridge have published a new technique for covertly inserting malicious code into source code that undergoes review. The attack method (CVE-2021-42574) is presented under the name Trojan Source and relies on constructing text that looks different to the compiler or interpreter than it does to the person viewing the code. Examples of the method are demonstrated for various compilers and interpreters for C, C++ (gcc and clang), C#, JavaScript (Node.js), Java (OpenJDK 16), Rust, Go and Python.
The method is based on placing special Unicode characters in code comments that change the display order of bidirectional text. With the help of such control characters, some parts of the text can be displayed from left to right, and others from right to left. In everyday practice, such control characters can be used, for example, to insert Hebrew or Arabic strings into a file with code. But if strings with different text directions are combined on one line using these characters, passages of text displayed from right to left can overlap the ordinary text already displayed from left to right.
Using this method, an attacker can add a malicious construct to the code and then make it invisible to anyone viewing the code. This is done by adding right-to-left characters to the following comment or inside a string literal, so that completely different characters are superimposed on the malicious insertion. Such code remains syntactically valid, but it is interpreted one way and displayed another.
While reviewing the code, the developer is shown the visual order of the characters and sees an innocuous-looking comment in a modern text editor, web interface or IDE. The compiler or interpreter, however, uses the logical order of the characters and processes the malicious insertion as it is, ignoring the bidirectional text in the comment. Various popular code editors (VS Code, Emacs, Atom) and interfaces for viewing code in repositories (GitHub, BitBucket) are affected.
There are several ways to use the method to carry out malicious actions: adding a hidden "return" statement, which terminates the function early; placing inside a comment expressions that are displayed as if they were valid code (for example, to disable important checks); and assigning different string values than the ones displayed, which causes string checks to fail.
For example, an attacker might suggest a change that includes the line:
if access_level != "user{U+202E} {U+2066}// Check if admin{U+2069} {U+2066}" {
which will be displayed in the review interface as
if access_level != "user" { // Check if admin
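The gap between the two views can be reproduced with a minimal Python sketch. It assumes the literal from the example above and writes the control characters as escape sequences so that they stay visible; the variable names are hypothetical and chosen only for illustration.

```python
# Bidirectional control characters from the example, written as escapes.
RLO = "\u202e"  # RIGHT-TO-LEFT OVERRIDE
LRI = "\u2066"  # LEFT-TO-RIGHT ISOLATE
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE

# The literal in logical order, i.e. what the compiler or interpreter reads.
hidden_literal = "user" + RLO + " " + LRI + "// Check if admin" + PDI + " " + LRI

access_level = "user"  # hypothetical value of an ordinary, non-admin account

# A reviewer believes the comparison is against "user" ...
print(access_level != "user")          # False
# ... but the literal actually contains the "comment" and the control characters.
print(access_level != hidden_literal)  # True

# ascii() escapes every non-ASCII character, exposing the hidden content.
print(ascii(hidden_literal))
```

Printing a suspicious line through `ascii()` or a hex dump is a quick way to see the logical order during review, since neither applies bidirectional reordering.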
Additionally, another attack variant (CVE-2021-42694) was proposed, involving homoglyphs: characters that look alike but differ in meaning and have different Unicode code points (for example, "ɑ" resembles "a", "ɡ" resembles "g", and "ɩ" resembles "l"). In some languages such characters can be used in the names of functions and variables to mislead developers. For example, two functions with visually indistinguishable names can be defined to perform different actions. Without detailed analysis, it is not immediately clear which of the two functions is called at a particular place.
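One rough way to surface such names is to list every identifier that contains a non-ASCII character together with the official Unicode name of that character. The sketch below is a naive text scan, not a parser: it also matches words inside strings and comments, and the function name `find_non_ascii_identifiers` is hypothetical.

```python
import re
import unicodedata

# Identifier-like tokens: a letter or underscore followed by word characters.
IDENTIFIER = re.compile(r"[^\W\d]\w*")


def find_non_ascii_identifiers(source):
    """Return (line_number, identifier, [(char, unicode_name), ...]) tuples."""
    findings = []
    for line_number, line in enumerate(source.splitlines(), start=1):
        for match in IDENTIFIER.finditer(line):
            name = match.group()
            odd = [(ch, unicodedata.name(ch, "UNKNOWN"))
                   for ch in name if ord(ch) > 127]
            if odd:
                findings.append((line_number, name, odd))
    return findings


# Two functions whose names look alike; the second uses U+0261 instead of "g".
sample = (
    "def log(msg):\n"
    "    pass\n"
    "def lo\u0261(msg):\n"
    "    pass\n"
)

for finding in find_non_ascii_identifiers(sample):
    print(finding)
```

A project that legitimately uses non-ASCII identifiers would need a more selective rule, for example a list of confusable characters, rather than a blanket ASCII check.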
As a protective measure, it is recommended that compilers, interpreters and build tools that support Unicode report an error or warning when comments, string literals or identifiers contain unpaired control characters that change the direction of output (U+202A, U+202B, U+202C, U+202D, U+202E, U+2066, U+2067, U+2068, U+2069, U+061C, U+200E and U+200F). Such characters should also be explicitly prohibited in programming language specifications and should be taken into account in code editors and interfaces for working with repositories.
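A check of this kind can be approximated outside the compiler. The sketch below scans source text line by line for the code points listed above and reports whether the opening and closing controls on each line are paired. It is a simplification made for illustration: embeddings/overrides and isolates are counted independently, which is looser than the full Unicode bidirectional algorithm, and the name `scan_source` is hypothetical.

```python
# Code points from the recommendation, grouped by role.
EMBED_OPENERS = {"\u202a", "\u202b", "\u202d", "\u202e"}  # LRE, RLE, LRO, RLO
EMBED_CLOSER = "\u202c"                                   # PDF
ISOLATE_OPENERS = {"\u2066", "\u2067", "\u2068"}          # LRI, RLI, FSI
ISOLATE_CLOSER = "\u2069"                                 # PDI
MARKS = {"\u061c", "\u200e", "\u200f"}                    # ALM, LRM, RLM


def scan_source(source):
    """Return (line_number, code_points, balanced) for lines with bidi controls."""
    findings = []
    for line_number, line in enumerate(source.splitlines(), start=1):
        seen = []
        embed_depth = 0
        isolate_depth = 0
        balanced = True
        for ch in line:
            if ch in EMBED_OPENERS:
                embed_depth += 1
            elif ch in ISOLATE_OPENERS:
                isolate_depth += 1
            elif ch == EMBED_CLOSER:
                if embed_depth == 0:
                    balanced = False  # closer without a matching opener
                else:
                    embed_depth -= 1
            elif ch == ISOLATE_CLOSER:
                if isolate_depth == 0:
                    balanced = False
                else:
                    isolate_depth -= 1
            elif ch not in MARKS:
                continue  # ordinary character, nothing to record
            seen.append(f"U+{ord(ch):04X}")
        if embed_depth or isolate_depth:
            balanced = False  # something was opened but never closed
        if seen:
            findings.append((line_number, seen, balanced))
    return findings


sample = 'if access_level != "user\u202e \u2066// Check if admin\u2069 \u2066" {\n'
for line_number, code_points, balanced in scan_source(sample):
    status = "paired" if balanced else "UNPAIRED"
    print(f"line {line_number}: {', '.join(code_points)} ({status})")
```

Run as a pre-commit hook or CI step, such a scan could flag the example line from this article, which leaves one override and one isolate unclosed.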