Intel has published developments from ControlFlag, a research project aimed at creating a machine learning system for improving code quality. Using a model trained on a large body of existing code, the project's toolkit identifies errors and anomalies in source code written in high-level languages such as C/C++. The system is suitable for detecting a range of problems, from typos and incorrect type combinations to missing NULL checks on pointers and memory problems. The ControlFlag code is written in C++ and is open source under the MIT license.
The system is self-trained: it builds a statistical model from the existing body of open source code published on GitHub and similar public repositories. During training, the system identifies typical patterns of code constructs and builds a syntax tree of the relationships between these patterns, reflecting the flow of execution in the program. The result is a reference decision tree that combines the development experience of all the analyzed source code.
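The training stage can be illustrated with a minimal sketch. It is not ControlFlag's implementation: it works on flat token strings rather than a syntax tree, and the names `extract_if_conditions`, `abstract` and `train`, the placeholder tokens `VAR` and `NUM`, and the three-line corpus are all assumptions made for illustration.

```python
import re
from collections import Counter

# Tokens kept verbatim instead of being turned into VAR (an assumption of this sketch).
KEEP_VERBATIM = {"NULL", "sizeof"}

# Numbers, identifiers, a few multi-character operators, then any single symbol.
TOKEN_RE = re.compile(r"\s*(\d+|[A-Za-z_]\w*|==|!=|<=|>=|&&|\|\||->|[^\s\w])")


def extract_if_conditions(source):
    """Return the text inside each `if (...)`, honouring nested parentheses."""
    conditions = []
    for match in re.finditer(r"\bif\s*\(", source):
        depth, start = 1, match.end()
        for pos in range(start, len(source)):
            if source[pos] == "(":
                depth += 1
            elif source[pos] == ")":
                depth -= 1
                if depth == 0:
                    conditions.append(source[start:pos])
                    break
    return conditions


def abstract(condition):
    """Replace concrete identifiers and integer literals with placeholders."""
    out = []
    for tok in TOKEN_RE.findall(condition):
        if tok.isdigit():
            out.append("NUM")
        elif re.fullmatch(r"[A-Za-z_]\w*", tok) and tok not in KEEP_VERBATIM:
            out.append("VAR")
        else:
            out.append(tok)
    return " ".join(out)


def train(sources):
    """Count how often each abstracted pattern occurs across all sources."""
    counts = Counter()
    for source in sources:
        for condition in extract_if_conditions(source):
            counts[abstract(condition)] += 1
    return counts


# Hypothetical miniature corpus standing in for code mined from public repositories.
corpus = [
    "if (a == 1) f(); if (count == 42) g();",
    "if (p == NULL) return; if (q == NULL) return;",
    "if ((s1 == NULL) || (s2 == NULL)) return 0;",
]
model = train(corpus)
```

Here `a == 1` and `count == 42` collapse into the same pattern, `VAR == NUM`, which is what lets frequencies accumulate across unrelated projects.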
The same pattern extraction is performed on the code under test, and the resulting patterns are checked against the reference decision tree. Large discrepancies with adjacent branches indicate an anomaly in the pattern being checked. The system can not only identify an error in a pattern but also suggest a fix. For example, in the OpenSSL code the construct “(s1 == NULL) ^ (s2 == NULL)” was detected, which appeared in the syntax tree only 8 times, while the closest branch, “(s1 == NULL) || (s2 == NULL)”, occurred about 7 thousand times. The system also detected the anomaly “(s1 == NULL) | (s2 == NULL)”, which occurred 32 times in the tree.
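The comparison with neighbouring branches can be sketched as a nearest-neighbour lookup over pattern frequencies. The counts below are the ones quoted for the OpenSSL example, with 7000 standing in for "about 7 thousand". The distance measure, the thresholds `max_distance` and `min_ratio`, and the function names are assumptions of this sketch and are not taken from ControlFlag.

```python
# Occurrence counts quoted in the text; 7000 approximates "about 7 thousand".
PATTERN_COUNTS = {
    "(s1 == NULL) || (s2 == NULL)": 7000,
    "(s1 == NULL) | (s2 == NULL)": 32,
    "(s1 == NULL) ^ (s2 == NULL)": 8,
}


def token_distance(a, b):
    """Levenshtein distance computed over whitespace-separated tokens."""
    x, y = a.split(), b.split()
    prev = list(range(len(y) + 1))
    for i, tx in enumerate(x, 1):
        cur = [i]
        for j, ty in enumerate(y, 1):
            cur.append(min(prev[j] + 1,                # delete a token from x
                           cur[j - 1] + 1,             # insert a token from y
                           prev[j - 1] + (tx != ty)))  # substitute or match
        prev = cur
    return prev[-1]


def check(pattern, counts, max_distance=1, min_ratio=100):
    """Return (is_anomaly, suggested_pattern).

    A pattern is flagged when a pattern within `max_distance` token edits
    is at least `min_ratio` times more frequent (both thresholds are
    illustrative assumptions).
    """
    own = counts.get(pattern, 0)
    neighbours = [
        (count, other)
        for other, count in counts.items()
        if other != pattern and token_distance(pattern, other) <= max_distance
    ]
    if not neighbours:
        return False, None
    best_count, best = max(neighbours)
    if best_count >= min_ratio * max(own, 1):
        return True, best
    return False, None


xor_result = check("(s1 == NULL) ^ (s2 == NULL)", PATTERN_COUNTS)
bitor_result = check("(s1 == NULL) | (s2 == NULL)", PATTERN_COUNTS)
or_result = check("(s1 == NULL) || (s2 == NULL)", PATTERN_COUNTS)
```

With these thresholds both rare variants are flagged and the frequent `||` form is returned as the suggested fix, while the `||` form itself passes.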
When analyzing the code snippet “if (x = 7) y = x;”, the system determined that the “variable == number” construct is normally used in an “if” statement to compare numeric values, so “variable = number” in an “if” statement is most likely a typo. Traditional static analyzers would also catch this error, but unlike them ControlFlag does not apply ready-made rules, which can hardly anticipate every possible case; instead it relies on statistics about how constructs are used across a large number of projects.
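The difference between a ready-made rule and a statistical check can be shown in a small sketch. The frequency table `TOY_COUNTS` is invented purely for illustration and is not measured data. The `rare_share` threshold and the function names are likewise assumptions.

```python
import re

# Invented toy frequencies of abstracted `if` conditions (illustration only).
TOY_COUNTS = {
    "VAR == NUM": 5000,
    "VAR != NUM": 1200,
    "VAR = NUM": 3,
    "VAR == TRUE": 2,
}


def rule_based_check(condition):
    """Hand-written rule: flag a lone `=` that is not part of ==, !=, <= or >=."""
    return re.search(r"(?<![=!<>])=(?!=)", condition) is not None


def abstract(condition):
    """Map integer literals to NUM and identifiers (except TRUE) to VAR."""
    condition = re.sub(r"\b\d+\b", "NUM", condition)
    return re.sub(r"\b(?!NUM\b|TRUE\b)[A-Za-z_]\w*\b", "VAR", condition)


def statistical_check(condition, counts, rare_share=0.01):
    """Flag a condition whose abstracted pattern is rare in the table."""
    total = sum(counts.values())
    return counts.get(abstract(condition), 0) / total < rare_share


typo_by_rule = rule_based_check("x = 7")
typo_by_stats = statistical_check("x = 7", TOY_COUNTS)
ok_by_stats = statistical_check("x == 7", TOY_COUNTS)
bool_by_rule = rule_based_check("done == TRUE")
bool_by_stats = statistical_check("done == TRUE", TOY_COUNTS)
```

Both checks flag `x = 7`. Only the statistical check flags `done == TRUE`, because nobody wrote a rule for it. The trade-off is that a pattern the table has never seen is also reported, so the quality of the result depends on the training corpus.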
In an experiment, running ControlFlag on the source code of the cURL utility, which is often cited as an example of high-quality, well-tested code, revealed an error that static analyzers had missed: the structure member “s->keepon”, which has a numeric type, was compared with the boolean value TRUE. In the OpenSSL code, in addition to the aforementioned problem with “(s1 == NULL) ^ (s2 == NULL)”, anomalies were also detected in the expressions “(-2 == rv)” (the minus was a typo) and “BIO_puts(bp, ":") <= 0” (in the context of checking that the function completed successfully, it should have been “== 0”). It is also reported that ControlFlag identified several hundred errors in unnamed proprietary software that led to crashes and memory problems.