Getting Started with Semgrep and Finding Vulnerabilities

Introduction

Semgrep is a static analysis tool popular with developers and the AppSec community for its ease of use. It supports most popular programming languages, and its rules are easy to write. This makes it a practical tool for AppSec teams to run in their pipelines: it analyzes the source code and helps keep vulnerabilities out of the codebase.

Semgrep offers several advantages over other open-source SAST tools:

It is open source.
It lets us write custom rules for detecting vulnerabilities, so we can run context-specific scans on our code. It also provides a public registry of rules that can be used as they are.
It is fast, which makes it well suited to a DevOps pipeline.
It emits well-formatted, stable JSON output.
It is lightweight and has an easy-to-install binary. It can also be run using Docker.
It supports Python, JavaScript, Java, Go, C, and JSON, among other languages.
Any developer can use it; it does not require security expertise.

Here is what you will learn from this post:

The basics of Semgrep.
How to install and run Semgrep.
How to use Semgrep rules to find vulnerabilities in source code.
How to create custom rules for Semgrep.

To see the tool in action, we will read and write Semgrep rules for Node.js and use them to identify vulnerabilities in a vulnerable application's code. So, let’s get started!

We can try Semgrep through the CLI, the Docker image, or the live web interface.

In this post, we will focus on the CLI.

CLI

Install the tool using one of the following options, depending on your system and preference.

For macOS:

    brew install semgrep

For Ubuntu, Windows through Windows Subsystem for Linux (WSL), Linux, or macOS:

    python3 -m pip install semgrep

To try Semgrep without installing it, run it through Docker:

    docker run --rm -v "${PWD}:/src" returntocorp/semgrep semgrep --config=auto

Now let’s see how we can run a ruleset against source code.

Semgrep rules are defined in a YAML file. This file is run against the source code to find code that matches its patterns.

Here is a sample rule file.

    rules:
      - id: detect-child-process
        metadata:
          cwe: "CWE-78: Improper Neutralization of Special Elements used in an OS Command ('OS Command Injection')"
          owasp: "A1: Injection"
          source-rule-url: https://github.com/nodesecurity/eslint-plugin-security/blob/master/rules/detect-child-process.js
          category: security
          technology:
            - javascript
          license: Commons Clause License Condition v1.0[LGPL-2.1-only]
        # The message refers to $CMD, the metavariable bound by the patterns below.
        message: Detected a call to exec() or execSync() with the argument $CMD. This could lead to a command
          injection vulnerability.
        patterns:
          - pattern-either:
              - pattern: exec($CMD,...)
              - pattern: execSync($CMD,...)
        severity: ERROR
        languages:
          - javascript

Understanding Different Schema

Let’s go through the fields used in this rule.

ID – A unique, descriptive identifier for the rule. In the example, the id is ‘detect-child-process’.

Metadata – An optional field that holds arbitrary custom data about the rule. In the example, it provides details about the vulnerability, such as the CWE and the OWASP category.

Message – The text shown when the rule matches, explaining why it was triggered. In the example, it states that the issue may lead to command injection.

Severity – The severity of a match. It can be INFO, WARNING, or ERROR.

Languages – The languages the rule applies to. In the example, this is javascript.

Pattern – The pattern operator holds a single expression to look for in the code. The example has more than one expression, so it uses ‘patterns’ instead of a single ‘pattern’.

Patterns – The patterns operator combines multiple expressions with a logical AND. It is useful when all of the expressions must match.

pattern-either – Similar to patterns, but it combines the expressions with a logical OR. It is useful when a match on any one of the expressions is enough.

These are the most common fields and operators used to create rules. There are many others, which are described in the Semgrep rule syntax documentation.

To run this rule against code, we first need to save it to a YAML file. Let’s say the file is ‘rule.yaml’.

To run the rule, use the following command from the directory you want to scan.

    semgrep --config path/to/rule.yaml
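
Since Semgrep can emit JSON (by adding the ‘--json’ flag to the command), the findings can be post-processed in a pipeline. The following is a minimal Node.js sketch of that idea, not an official integration. It assumes the report has a top-level ‘results’ array whose entries carry ‘check_id’, ‘path’, ‘start.line’, and ‘extra.severity’. The function name summarizeFindings and the sample report are hypothetical and exist only for illustration.

```javascript
// Minimal sketch: group Semgrep findings by severity.
// Assumed report shape:
//   { results: [ { check_id, path, start: { line }, extra: { severity, message } } ] }
// summarizeFindings and sampleReport are illustrative names, not part of Semgrep.
function summarizeFindings(report) {
  const bySeverity = {};
  for (const finding of report.results || []) {
    const severity = (finding.extra && finding.extra.severity) || 'UNKNOWN';
    if (!bySeverity[severity]) {
      bySeverity[severity] = [];
    }
    // One line per finding: file, line number, and the id of the rule that matched.
    bySeverity[severity].push(`${finding.path}:${finding.start.line} ${finding.check_id}`);
  }
  return bySeverity;
}

// Hand-written sample standing in for a real report.
const sampleReport = {
  results: [
    {
      check_id: 'detect-child-process',
      path: 'app.js',
      start: { line: 12 },
      extra: { severity: 'ERROR', message: 'Possible command injection.' },
    },
    {
      check_id: 'express_xss',
      path: 'routes.js',
      start: { line: 8 },
      extra: { severity: 'WARNING', message: 'Possible XSS.' },
    },
  ],
};

const summary = summarizeFindings(sampleReport);
console.log(summary);
```

In a pipeline, a summary like this could be used to fail the build only when findings of a chosen severity are present.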

Writing Rules

Now let’s look at how we can write custom rules.

Before that, let’s learn the two major operators used when writing patterns. These operators are independent of the language and work the same way for all supported programming languages.

Ellipsis (...) – This operator acts as a filler. It matches a sequence of zero or more items, such as function arguments or statements.

Metavariables ($VARIABLE) – These act as placeholders that match a piece of code whose exact value is not known in advance, such as a variable name or an argument. Whatever a metavariable matches is bound to its name, which is written in capitals after a $ sign.
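
To make the two operators concrete, here is a small JavaScript target file annotated with how the patterns ‘exec(...)’ and ‘exec($CMD, ...)’ would be expected to treat each line. This is only a sketch: the exec function is a local stub (a stand-in for child_process.exec) so that the sample runs without spawning any process, and recordedCommands is an illustrative name.

```javascript
// Stub standing in for child_process.exec so this sample runs without
// spawning processes. The stub and recordedCommands are illustrative only.
const recordedCommands = [];
function exec(cmd, options) {
  recordedCommands.push(cmd);
}

const userInput = 'ls -la';

// Matched by exec(...); exec($CMD, ...) also matches, binding $CMD to 'ls'.
exec('ls');

// Matched by both; $CMD is bound to userInput and ... absorbs the options object.
exec(userInput, { cwd: '/tmp' });

// exec(userInput);              <- inside a comment, so not matched
console.log('exec(userInput)');  // inside a string literal, so not matched

console.log(recordedCommands);
```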

Visit the [Semgrep Playground](https://semgrep.dev/editor) to experiment with Semgrep, and follow along with the demos below to try out a few of its features.

Let’s try some simple patterns first. We will look into identifying function calls from a code snippet.

code:

    import exec as safe_function
    safe_function(user_input)

    exec("ls")

    exec(some_var)

    some_exec(foo)

    exec (foo)

    exec (
        bar )

    #exec(foo)

    print("exec(bar)")

Fig 1. Screenshot showing the Semgrep playground for testing the rules.

This one is simple: the pattern ‘exec(...)’ finds every call to exec() in the code snippet. Note that:

Semgrep identifies the function call despite differences in spacing and line breaks, and despite aliases such as safe_function.
Semgrep ignores comments and string literals.

Now let’s try to do it for XSS. Consider the given code snippet.

code:

    var express = require('express');
    var app = express();

    // ruleid:express_xss
    app.get('/foo', function (req, res) {
        var resp = req.query.name;

        res.write(resp);
    });

    // ruleid:express_xss
    app.get('/bar', function (req, res) {
        var foo = req.query.email;
        console.log("foo route");
        res.write('Response</br>' + foo);
    });

Looking at the code, we can see that the user-supplied query parameters are written to the response without any filtering or encoding, so both routes are vulnerable to an XSS attack.
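
To illustrate what is missing from these handlers, the following is a minimal sketch of HTML-encoding a value before it is written to a response. The escapeHtml helper is hypothetical and is not part of Express; in a real application, a vetted encoding library or a templating engine that escapes output would be a safer choice.

```javascript
// Hypothetical helper: replace the characters that are significant in HTML
// with their entity equivalents so the browser treats the value as text.
function escapeHtml(value) {
  const replacements = {
    '&': '&amp;',
    '<': '&lt;',
    '>': '&gt;',
    '"': '&quot;',
    "'": '&#39;',
  };
  return String(value).replace(/[&<>"']/g, (ch) => replacements[ch]);
}

// A typical XSS probe is rendered harmless once encoded.
const payload = '<script>alert(1)</script>';
console.log(escapeHtml(payload)); // prints &lt;script&gt;alert(1)&lt;/script&gt;
```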

The advantage of Semgrep is that we can search for any construct in the selected language. To begin creating a rule, we can simply copy the original code and then abstract it.
We can replace ‘/foo’ with ‘...’, since the endpoint can be any string.
We can represent the variables with metavariables by replacing each name with $ followed by a name in capitals. In the snippet above, we can replace ‘name’ in req.query.name with $NAME and the ‘resp’ variable with $RESP, each of which stands for an unknown name.

The pattern will look like this:

    app.get('...', function (req, res) {
        var $RESP = req.query.$NAME;

        res.write($RESP);
    });

Run this against the code and see the result. Do you think it will find both XSS issues in the code? No, it identifies only the first one.

Fig 2. Screenshot showing that the rule only found one XSS in the code.

If you compare the two routes, the second one has an extra line of code between the assignment and the write, and it passes the variable to res.write() as part of a larger expression. To handle this, we can use the ellipsis and the deep expression operator (<... ...>).

The updated pattern will be:

    app.get('...', function (req, res) {
        var $RESP = req.query.$NAME;
        ...
        res.write(<... $RESP ...>);
    });

The extra ‘...’ says that there may be more lines of code at that point. The deep expression operator <... $RESP ...> matches $RESP anywhere inside the argument, even when it is nested in a larger expression such as a string concatenation.

This pattern identifies both XSS issues.

Fig 3. Screenshot showing the modified rule which found both of the XSS.

Rulesets

While we can write our own rules, Semgrep also has over 1000 existing rules that have been reviewed by the company. It’s easy to get to these rulesets:

The Semgrep Explore page lists predefined rulesets, and its search option makes them easy to browse.
After you’ve chosen a ruleset, look for a section marked “Test and Run Locally” in the top right corner, which contains a command you can copy.
We can append ‘--timeout 0 /directoryToTest’ to the end of this command, and the ruleset will be run against the given directory.

This post covered the common things you need to know about Semgrep before using it. Several more advanced rule syntaxes are described in the Semgrep documentation.

Final Thoughts:

Semgrep is a powerful tool that every AppSec team should have in its arsenal. It goes beyond regular expressions because it matches code structure rather than raw text, which makes creating rules and identifying issues in code easy and efficient. Its speed is also commendable, which helps when integrating it into DevOps pipelines.
