Find and fix Python security issues with QL

This is one of the few posts I wrote in 2019 while working for Semmle (later acquired by GitHub). It was originally published on the Semmle blog, which has since been transformed. The Python library for QL has changed quite a bit since then; however, many of the principles are still relevant and helpful for anyone who wants to learn more about QL. See CodeQL for Python to learn more.

Overview¶

In this blog post, we’ll take a look at some security concerns that are particularly relevant to Python developers. There are already queries for some of these issues, and we’ll write new custom queries for the others. You can execute any query in this post against your own Python project.

When you’re writing code, it is very easy to accidentally introduce errors or vulnerabilities. On top of that, you need to be aware of any existing bugs in the implementation of the language you’re working in, which adds an additional burden. For instance, CPython developers may need to review security vulnerabilities present in the Python version they use.

Running static analysis on the source code can help you find code that would produce an incorrect result, open up hardware or software resources for malicious use, or cause a program to fail unexpectedly. Fixing those issues will make the program more secure. To learn about the Python security model, bytecode safety, and some typical security concerns, visit the Python Security resource, which has an excellent set of reference resources and further reading.

Unfortunately for anyone maintaining a legacy Python 2 codebase, and pre-2.7 versions in particular, quite a few bugs and some security issues have been addressed only in Python 3. Upgrading the code to the latest version of Python 3 is therefore very often the only option if you want to keep your code secure. Although Python 3 is more secure than Python 2, you still can’t fully relax, because even the most recent versions, such as Python 3.6, 3.7, and 3.8, suffer from security vulnerabilities. You can review the current security-related issues using the Python bug tracker. There you will find many bugs with a CVE number assigned, such as CVE-2018-1000030 (Python 2.7 readahead feature of file objects is not thread safe) or CVE-2013-4238 (SSL module fails to handle NULL bytes inside subjectAltNames general names), to mention just a few.

Here we categorize Python security concerns into two groups:

Issues in the Python interpreter or standard library written by Python core developers and contributors
Issues in Python user code written by developers writing Python programs.

Issues in CPython source code¶

If you are a developer writing your programs in Python, you have very little control over the source code of CPython. You could, of course, make the necessary changes to the source code and compile your own Python interpreter; however, only a few developers would find this practical.

As an example, the urllib module didn’t parse passwords containing the # character correctly. This bug was fixed in the most recent version of Python 3 and also backported to previous versions. However, a few bugs were fixed only in certain versions of Python and were not backported. For example, the bug Hash function is not randomized properly was fixed only in Python 3.4.0, which means that earlier versions, such as Python 3.3 and Python 2.7, are still vulnerable. This puts developers who cannot upgrade to the latest Python interpreter in a difficult situation, because they cannot take advantage of the latest security-related fixes.

Semmle’s continuous security analysis service, LGTM.com, includes the CPython project, analyzing both the C and Python source code. If you develop security-sensitive applications, you should review the security-related alerts that are highlighted in the latest code. For example, the following alerts were found by queries that focus on potential vulnerabilities: CPython’s alert page on LGTM.com.

Issues in your own Python programs¶

In contrast, when you write your own Python programs, it’s often the choices that you make as you implement features that determine the security of your program. In the rest of this post, we look at some of the issues that can make your programs less secure, and provide guidelines on how to avoid these common pitfalls. We will also share built-in queries and custom queries that you can use to find security-related issues in your code.

Inadequate DSA and RSA key length¶

The paper Transitioning the Use of Cryptographic Algorithms and Key Lengths, published by the NIST Computer Security Resource Center, suggests using a key size of 2048 bits or larger for the RSA and DSA algorithms. The Python cryptography package provides tools for working with private keys, and its key generation functions take a key_size parameter. See the Python code snippet in the docs for details. From the docs page:

key_size (int) – The length of the modulus in bits. It should be either 1024, 2048 or 3072. For keys generated in 2015 this should be at least 2048. Note that some applications (such as SSH) have not yet gained support for larger key sizes specified in FIPS 186-3 and are still restricted to only the 1024-bit keys specified in FIPS 186-2.

There is a built-in query, Use of weak cryptographic key, that highlights when values smaller than 2048 are passed to the key_size parameter.

For example, the query would report an alert for this Python code:

from cryptography.hazmat.primitives.asymmetric import dsa
from cryptography.hazmat.backends import default_backend
private_key = dsa.generate_private_key(
    key_size=512,
    backend=default_backend()
)

The query also identifies inadequate key lengths in code that uses the Crypto and Cryptodome Python packages. You can set a different minimum key length by editing the query and changing the result of the minimumSecureKeySize predicate, which is currently set to 2048 for both the DSA and RSA algorithms:

int minimumSecureKeySize(string algo) {
    algo = "DSA" and result = 2048
    or
    algo = "RSA" and result = 2048
    or
    algo = "ECC" and result = 224
}
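The same threshold lookup can be sketched in plain Python, for example as a pre-flight check in a script that generates keys. This is only an illustration: the names MINIMUM_SECURE_KEY_SIZE and is_weak_key are hypothetical and belong neither to the cryptography package nor to the QL library; the thresholds are copied from the predicate above.

```python
# Hypothetical helper mirroring the thresholds of the QL predicate above.
MINIMUM_SECURE_KEY_SIZE = {"DSA": 2048, "RSA": 2048, "ECC": 224}


def is_weak_key(algo: str, key_size: int) -> bool:
    """Return True if key_size (in bits) is below the minimum for algo."""
    try:
        minimum = MINIMUM_SECURE_KEY_SIZE[algo.upper()]
    except KeyError:
        raise ValueError(f"unknown algorithm: {algo!r}") from None
    return key_size < minimum


if __name__ == "__main__":
    print(is_weak_key("DSA", 512))   # the key size used in the snippet above
    print(is_weak_key("RSA", 2048))
```

Keeping the thresholds in one table means that raising the minimum later is a one-line change, which is the same design the QL predicate follows.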

Using the deprecated ‘pyCrypto’ package¶

PyCrypto is a mature Python cryptography toolkit that has gained popularity over the years. However, the package has quite a few issues, some of them affecting security, and the project was last updated over five years ago. One of those issues, AES.new with invalid parameter crashes python, is actually an exploitable vulnerability, CVE-2013-7459.

The current recommendation is to use some other Python package. For instance, cryptography is a popular choice for many Python developers:

* paramiko, one of the popular native Python SSHv2 protocol libraries, has switched to cryptography from pyCrypto; see this pull request for details.
* twisted, a popular event-driven networking engine, has switched to cryptography from pyCrypto as well; see this pull request for details.

To check if there are any places where the pyCrypto package is imported and used, as in this Python code snippet:

from Crypto.Hash import SHA256
val = SHA256.new('abc'.encode('utf-8')).hexdigest()

we could write the following custom query:

/**
 * @name Using a deprecated pyCrypto package
 * @description Using an unmaintained tool kit with multiple security
 *              issues makes your code vulnerable to attack.
 * @kind problem
 * @tags security
 * @problem.severity error
 * @id py/using-insecure-pycrypto-package
 */

import python

from ImportExpr imp, Stmt s, Expr e, string moduleName
where
  moduleName = imp.getName() and
  s.getASubExpression() = e and
  (e = imp or e.contains(imp)) and
  (moduleName.matches("Crypto") or moduleName.matches("Crypto.%"))
select imp, "pyCrypto package has multiple security issues"

If your project is on LGTM.com, you can set up automated code review and add this query to your repository to ensure that you never accidentally introduce uses of the pyCrypto package.
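If you want a quick local check without QL, a rough approximation of the same idea can be sketched with Python’s standard ast module. The function name find_pycrypto_imports is hypothetical, and the sketch is far less thorough than the query: it inspects a single source string and only sees plain import statements.

```python
import ast


def find_pycrypto_imports(source: str) -> list:
    """Return sorted (line, module) pairs for imports of Crypto or Crypto.*."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            names = [node.module]
        else:
            continue
        for name in names:
            # Same matching rule as the query: "Crypto" or "Crypto.<anything>".
            if name == "Crypto" or name.startswith("Crypto."):
                hits.append((node.lineno, name))
    return sorted(hits)


if __name__ == "__main__":
    sample = (
        "from Crypto.Hash import SHA256\n"
        "import Cryptodome.Cipher\n"
        "import os\n"
    )
    print(find_pycrypto_imports(sample))
```

Note that Cryptodome is deliberately not matched, because the prefix test requires either the exact name Crypto or a dot right after it.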

Binding to all IP addresses with the ‘socket’ module¶

When you’re using the built-in socket module (for instance, to build a message sender service), it’s possible to bind to all available IPv4 addresses by specifying 0.0.0.0 as the IP address. When you do this, you essentially allow the service to accept connections from any IPv4 address that can reach the host through routing. Note that an empty string '' has the same effect as 0.0.0.0. Opening up your endpoint to all network interfaces is considered to be insecure.

For example:

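import socket
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# Alternative bind calls shown side by side; a socket can be bound only once.
s.bind(('0.0.0.0', 6080))      # all IPv4 interfaces
s.bind(('192.168.0.1', 4040))  # one specific interface
s.bind(('', 8888))             # empty string, same effect as 0.0.0.0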

From the Python socket documentation:

A pair (host, port) is used for the AF_INET address family, where host is a string representing either a hostname in Internet domain notation like ‘daring.cwi.nl’ or an IPv4 address like ‘100.50.200.5’, and port is an integer.

The following custom query would find these insecure bindings:

/**
 * @name Binding a socket to all network interfaces
 * @description Binding a socket to all interfaces would
 *              open up traffic from any IPv4 address
 *              and is therefore associated with security risks.
 * @kind problem
 * @tags security
 * @problem.severity error
 * @id py/bind-socket-all-network-interfaces
 */

import python

Value aSocket() { result.getClass() = Value::named("socket.socket") }

CallNode socketBindCall() {
  result = aSocket().attr("bind").(CallableValue).getACall()
}

string allInterfaces() { result = "0.0.0.0" or result = "" }

from CallNode call, string address
where
  call = socketBindCall() and
  address = call.getArg(0).getNode().(Tuple).getElt(0).(StrConst).getText() and
  address = allInterfaces()
select call.getNode(), "'" + address + "' binds a socket to all interfaces."
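For a service that only needs to be reachable from the same machine, one way to avoid the alert is to bind to the loopback address. The sketch below assumes that loopback-only access is acceptable for your use case; the helper name open_loopback_listener is hypothetical.

```python
import socket


def open_loopback_listener(port: int = 0) -> socket.socket:
    """Create a TCP listener that is reachable only from the local machine."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        # 127.0.0.1 is the loopback interface; port 0 lets the OS pick a free port.
        sock.bind(("127.0.0.1", port))
        sock.listen()
    except OSError:
        sock.close()
        raise
    return sock


if __name__ == "__main__":
    with open_loopback_listener() as listener:
        host, port = listener.getsockname()
        print(f"listening on {host}:{port}")
```

If the service must be reachable from other hosts, binding to the address of one specific interface, rather than 0.0.0.0 or the empty string, keeps the exposure explicit.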

Using insecure SSL versions¶

There have been quite a few security changes in Python 3’s built-in ssl module. This is particularly true for versions 3.6 and 3.7. Visit Python SSL and TLS security to learn about the evolution of the ssl module. SSL versions 2 and 3 are now considered to be insecure, and the official Python documentation discourages their use. Since Python 3.6, many protocol versions, such as ssl.PROTOCOL_SSLv23 and ssl.PROTOCOL_SSLv2, are deprecated, and OpenSSL has removed support for SSLv2.

From Python 3.6 onward, it is best to use the ssl.PROTOCOL_TLS protocol. From the docs page:

ssl.PROTOCOL_TLS: Selects the highest protocol version that both the client and server support.

Although you can specify the SSL version in an ssl.wrap_socket call, that function was deprecated in Python 3.7. Instead, the Python docs suggest a more secure alternative:

Since Python 3.2 and 2.7.9, it is recommended to use the SSLContext.wrap_socket() instead of wrap_socket(). The top-level function is limited and creates an insecure client socket without server name indication or hostname matching.

There is a built-in query, Default version of SSL/TLS may be insecure, which finds calls that leave the protocol version unspecified, such as calls to the deprecated ssl.wrap_socket(). For earlier versions of Python, you want to make sure that you’re not using insecure versions of SSL such as ssl.PROTOCOL_SSLv2 or ssl.PROTOCOL_SSLv3. For this, there is another built-in query, Use of insecure SSL/TLS version, which finds insecure SSL/TLS versions both for pyOpenSSL.SSL (a Python wrapper around the OpenSSL library) and for the built-in ssl module.
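As an illustration of the SSLContext-based approach recommended by the docs, here is a minimal sketch of a client-side context. It assumes Python 3.7 or later (for ssl.TLSVersion) and assumes that TLS 1.2 is an acceptable minimum for the servers you talk to; the function name make_client_context is hypothetical.

```python
import ssl


def make_client_context() -> ssl.SSLContext:
    """Build a client-side context that refuses anything older than TLS 1.2."""
    # create_default_context() turns on certificate and hostname checks.
    context = ssl.create_default_context()
    # Assumption: TLS 1.2 is an acceptable floor for your peers.
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context


if __name__ == "__main__":
    ctx = make_client_context()
    print(ctx.minimum_version, ctx.verify_mode, ctx.check_hostname)
```

A socket would then be wrapped with ctx.wrap_socket(sock, server_hostname=...) instead of the top-level ssl.wrap_socket() function.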

Not validating certificates in HTTPS connections¶

During an HTTPS request, it is important to verify SSL certificates, which is exactly what any modern web browser does nowadays. Prior to Python 2.7.9 (for Python 2) and Python 3.4.3 (for Python 3), CPython modules that dealt with HTTP interaction (such as httplib and urllib) did not verify the web site certificate against a trust store. This issue was registered as CVE-2014-9365 and is an example of CWE-295: Improper Certificate Validation, which can potentially lead to a man-in-the-middle (MITM) attack.

A de facto standard library used by the Python community for communicating over HTTP is requests. By default, it has SSL verification enabled, and a custom exception will be raised if certificate verification fails. However, it is possible to disable the verification that TLS provides:

import requests
requests.get('https://example.com', verify=False)

To find HTTP requests that fail to verify the certificate, you can run the built-in query, Request without certificate validation.
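To get a feel for what such a check looks for, here is a much simplified sketch built on Python’s standard ast module. It is not how the built-in query is implemented: the hypothetical function find_unverified_calls flags any call that passes the literal keyword verify=False, whether or not the call belongs to requests, and it does not follow values stored in variables.

```python
import ast


def find_unverified_calls(source: str) -> list:
    """Return sorted line numbers of calls that pass the keyword verify=False."""
    lines = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        for keyword in node.keywords:
            if (
                keyword.arg == "verify"
                and isinstance(keyword.value, ast.Constant)
                and keyword.value.value is False
            ):
                lines.append(node.lineno)
    return sorted(lines)


if __name__ == "__main__":
    sample = (
        "import requests\n"
        "requests.get('https://example.com', verify=False)\n"
        "requests.get('https://example.com')\n"
    )
    print(find_unverified_calls(sample))
```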

Compromising privacy in universally unique identifiers¶

Universally unique identifiers (UUID) can be generated using the uuid module. The general recommendation is to use uuid1() or uuid4() to generate a unique identifier. However, uuid1() may compromise privacy because the UUID will include the computer’s network address.

uuid4(), in contrast, creates a random UUID and is simply a convenience function. From the CPython source code:

def uuid4():
    """Generate a random UUID."""
    return UUID(bytes=os.urandom(16), version=4)

Furthermore, there are some concerns about the “safety” of UUIDs. From the Python docs:

Depending on support from the underlying platform, uuid1() may or may not return a “safe” UUID. A safe UUID is one which is generated using synchronization methods that ensure no two processes can obtain the same UUID.

To find version 1 UUIDs generated by uuid.UUID(bytes=values, version=1) or uuid.uuid1(), as in the code snippet below,

import os
import uuid

id1 = uuid.uuid1()
id2 = uuid.UUID(bytes=os.urandom(16), version=1)
id3 = uuid.UUID(None, b'1234567891234567', None, None, None, 1)

we can run the following custom query:

/**
 * @name Using a uuid1 for generating UUID
 * @description uuid1 will use machine's network address
 *              for generating UUID and may compromise privacy.
 * @kind problem
 * @tags security
 * @problem.severity error
 * @id py/using-uuid1-for-UUID
 */

import python

from CallNode call
where
  call = Value::named("uuid.uuid1").getACall()
  or
  call = Value::named("uuid.UUID").getACall() and
  (
    call.getArgByName("version").getNode().(IntegerLiteral).getValue() = 1 or
    call.getArg(5).getNode().(IntegerLiteral).getValue() = 1
  )
select call, "uuid1 will use machine's network address and may compromise privacy."
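The privacy concern is easy to see in a short experiment. The sketch below passes an explicit node value to uuid1() so that the result is predictable; FAKE_NODE is an arbitrary 48-bit number standing in for a real network address and is purely an assumption of this example.

```python
import uuid

# Assumption: an arbitrary 48-bit value standing in for a real MAC address.
FAKE_NODE = 0x123456789ABC

# uuid1() embeds the node, which is normally derived from the network address.
time_based = uuid.uuid1(node=FAKE_NODE)
# uuid4() is built from random bytes and carries no host information.
random_based = uuid.uuid4()

if __name__ == "__main__":
    print(time_based.version, hex(time_based.node))
    # The node is readable in the last 12 hex digits of the string form.
    print(str(time_based)[-12:])
    print(random_based.version)
```

Anyone who sees a version 1 UUID can read the node back from it in the same way, which is why uuid4() is usually the safer default.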

Use of ‘assert’ statements to control program flow¶

The assert statement can be used in Python to indicate when executing the code would result in program failure or the retrieval of incorrect results. It is very common to use assert in unit and integration tests. However, assert statements are disabled when you run a Python program with optimization enabled. Running python -O program.py means that assert statements are ignored, which may give a performance boost (significant or negligible, depending on how time-consuming the assert statements are).

This means that it can be unwise to rely on assert statements to define the logic of a program’s execution flow if you plan to run your Python programs with optimization enabled, or if the code may be run in an environment outside of your control. Moreover, use of assert statements can be associated with security risks. Consider this Python code snippet:

def get_customers(user):
    """Get list of customers."""
    assert is_superuser(user), "User is not a member of superuser group"
    return db.lookup('customers')

When this program is run in optimized mode, the assert statement will be ignored and any user would be able to get a list of customers, regardless of whether they are a member of the superuser group or not.

This code can be rewritten more securely, without assert statements, as:

def get_customers(user):
    """Get list of customers."""
    if not is_superuser(user):
        raise PermissionError("User is not a member of superuser group")
    return db.lookup('customers')
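You can observe the effect of optimization without leaving the interpreter, because the built-in compile() function accepts an optimize argument that mirrors the -O flag. The following sketch uses a stand-in check function rather than the get_customers example, so the names in SOURCE are assumptions made for illustration.

```python
SOURCE = """
def check(allowed):
    assert allowed, "access denied"
    return "secret data"

result = check(False)
"""


def run(optimize: int) -> str:
    """Compile SOURCE at the given optimization level and execute it."""
    namespace = {}
    code = compile(SOURCE, "<demo>", "exec", optimize=optimize)
    try:
        exec(code, namespace)
    except AssertionError as exc:
        return f"blocked: {exc}"
    return f"returned: {namespace['result']}"


if __name__ == "__main__":
    print(run(0))  # asserts active, as with plain `python`
    print(run(1))  # asserts stripped, as with `python -O`
```

With optimize=0 the assertion stops the call; with optimize=1 the assert statement is removed at compile time and the protected value is returned.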

Writing a custom query that catches all assert statements is trivial; however, it would also catch every legitimate use of assert, so you would need to look through each result manually. Alternatively, you could search only for assert statements used outside of tests. The following query takes a narrower approach and searches for all is_superuser function calls within assert statements:

import python

from AstNode ast, Assert assert
where
  assert.contains(ast) and
  ast.(Call).getFunc().(Name).getId() = "is_superuser"
select ast

Parsing external files content into Python objects¶

Python provides multiple ways to read external files and load their content into Python objects. There are exec and eval built-in functions along with pickle (or cPickle in Python 2). External packages such as PyYAML can also be used to parse YAML file contents.

Because data from external sources may not be secure, the general security guideline is that you should never unpickle, or otherwise parse into Python objects, any data received from an untrusted source.
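To see why this matters for pickle, consider the following sketch. The Payload class stands in for attacker-controlled data and uses a harmless eval call, while RestrictedUnpickler follows the find_class override pattern described in the pickle documentation; both class names and the restricted_loads helper are choices made for this example.

```python
import io
import pickle


class Payload:
    """Stand-in for attacker-controlled data."""

    def __reduce__(self):
        # pickle will call eval("6 * 7") while loading; a real attack
        # would put something far less harmless here.
        return (eval, ("6 * 7",))


class RestrictedUnpickler(pickle.Unpickler):
    """Unpickler that refuses to resolve any global (function or class)."""

    def find_class(self, module, name):
        raise pickle.UnpicklingError(f"global '{module}.{name}' is forbidden")


def restricted_loads(data: bytes):
    """Load a pickle while rejecting every reference to a global."""
    return RestrictedUnpickler(io.BytesIO(data)).load()


blob = pickle.dumps(Payload())

if __name__ == "__main__":
    print(pickle.loads(blob))  # the embedded call has been executed
    print(restricted_loads(pickle.dumps([1, "two", 3.0])))  # plain data loads
    try:
        restricted_loads(blob)
    except pickle.UnpicklingError as exc:
        print("rejected:", exc)
```

Restricting globals narrows the attack surface, but it is not a substitute for the guideline above: for untrusted input, a data-only format is the safer choice.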

There is a built-in query, Deserializing untrusted input, that highlights code that may be a security concern because it unpickles or otherwise deserializes data. The general recommendation is to avoid constructing arbitrary Python objects via pickle or via the PyYAML package if the data comes from an untrusted source (the internet in particular). PyYAML, however, has the safe_load function, which limits what can be loaded to simple Python objects.

In this Python code snippet, a class instance is created based on the YAML file contents (posted here as a string in yaml.load for brevity):

import yaml

class PasswordReader(object):

    def __init__(self, path):
        self.path = path

    def read(self):
        with open(self.path) as fh:
            return fh.readlines()

    def __repr__(self):
        return f"PasswordReader: {self.path}"


obj = yaml.load("""
!!python/object:__main__.PasswordReader
path: /etc/passwd
""")
print(obj)

Using yaml.safe_load would block construction of the class instance object unless it has been marked as safe. To be considered safe, it should inherit from yaml.YAMLObject and have a property yaml_loader set to yaml.SafeLoader.

This custom query was written to find unsafe yaml.load calls in your codebase:

/**
 * @name Using insecure yaml.load function
 * @description yaml.load function may be unsafe
 *              when loading data from untrusted sources
 * @kind problem
 * @tags security
 * @problem.severity error
 * @id py/using-yaml-load
 */

import python

from CallNode call
where call = Value::named("yaml.load").getACall()
select call.getNode(),
  "yaml.load function may be unsafe when loading data from untrusted sources. Use yaml.safe_load instead."

This type of custom query, where you search for a specific function call, is fairly common. This approach can be used for any language feature that was considered safe a few years ago but for which the current recommendation is to use a newer version or a more robust alternative.

For examples of how you can write your own queries to find the use of a certain function or import of a module, review the following built-in queries:

In most cases, you want to make sure that the older or less secure function is not used in new code. The power of writing your own custom queries, however, lies in the ability to go beyond built-in queries and to look for the functions or class methods that you decide to blacklist. You can write a new query to trigger an alert if a blacklisted function is found. LGTM.com provides automatic code review functionality to prevent bugs from ever making it to your project. If you add custom queries to your repository, then you’ll also get alerts if a pull request contains functions or class methods that you’ve blacklisted.

References¶

Here are some references to Python security resources you may find useful.