This is one of the few posts I wrote in 2019 when working for Semmle (later acquired by GitHub) that was originally published on the Semmle blog that was transformed. Python library for QL has changed quite a bit since then, however, many principles are still relevant and helpful for anyone who would want to learn more about QL. See CodeQL for Python to learn more.

Overview

In this tutorial, you’ll learn how to use QL to query a Python codebase and learn how to check for Python 2/3 compatibility. We’ll be writing alert queries, that is, queries that highlight issues in specific locations in your code. The tutorial assumes that you’re familiar with the basics of QL for Python. If not, you might want to read my previous post (Introducing the QL libraries for Python).

Python 2 and 3

As the official end of life of Python 2 approaches, more and more Python projects are being converted from Python 2 to Python 3. The majority of infrastructure projects are now on Python 3, and many are Python 3 only. At some point, you will likely need to upgrade your project. There are myriads of useful resources that can help you upgrade your project’s codebase. There are tools that can upgrade code in a semi-automatic fashion; there are linters and static code analysis tools that will help you spot code that’s not compatible with Python 3. There are also quite a few documents to help you learn what’s new in Python 3 and avoid the common pitfalls when you upgrade.

To learn more, visit the main What’s New In Python 3.0 reference page. To learn how to write code that’s compatible with both Python 2 and Python 3, visit Python-Future.

Upgrading a codebase to Python 3, or supporting both Python 2 and 3, can be a challenge. The Python 2 interpreter reports a SyntaxError for some of the new syntax features in Python 3. Some Python 2 features aren’t available in Python 3, so when the Python 3 interpreter encounters them it raises a runtime error or gives a different result.

For instance, the print statement was replaced by the print() function so running a module with a print statement under Python 3 will cause a SyntaxError. Using the print statement as if it were a function in Python 2, however, won’t raise a SyntaxError, but its behavior will be different:

  • Python 3
>>> print("value1", "value2")
value1 value2
  • Python 2
>>> print("value1", "value2")
('value1', 'value2')

In contrast, the long type was removed in Python 3 leaving only one built-in integer type named int. Hence, trying to use the long keyword in a module executed by a Python 3 interpreter, will cause a NameError at runtime:

  • Python 3
>>> isinstance(5, int)
True

>>> isinstance(5, long)
Traceback (most recent call last):
  File "<input>", line 1, in <module>
NameError: name 'long' is not defined

Using QL

A series of QL queries is shown below, highlighting some of the issues found when working with Python 2 and 3 compatibility. We explain how the queries work so you can learn how to use the QL libraries for Python, which will help you to write your own custom queries.

A Python project can be analyzed using either a Python 2 or a Python 3 interpreter. To learn more, read How is the Python version identified? on LGTM.com.

Analysis run on LGTM will spot common errors using built-in queries. To find out which version of Python was used to analyze a codebase, you can use the built-in major_version and minor_version predicates:

import python
select major_version(), minor_version()

These predicates will come in handy later on when we will be trying to find issues in the code that are relevant only for Python 2 or for Python 3.

Built-in queries

Syntax error

Syntax errors are found by the built-in Syntax error query. They prevent a module being evaluated and thus imported. An attempt to import a module with invalid syntax will fail; a SyntaxError will be raised. Syntax errors are caused by invalid Python syntax, for example:

# variables cannot contain any symbol 
# other than a digit, a letter, and an underscore
variable$ = "value"

# attempt to use an invalid increment operator
value = 10
value++

# incorrect usage of lambda
print(lambda x: x += 10)

# invalid inequality test
print(source <> target)

Note that in Python 2, it’s okay to mix tabs and spaces for code indentation. However, in Python 3, a new TabError is raised when indentation contains an inconsistent use of tabs and spaces. This type of error is also caught by the syntax errors check.

Encoding error

Encoding errors are found by the built-in Encoding error query. They prevent a module being evaluated and thus imported. An attempt to import a module with an invalid encoding will fail; a SyntaxError will be raised. Note that in Python 2, the default encoding is ASCII.

Existing custom queries

In addition to the built-in queries that are part of the core LGTM suite, there are a few custom queries that the community of QL writers has contributed. I myself wrote a new custom query shortly after I joined Semmle while I was learning how the QL libraries for Python worked. This query, Use of 'return' or 'yield' outside a function, was first published in the public GitHub QL repository and later became a built-in query that is run on LGTM.com.

Writing new QL queries

New ‘raise from’ syntax

PEP 3109 – Raising Exceptions in Python 3000 and PEP 3134 – Exception Chaining and Embedded Tracebacks introduced new syntax for the raise statement: raise [expr [from expr]]. The optional from clause can be used to chain exceptions. When from is used, the second expression must be another exception class or instance. To learn more, visit The raise statement.

Since version 3.3, you can use None to suppress the chained exception, for example:

try:
    value = 1 / 0
except Exception:
    raise Exception() from None

However, a project style guide may discourage the suppression of exception chaining using from None, for example, to maintain backwards compatibility. If this is the case, you would want to find all such occurrences.

The QL libraries for Python contain classes that are useful for finding this syntax. These can be imported and used in a custom QL query. The easiest way to find this type of raise statement is to use the Raise class. Using this QL query, we can spot when the raise from None syntax is used:

import python

from Raise r
where r.getCause().getAFlowNode().pointsTo(Value::named("None"))
select r

The .getCause() method gives us the cause of the raise statement and it’s possible to find out what object this cause points to using the .pointsTo() method. In this case, we test whether this is a None object.

To extend our query, we could check whether a valid object is being used in the from part. The object can be either None or a valid exception class or instance.

For example, this raise statement has an invalid object so, a TypeError with the message, TypeError: exception causes must derive from BaseException, is raised when it’s run:

try:
    print(1 / 0)
except Exception as exc:
    raise RuntimeError("Something happened") from "Program stopped"

This QL query will find all raise ... from ... statements where the from object is invalid.

import python

from Raise r, Value v
where r.getCause().getAFlowNode().pointsTo(v) and
      v != Value::named("None") and
      not v.getClass().getASuperType() = Value::named("BaseException")
select r, v

A class instance is a legal exception type if it inherits from the BaseException class. Thus, this query would be able to spot when an invalid object type is used in the raise from clause.

Support for unicode in identifier names

In Python 2, only ASCII characters could be used in the names of Python identifiers including, but not limited to, variables, functions, and classes. Trying to define a variable café (e-acute) in Python 2, would result in a SyntaxError: invalid syntax.

In Python 3, with PEP 3131 – Supporting Non-ASCII Identifiers, this limitation was removed and now additional characters from outside the ASCII range (see the docs) could be used in identifier names. This code is valid in Python 3:

café = object()
print(café)

However, a project style guide may prohibit the use of non-ASCII characters in identifiers to maintain backwards compatibility. To find identifiers that break this rule we have to find all identifiers that contain characters other than letters, numbers, and the underscore symbol. This can be done using a regular expression. We don’t have to worry about the validity of identifier names; a built-in query already finds any syntax errors, such as variable names that don’t start with an underscore or a letter. Since this check is relevant only for Python 3, a condition of major_version() = 3 is included. In Python 2 this issue would be caught by the query that reports all SyntaxError cases.

This QL query finds all non-ASCII Python identifiers.

import python

from string identifier, AstNode n
where 
    (
    identifier = n.(Name).getId()
    or
    identifier = n.(Attribute).getName()
    ) and not identifier.regexpMatch("[a-zA-Z_][a-zA-Z_0-9]*")
    and major_version() = 3
select n, "Non ASCII character in identifier's name"

In this query, the Name class represents the names of identifiers. The Attribute class represents the names of attribute expressions, for example, a class method. We need to use the AstNode class to access the location of each identifier in the code. However, the AstNode class doesn’t provide the identifier’s name as a string that we can test using a regular expression. To get the name as a string, we call the member predicates .getId() and .getName(). Since these are defined for a more specific type, we need to use a type cast.

Dive in: This could have been done using postfix and prefix casts. Visit the Casts help page to learn more.

Comparing objects of different types

In Python 2, objects of different types are ordered by their type names (with the exception of numbers). This results in behavior that can puzzle developers who are unfamiliar with this implementation detail.

  • Python 2:
>>> print 50 < "Text"
True

>>> [10, 20] > 'Text' 
False

This comparison essentially compares the types of the objects, that is: 'int' < 'str'. This is True because the word representing type int starts with i which is smaller than s - the str type (using lexicographic order). Likewise, because 'list' > 'str' is False, comparing a list object to a string object would return False.

In Python 3, if you use ordering comparison operators when the operands don’t have a natural ordering that makes sense, a TypeError exception is raised. This implies that there can be Python 2 code which may compare objects of different types and this would not be an issue until you run the program with a Python 3 interpreter.

For instance, this valid Python 2 code would fail in Python 3:

data = [10, 20, 30]
mapper = {"Source": "Target"}
print(data > mapper)
print(data < mapper)

We can use QL to write a custom query that finds comparisons of invalid data types.

import python

ClassValue orderedType() {
  exists(string typename | result = Value::named(typename) |
    typename = "str" or typename = "float" or typename = "list"
  )
}

from
  CompareNode compare, ControlFlowNode left, ControlFlowNode right, 
  Context ctx, Value lval, Value rval, Cmpop op

where
  compare.operands(left, op, right) and
  (
    op instanceof Lt or
    op instanceof LtE or
    op instanceof Gt or
    op instanceof GtE
  ) and
  left.pointsTo(ctx, lval, _) and
  right.pointsTo(ctx, rval, _) and
  lval.getClass() != rval.getClass() and
  lval.getClass() = orderedType() and
  rval.getClass() = orderedType()

select compare, "Invalid comparison of objects due to type difference"

At this point it might be useful to refactor the code above because the where clause gets too difficult to read. We can define a helper predicate, incomparableTypes, that would hold if comparison expressions are of incompatible types:

import python

predicate incomparableTypes(ClassValue a, ClassValue b) {
    not a = b and 
    a = orderedType() and
    b = orderedType()
}

ClassValue orderedType() {
  exists(string typename | result = Value::named(typename) |
    typename = "str" or typename = "float" or typename = "list"
  )
}

from
  CompareNode compare, ControlFlowNode left, ControlFlowNode right, 
  Context ctx, Value lval, Value rval, Cmpop op

where
  compare.operands(left, op, right) and
  (
    op instanceof Lt or
    op instanceof LtE or
    op instanceof Gt or
    op instanceof GtE
  ) and
  left.pointsTo(ctx, lval, _) and
  right.pointsTo(ctx, rval, _) and
  incomparableTypes(lval, rval)

select compare, "Invalid comparison of objects due to type difference"

The left and right expressions of the comparison can be inspected to check what type they point to using the .pointsTo() method. We use the don’t care variable _ to state that we don’t care what kind of Value the left and right expressions point to, however, they must be of a certain type.

The query above currently only supports comparing strings, floats, and lists. However, it is easy to extend it just by copying the relevant where section and changing the class types. For instance, to extend this query to include the comparison of integer objects, you would just need to add the following section:

...
ClassValue orderedType() {
  exists(string typename | result = Value::named(typename) |
    typename = "str" or typename = "float" or 
    typename = "list" or typename = "int"
  )
}
...

Octal literals syntax support

Octal literals in Python 3 can no longer be defined in the form of a number starting with 0, such as 0562, as they could be in Python 2. Python 2 has two methods for defining octal literals:

>>> print(0562 == 0o562)
True

Python 3 only supports the second of these syntaxes and using 0562 would cause a SyntaxError. Instead, you need to use a zero followed by a lower or upper case o (that is, o and O), for example, 0o562 or 0O562 . The upper case O looks very similar to zero (0) so using a lowercase o may be preferable.

Therefore, it can be helpful to search for octal literals in Python 2 that don’t use o to avoid issues after converting the codebase to Python 3. Fortunately, there’s already an existing query - Confusing octal literal - which finds octal literals with a leading 0 because they can easily be misread as decimal values. This query does just what we need.

It’s worth bearing in mind that this query doesn’t raise alerts for octal literals that are of 4, 5, or 7 digits in length. These are ignored because Python code may include Unix permission mode octals which can be safely ignored. Here we want to raise an alert for all octal literals, so we simply remove the part that filters out octals of a certain length. This QL query finds all octal literals that would raise SyntaxError in Python 3:

import python

predicate is_old_octal(IntegerLiteral i) {
  exists(string text | text = i.getText() |
    text.charAt(0) = "0" and
    not text = "00" and
    text.charAt(_) != "0" and
    exists(text.charAt(1).toInt())
  )
}

from IntegerLiteral i
where major_version() = 3 and is_old_octal(i)
select i, "Invalid octal literal"

In this query we take advantage of the exists quantifier to define a predicate which holds for any integer literal that starts with a zero digit.

Dive in: Visit the Explicit quantifiers help page to learn more about quantifiers in QL.

Delimiter in numeric literals

Python 3.6 supports using _ as a delimiter in numeric literals. This functionality was introduced in PEP 515 – Underscores in Numeric Literals. This is an example of how this works in Python 3.6:

>>> 5_000.46 == 5000.46
True

>>> 5_000 + 1_000 == 6000
True

If your project style guide prohibits using this feature, for instance, for consistency with the Python 2 code, then you could write a custom QL query that would be able to find code where _ is used in numeric literals. Running code with underscores in numeric literals using a Python 2 interpreter would raise a SyntaxError.

import python

predicate hasUnderscore(Num n) { 
    exists(int i | n.getText().charAt(i) = "_")
}

string numValue(Num n) {
  result = n.(IntegerLiteral).getValue().toString() or
  result = n.(FloatLiteral).getValue().toString()
}

from Num num
where hasUnderscore(num)
select num, num.getText() as AsInCode, numValue(num) as AsToReader

Previously, we’ve only included two items in the select statement, however, you can return an arbitrary number of items. The .getText() method gives the actual source code (for example, 5_000) whereas the numValue predicate gives the string representation of the literal with underscores removed (for example, 5000). Being able to return multiple items within the select statement is extremely handy during the debugging and query writing process.

If your project style guide is more relaxed and permits having underscores in integers only, but prohibits using underscore in floats, you can adjust the query to work solely with floats:

import python

predicate hasUnderscore(Num n) { 
    exists(int i | n.getText().charAt(i) = "_")
}

from Num num
where hasUnderscore(num)
    and not num instanceof IntegerLiteral
select num

Long integer type is not supported by Python 3

With the implementation of PEP 237 – Unifying Long Integers and Integers, the long type was merged with the int type. This means that having integer literals with L, for example, 10560L in Python 3 would raise a SyntaxError at runtime. To spot integers that wouldn’t be compatible with Python 3 in your Python 2 project, you can use this custom QL query:

import python

string getLongPostfix() { result = "L" or result = "l" }

from IntegerLiteral num
where num.getText().charAt(num.getText().length() - 1) = getLongPostfix()
select num

Dive in: .charAt string method is implemented using Java String.charAt and doesn’t support negative indexing.

The ‘cmp’ parameter for ‘sorted(list)’ is no longer supported

Running the valid Python 2 code in the example below using a Python 3 interpreter would result in a TypeError because cmp is no longer a supported keyword argument for the sorted function. Visit The Old Way Using the cmp Parameter to learn more.

def compare_as_ints(a, b):
    return a - b

sorted([50, 30, 40, 20, 10], cmp=compare_as_ints)

To spot this issue in Python code, we would need to find all calls to the built-in sorted function and see if the cmp keyword argument is being passed. This QL query will find all calls to the sorted function where the keyword argument cmp has been used.

import python
from CallNode call
where
   Value::named("sorted").getACall() = call and
   exists(call.getArgByName("cmp"))
select call, "Call to sorted built-in function with cmp keyword argument."

The CallNode class represents all calls in the code. We use this because we aren’t interested in function definitions (which are accessed through the Function class) but in function calls. Once we’ve got the sorted built-in function, it’s just a matter of finding sorted() calls with the cmp keyword argument supplied. You could reuse this QL query to find other built-in functions where the signature varies between Python versions.

Methods ‘dict.iterkeys()’, ‘dict.iteritems()’ and ‘dict.itervalues()’ are deprecated

An attempt to access any of these dictionary methods would raise an AttributeError when running the code against a Python 3 interpreter:

data = {1: 10, 2: 20}
for k, v in data.iteritems():
    print(k, v)

Therefore, we might want to write a QL query to spot when those methods access an object of dict type.

import python

string unsupportedDictMethod() {
  result = "iteritems" or
  result = "iterkeys" or
  result = "itervalues"
}

from Attribute attr, Value v
where
  attr.getValue().getAFlowNode().pointsTo(v) and
  v.getClass() = Value::named("dict") and
  attr.getAttr() = unsupportedDictMethod()
select attr, "A deprecated dictionary method was used"

As before, the AstNode root class gives us access to all elements of the source code. The Attribute class gives us access to all attributes that are accessed. The attribute object is tied to the object and this tie can be identified using the .pointsTo() method. Once we’ve found all the dictionary attributes throughout the source code, we leave only those that are no longer supported using a convenience predicate, unsupportedDictMethod.

This is the end of this tutorial. The queries posted in this post can be executed using the LGTM.com query console, however, it’s also possible to run the queries locally using Eclipse. Visit Running queries in your IDE to learn more. I hope you enjoy trying out QL on your own projects! If you have any questions, don’t hesitate to ask on the community forum.

Happy Querying!


Published

Last Updated

Category

QL

Tags