This is one of the posts in a series of introductory Python articles. These posts explain programming principles in a simplified way and show very basic Python code that should help folks learning Python to get a better understanding of some concepts. In this post, I’ll share some notes about writing and working with classes in Python that beginners may find useful.

Using an existing class

A Python class is a template for objects that model a certain entity or real-world object, its properties, and its behavior. For instance, a Python class can represent a file on a computer file system; such a class will have certain properties (or attributes) and behaviors (or methods). A file class may have properties such as name, owner, permissions, and edit time. Its methods would be actions the object can perform: for instance, we could ask a file object to rename or move itself, and it could also tell us who its parent is.

pathlib.Path is a class that you may have already used, probably without thinking much about its properties. We will start exploring the class concept with a simple Path object.

from pathlib import Path

path = Path('/home/username/data/assets.csv')

We create a new instance of the class Path by calling the class with the required arguments and assigning the result to a variable. The class constructor (the initialization method) defines which arguments can be supplied when you create an instance. So far, we have created a path object, a Path class instance that has many properties such as name and suffix. Because this instance is a Python object, we can access its properties using the dot notation.

print(path.name, path.suffix)
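A few more properties can be read with the same dot notation. The sketch below uses PurePosixPath instead of Path; that swap is an assumption made only so the printed values look the same on every operating system, and the object never touches the disk.

```python
from pathlib import PurePosixPath

# PurePosixPath handles names like Path does, but always with POSIX rules
path = PurePosixPath('/home/username/data/assets.csv')

print(path.name)    # assets.csv
print(path.suffix)  # .csv
print(path.stem)    # assets
print(path.parent)  # /home/username/data
```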

Let’s create a few more instances of the Path class. Class instances can also work together. For instance, it is possible to construct a Path object from two existing Path objects:

from pathlib import Path

user_dir = Path('/home/user')
assets_file = Path('data/assets.csv')
path = Path(user_dir, assets_file)
print(path)
# /home/user/data/assets.csv

The pathlib.Path object also has methods. In other words, we can perform a certain operation on this object (you can think of methods as functions that are run on this specific object) and get some result back, just like we do when we run a function.

# read file
path.read_text()
# remove file
path.unlink()

path = Path('/home/user/data/assets.csv')
path.relative_to('/home/user')
# PosixPath('data/assets.csv')

We have used the relative_to method of the pathlib.Path object, and this method returned another object, a PosixPath (on Windows it would be a WindowsPath), which again has its own properties and methods.
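Because a method can return another object, the result can be inspected or used right away. This is a small sketch, again using PurePosixPath (an assumption, chosen so the output does not depend on the operating system):

```python
from pathlib import PurePosixPath

path = PurePosixPath('/home/user/data/assets.csv')
relative = path.relative_to('/home/user')

# The returned value is a path object of the same class
print(type(relative).__name__)         # PurePosixPath
print(relative.parts)                  # ('data', 'assets.csv')

# Its own methods return new path objects in turn
print(relative.with_suffix('.json'))   # data/assets.json
```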

Creating a new class

So far, we have only used an existing class from the pathlib module, but in fact you have already used many Python classes without ever thinking about it. For instance, the lists and strings you create in Python are instances of the classes list and str, respectively.
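You can check this yourself with the built-in type() and isinstance() functions. The variable names below are made up for illustration:

```python
numbers = [1, 2, 3]
greeting = "hello"

print(type(numbers))               # <class 'list'>
print(type(greeting))              # <class 'str'>
print(isinstance(numbers, list))   # True

# Methods are called with the same dot notation we used on the Path object
print(greeting.upper())            # HELLO
```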

To create a class in Python, you use the keyword class. Let’s model an entity that is not present in pathlib to learn about classes and how they work. We will create a class that will represent an existing directory on the file system. Class definitions (just like function definitions written with def statements) must be executed before they can be used further in the code.

class Directory:
    """A file system directory."""

    def __init__(self, path):
        self.path = path


directory = Directory('/home/username/data')
print(directory)
# <__main__.Directory object at 0x00000284D5E6B7F0>

print(directory.path)
# /home/username/data

The only thing we need to define here is the special function __init__ (the constructor), which runs every time you create an instance of this class. Because we will work with an existing directory, we have to supply a path to that directory. Passing the file system path as the input argument to the __init__ function lets the user specify the path to the directory when creating a class instance. The directory variable then refers to a Directory class object.

The name self, used as the first argument, is a convention for referring to the object itself, and you shouldn’t worry about it for now. What is good to know, though, is that you can add your own properties to the class to provide useful information to the user. Let’s add a property that will tell us the name of that directory:

class Directory:
    """A file system directory."""

    def __init__(self, path):
        self.path = path
        self.name = Path(path).name

directory = Directory('/home/username/data')
print(directory.name)
# data
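If you are curious about self, the following sketch may help. It shows that every instance keeps its own values and that calling a method on an instance passes that instance in as self. The describe method is hypothetical and exists only for this illustration.

```python
class Directory:
    """A file system directory."""

    def __init__(self, path):
        self.path = path

    def describe(self):  # hypothetical method, for illustration only
        return f'Directory at {self.path}'


first = Directory('/home/username/data')
second = Directory('/home/username/docs')

# Each instance stores its own attribute values
print(first.path)   # /home/username/data
print(second.path)  # /home/username/docs

# These two calls are equivalent: the instance is passed as self
print(first.describe())           # Directory at /home/username/data
print(Directory.describe(first))  # Directory at /home/username/data
```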

However, when we printed the directory variable out earlier, it didn’t give us any useful information about the directory, only its internal object representation, which was rather cryptic. Thankfully, in Python you can override how objects are represented when they are printed out as well as when you access the object inside the interactive console (REPL).

class Directory:
    """A file system directory."""

    def __init__(self, path):
        self.path = path
        self.name = Path(path).name

    def __repr__(self):
        return f'Directory("{self.path}")'

    def __str__(self):
        return str(self.path)

directory = Directory('/home/username/data')
print(directory)
# /home/username/data

# inside REPL
directory
# Directory("/home/username/data")
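Outside the REPL, the two representations can be requested explicitly with the built-in str() and repr() functions. A short sketch, using a trimmed-down copy of the class above:

```python
class Directory:
    """A file system directory."""

    def __init__(self, path):
        self.path = path

    def __repr__(self):
        return f'Directory("{self.path}")'

    def __str__(self):
        return str(self.path)


directory = Directory('/home/username/data')

print(str(directory))    # /home/username/data
print(repr(directory))   # Directory("/home/username/data")

# f-strings use __str__ by default; !r asks for __repr__ instead
print(f'{directory}')    # /home/username/data
print(f'{directory!r}')  # Directory("/home/username/data")

# Containers show their items using __repr__
print([directory])       # [Directory("/home/username/data")]
```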

As you will see, it is possible not only to get some basic information about the object, but also to call other functions and do a lot more. An object property can be the result of some calculation or data look-up. Let’s add another useful property that will report the names of the files inside the directory:

class Directory:
    """A file system directory."""

    def __init__(self, path):
        self.path = path
        self.name = Path(path).name
        self.files = self.get_files()

    def get_files(self):
        return [item.name for item in Path(self.path).glob("*")
                if item.is_file()]

    def __repr__(self):
        return f'Directory("{self.path}")'

    def __str__(self):
        return str(self.path)

directory = Directory('/home/username/data')
print(directory.files)
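Note that in the class above the file list is computed once, when the instance is created. One alternative, shown here only as a sketch, is the built-in @property decorator, which recomputes the value each time it is accessed. The temporary directory and the assets.csv file are assumptions made so that the example can run anywhere.

```python
import tempfile
from pathlib import Path


class Directory:
    """A file system directory."""

    def __init__(self, path):
        self.path = path

    @property
    def files(self):
        # Recomputed on every access, so new files show up immediately
        return sorted(item.name for item in Path(self.path).iterdir()
                      if item.is_file())


with tempfile.TemporaryDirectory() as tmp:
    directory = Directory(tmp)
    before = directory.files
    Path(tmp, 'assets.csv').write_text('id,name\n')
    after = directory.files

print(before)  # []
print(after)   # ['assets.csv']
```

Which variant is preferable depends on whether you want a snapshot of the directory or an up-to-date view of it.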

Class instance methods

As you see, it can be very convenient for a developer to have a class defined in another module and then just import the module, create a class instance, and start using this object. You may wonder why we couldn’t just write a function with the help of Path. We definitely could, but the true power of classes lies in how class instance objects can interact with each other. What would otherwise take several separate functions can often be expressed more concisely and elegantly in a class definition.

To see that in action, let’s define a new class, CsvFile, that will represent a .csv file on disk.

class CsvFile:
    """A csv file."""

    def __init__(self, path):
        self.path = path

This is no different from what we have already done with the Directory class. Let’s add another property that will represent the data schema, that is, all the .csv file columns. The columns attribute will return the fields in the order they are stored in the .csv file (assuming the .csv file is comma-separated).

class CsvFile:
    """A csv file."""

    def __init__(self, path):
        self.path = path
        self.columns = self.get_columns()

    def get_columns(self):
        with open(self.path) as fh:
            return next(fh).strip().split(',')


csv_file = CsvFile("/home/username/data/assets.csv")
print(csv_file.columns)
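Splitting the header line on commas works for simple files, but it would break a column name that itself contains a comma inside quotes. A possible variant, sketched here with the standard library csv module, handles that case. The temporary sample file and its contents are assumptions made for the sake of a runnable example.

```python
import csv
import tempfile
from pathlib import Path


class CsvFile:
    """A csv file."""

    def __init__(self, path):
        self.path = path
        self.columns = self.get_columns()

    def get_columns(self):
        # newline='' is what the csv module documentation recommends
        with open(self.path, newline='') as fh:
            return next(csv.reader(fh))


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp, 'assets.csv')  # hypothetical sample file
    sample.write_text('id,"name, full",price\n1,"Pump, large",9.5\n')
    csv_file = CsvFile(sample)

print(csv_file.columns)  # ['id', 'name, full', 'price']
```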

Now, what if we would like to check if two .csv files have the same schema?

assets2019 = CsvFile("/home/username/data/assets2019.csv")
assets2020 = CsvFile("/home/username/data/assets2020.csv")

print(assets2019.columns == assets2020.columns)

But what if we want to let users of our CsvFile class compare the schemas of two .csv files without taking the order of columns into account? If we think of a database table, the order of its columns usually does not matter. Let’s create a method that compares the columns of two .csv files:

class CsvFile:
    """A csv file."""

    def __init__(self, path):
        self.path = path
        self.columns = self.get_columns()

    def get_columns(self):
        with open(self.path) as fh:
            return next(fh).strip().split(',')

    def schema_match(self, other):
        """Compare the columns of two csv files."""
        return set(self.columns) == set(other.columns)


assets2019 = CsvFile("/home/username/data/assets2019.csv")
assets2020 = CsvFile("/home/username/data/assets2020.csv")

print(assets2019.schema_match(assets2020))
print(assets2019.columns == assets2020.columns)

The method schema_match() we wrote above takes as arguments self (the class instance itself) and other (another class instance object). It accesses the columns property of each object and then compares the two sets to decide whether the schemas of the files match. If the two .csv files from the example above had the same columns stored in different orders, assets2019.columns == assets2020.columns would return False, but since their schemas are identical, assets2019.schema_match(assets2020) would return True. We could also extend our method to provide more fine-grained control over the comparison. Is the comparison case sensitive? Are duplicate fields ignored? The method could even take a comparison configuration object that defines all the comparison settings.
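As one possible extension along those lines, the sketch below adds a case_sensitive keyword argument to schema_match(). The argument name, its default value, and the temporary sample files are assumptions chosen for illustration.

```python
import tempfile
from pathlib import Path


class CsvFile:
    """A csv file."""

    def __init__(self, path):
        self.path = path
        self.columns = self.get_columns()

    def get_columns(self):
        with open(self.path) as fh:
            return next(fh).strip().split(',')

    def schema_match(self, other, case_sensitive=True):
        """Compare the columns of two csv files, ignoring their order."""
        if case_sensitive:
            return set(self.columns) == set(other.columns)
        # Lower-case every name before comparing the two sets
        return ({name.lower() for name in self.columns}
                == {name.lower() for name in other.columns})


with tempfile.TemporaryDirectory() as tmp:
    first = Path(tmp, 'assets2019.csv')
    second = Path(tmp, 'assets2020.csv')
    first.write_text('id,name,price\n')
    second.write_text('Price,ID,Name\n')
    assets2019, assets2020 = CsvFile(first), CsvFile(second)

strict = assets2019.schema_match(assets2020)
relaxed = assets2019.schema_match(assets2020, case_sensitive=False)
print(strict)   # False
print(relaxed)  # True
```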

Advanced class behavior

When modeling class behavior, it may be helpful to define how the object should behave in various situations. For instance, it is possible to compare two Path objects:

Path('/data/path') == Path('/data/path')
# True
Path('/data/path') == Path('/data/file')
# False

For the .csv files, at least in our business domain specification, we could say that two files are identical if they have the same name and the same schema. To do this, we define the special __eq__ method, which Python calls when it evaluates the == operator to decide whether two objects are equal:

from pathlib import Path

class CsvFile:
    """A csv file."""

    def __init__(self, path):
        self.path = path
        self.name = Path(path).name
        self.columns = self.get_columns()

    def get_columns(self):
        with open(self.path) as fh:
            return next(fh).strip().split(',')

    def schema_match(self, other):
        """Compare the columns of two csv files."""
        return set(self.columns) == set(other.columns)

    def __eq__(self, other):
        return self.name == other.name and self.schema_match(other)


print(CsvFile("/home/username/data/assets2019.csv") == CsvFile("/home/username/data/assets2020.csv"))
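Two details are worth knowing once you define __eq__. First, it is common to return NotImplemented when the other object is of an unrelated type. Second, a class that defines __eq__ without __hash__ can no longer be placed in a set or used as a dictionary key. The sketch below addresses both; the __hash__ implementation and the temporary sample files are assumptions for illustration, and the hash is only reliable as long as name and columns are not changed afterwards.

```python
import tempfile
from pathlib import Path


class CsvFile:
    """A csv file."""

    def __init__(self, path):
        self.path = path
        self.name = Path(path).name
        self.columns = self.get_columns()

    def get_columns(self):
        with open(self.path) as fh:
            return next(fh).strip().split(',')

    def schema_match(self, other):
        return set(self.columns) == set(other.columns)

    def __eq__(self, other):
        if not isinstance(other, CsvFile):
            # Let Python try the other operand, then fall back to False
            return NotImplemented
        return self.name == other.name and self.schema_match(other)

    def __hash__(self):
        # Must agree with __eq__: equal objects need equal hashes
        return hash((self.name, frozenset(self.columns)))


with tempfile.TemporaryDirectory() as tmp:
    for folder, header in (('old', 'id,name\n'), ('new', 'name,id\n')):
        Path(tmp, folder).mkdir()
        Path(tmp, folder, 'assets.csv').write_text(header)
    old = CsvFile(Path(tmp, 'old', 'assets.csv'))
    new = CsvFile(Path(tmp, 'new', 'assets.csv'))

print(old == new)           # True
print(old == 'assets.csv')  # False
print(len({old, new}))      # 1
```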

It can also be useful to use an object in a for loop. For instance, we can let users of the Directory class iterate through the class instance and access every file stored within it. In order to do this, one has to define the special __iter__ method.

from pathlib import Path

class Directory:
    """A file system directory."""

    def __init__(self, path):
        self.path = path
        self.name = Path(path).name
        self.files = self.get_files()

    def get_files(self):
        return [item.name for item in Path(self.path).glob("*")
                if item.is_file()]

    def __iter__(self):
        for f in self.files:
            yield f


directory = Directory("/home/username/data")

for filename in directory:
    print(filename)

# would return files one by one
scanner = iter(directory)
next(scanner)

All you essentially need to define is what kind of iteration you want to provide to your users. Because __iter__() contains a yield statement, it is a generator function: each time the next item is requested, the function runs until it reaches yield, hands back that item, and pauses until the following request. It may be helpful to learn a bit more about generators, as this technique may be necessary when dealing with large sequences or when the available memory is scarce.
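The way yield works may be easier to follow in a stand-alone generator function. The count_up_to function below is hypothetical and unrelated to the Directory class; it only illustrates how values are handed over one at a time.

```python
def count_up_to(limit):  # hypothetical generator function
    number = 1
    while number <= limit:
        yield number   # pause here and hand the value to the caller
        number += 1    # execution resumes here on the next request


counter = count_up_to(3)
print(next(counter))  # 1
print(next(counter))  # 2
print(next(counter))  # 3

# One more next(counter) would raise StopIteration;
# a for loop or list() handles that signal for you
print(list(count_up_to(3)))  # [1, 2, 3]
```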

Keep in mind that the scanner generator from the example above is different from the directory (which is an iterable) because you can only iterate over the scanner once. This is because a generator does not keep the values it has already produced; it generates them on demand. You could construct a list of files from the generator either in a standard for loop or with the help of a list comprehension:

scanner = iter(directory)
files = [item for item in scanner]

Keep in mind that when using the list comprehension, the entire list of items is created in memory. In contrast, a generator hands over the items one by one, on demand. In our Directory class the file names are already stored in the .files list, so nothing is saved here, but when the items themselves are produced lazily, this makes it possible to work with extremely large sequences without running out of memory.
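To sketch what a lazier variant could look like, the version of Directory below does not store a list of files at all; its __iter__ method reads entries from the standard library os.scandir() function as they are requested. It also shows that a generator can be consumed only once, while the directory object can be iterated again. The temporary directory and file names are assumptions made so that the example can run anywhere.

```python
import os
import tempfile
from pathlib import Path


class Directory:
    """A file system directory."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # os.scandir() hands back entries one at a time
        with os.scandir(self.path) as entries:
            for entry in entries:
                if entry.is_file():
                    yield entry.name


with tempfile.TemporaryDirectory() as tmp:
    for name in ('a.csv', 'b.csv'):
        Path(tmp, name).write_text('id\n')
    directory = Directory(tmp)

    scanner = iter(directory)
    first_pass = sorted(scanner)    # consumes the generator
    second_pass = sorted(scanner)   # nothing is left to yield
    fresh_pass = sorted(directory)  # iterating the object starts a new scan

print(first_pass)   # ['a.csv', 'b.csv']
print(second_pass)  # []
print(fresh_pass)   # ['a.csv', 'b.csv']
```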

Happy classing!

