Typesafe approach to CSV with Python

This post describes a simple and typesafe approach to reading and writing CSV with Python. We are going to use the standard Python csv module, so there is no need to install any new packages.

Let’s start with our model. A model is a class that represents a line in a CSV file. Because CSV is a tabular data format, we assume that all rows in a CSV file can be represented with the same model class. Each field of the model class represents a column in the CSV file, and each field has a type. Since every line in the CSV file is represented by a concrete object, we get type safety both when writing to and when reading from the file. Here is what a model class may look like.

import datetime as dt
from typing import Optional, Union


class Model:
    def __init__(
            self,
            int_field: Optional[Union[int, str]] = None,
            string_field: Optional[str] = None,
            datetime_field: Optional[Union[dt.datetime, str]] = None):
        # Values read from CSV arrive as strings and are parsed here
        self.int_field: Optional[int] = _to_int_or_none(int_field)
        self.string_field: Optional[str] = string_field
        self.datetime_field: Optional[dt.datetime] = _to_date_time_or_none(
            datetime_field)

There are a few requirements that the model class should satisfy to make the generalized code below work. If generalization is not needed, all of these requirements can be lifted.

  1. All field names must match the CSV file header (column names).
  2. The constructor must take all fields as named arguments. The argument names must match the CSV file header (column names).
  3. A constructor argument value can be either of the declared type (int, datetime, etc.) or a string, so the value may need to be parsed from a string before it is assigned to a model field.

Here are two examples of a CSV file that match the model above.

Example with a header.

int_field,string_field,datetime_field
1,hello 1,2020-10-30T10:41:45.968627
2,hello 2,2020-10-30T10:42:49.559897+00:00
3,hello 3,
4,,2020-10-30T10:41:45.968627
,hello 5,2020-10-30T10:41:45.968627

Example without a header.

1,hello 1,2020-10-30T10:41:45.968627
2,hello 2,2020-10-30T10:42:49.559897+00:00
3,hello 3,
4,,2020-10-30T10:41:45.968627
,hello 5,2020-10-30T10:41:45.968627

How do we write CSV?

Writing a CSV file is now simple and typesafe. Once we have the data, it takes a single function call.

import datetime as dt
from typing import List

import src.csv_gen as csv_gen

# Model is the class defined above

# Prepare dates (timezone-aware UTC, standard library only)
utc_now = dt.datetime.now(dt.timezone.utc)
utc_now_plus_1 = utc_now + dt.timedelta(days=1)

# Data is a collection of Model objects
data: List[Model] = [
    Model(int_field=1, string_field="1", datetime_field=utc_now),
    Model(int_field=2, string_field="2", datetime_field=utc_now_plus_1),
]

# Write to CSV, just one line
csv_gen.write_as_csv("data.csv", data)

How does it work?

The Python csv module can write dictionaries to CSV. If we can convert a Model object to a dictionary, we can write any object directly to CSV. Luckily, in Python this is easy: we iterate over all fields of an object and copy their values into a dictionary. The private function _to_csv_safe_dict() does exactly that. It is also the place for custom logic that decides which fields to include and how to convert values to a CSV-friendly format.

import datetime as dt


def _to_csv_safe_dict(obj: object, classkey=None):
    if isinstance(obj, dict):
        data = {}
        for (k, v) in obj.items():
            data[k] = _to_csv_safe_dict(v, classkey)
        return data
    if isinstance(obj, (dt.datetime, dt.date)):
        return obj.isoformat()
    if hasattr(obj, "_ast"):
        return _to_csv_safe_dict(obj._ast(), classkey)
    if hasattr(obj, "__iter__") and not isinstance(obj, str):
        return [_to_csv_safe_dict(v, classkey) for v in obj]
    if hasattr(obj, "__dict__"):
        data = {
            k: _to_csv_safe_dict(v, classkey)
            for k, v in obj.__dict__.items()
            if not callable(v) and not k.startswith("_")
        }
        if classkey and hasattr(obj, "__class__"):
            data[classkey] = obj.__class__.__name__
        return data
    return obj
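
To make the conversion concrete, here is a reduced sketch of the same idea for flat objects only. The Event class and the to_flat_dict() function are hypothetical names used for illustration; they are a simplified stand-in for _to_csv_safe_dict(), not a replacement for it.

```python
import datetime as dt


class Event:
    """Hypothetical model used only for this illustration."""

    def __init__(self, event_id: int, name: str, created_at: dt.datetime):
        self.event_id = event_id
        self.name = name
        self.created_at = created_at
        self._cache = {}  # private field, should not reach the CSV


def to_flat_dict(obj: object) -> dict:
    """Simplified stand-in for _to_csv_safe_dict(): flat objects only."""
    result = {}
    for key, value in vars(obj).items():
        # Skip private fields and callables
        if key.startswith("_") or callable(value):
            continue
        # Dates and datetimes are written in ISO 8601 format
        if isinstance(value, (dt.datetime, dt.date)):
            value = value.isoformat()
        result[key] = value
    return result


event = Event(1, "hello", dt.datetime(2020, 10, 30, 10, 41, 45))
row = to_flat_dict(event)
print(row)
# {'event_id': 1, 'name': 'hello', 'created_at': '2020-10-30T10:41:45'}
```

The resulting dictionary has one key per public field, which is the shape csv.DictWriter expects for a row.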

Once we can convert any object to a dictionary, we can write a function that takes a list of objects and writes it to a CSV file. If we do not want all object fields to be written, we can pass the optional columns argument to specify exactly which fields to write.

import csv
import datetime as dt
from typing import List, Optional


def write_as_csv(
        file_path: str,
        data: List[object],
        columns: Optional[List[str]] = None):
    if not data:
        raise ValueError("data is empty or None")
    if not columns:
        # Infer columns from the fields of the first item in data
        columns = list(vars(data[0]))

    with open(file_path, 'w', newline='') as file:
        writer = csv.DictWriter(
            file,
            fieldnames=columns,
            extrasaction="ignore")
        writer.writeheader()
        writer.writerows([_to_csv_safe_dict(data_item) for data_item in data])
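
The effect of the columns argument can be seen in isolation with a small sketch. It writes to an in-memory buffer instead of a file, and the sample rows are made up for illustration. Because the writer is created with extrasaction="ignore", keys that are not listed in fieldnames are silently dropped instead of raising an error.

```python
import csv
import io

# Sample rows, already converted to dictionaries (made-up values)
rows = [
    {"int_field": 1, "string_field": "hello 1",
     "datetime_field": "2020-10-30T10:41:45"},
    {"int_field": 2, "string_field": "hello 2",
     "datetime_field": "2020-10-31T10:41:45"},
]

buffer = io.StringIO()
writer = csv.DictWriter(
    buffer,
    fieldnames=["int_field", "string_field"],  # subset of the keys
    extrasaction="ignore")  # drop keys that are not in fieldnames
writer.writeheader()
writer.writerows(rows)

output = buffer.getvalue()
print(output)
```

Only the int_field and string_field columns appear in the output; datetime_field is left out of both the header and the rows.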

How do we read CSV?

Reading a CSV file is just as easy and typesafe.

import src.csv_gen as csv_gen
from typing import List


data: List[Model] = csv_gen.read_from_csv(
    path,
    Model,
    has_header=False)

How does it work?

The Python csv module can read CSV rows as a collection of dictionaries. If we can convert each dictionary to an object while reading, we can construct and return a strongly typed collection of objects. How can we do that? The simplest approach is to read dictionary values and assign them to model fields in place. Unfortunately, this is far from generalized, because the reading code then works for only one model type. Another way is to pass the dictionary to the model class constructor and let the constructor read it and populate the fields. However, there is a slightly more elegant solution. Since our Model class has named constructor arguments, we can spread the dictionary into the constructor arguments. The model class then only needs to assign the constructor arguments to its fields, parsing string values where needed.

import csv
from typing import List, Optional


def read_from_csv(
        file_path: str,
        type_to_read: type,
        has_header=True,
        columns: Optional[List[str]] = None):
    if not type_to_read and not columns:
        raise ValueError("type_to_read or columns is needed")

    if not columns:
        # Infer columns from a default-constructed instance of type_to_read
        columns = list(vars(type_to_read()))

    data = []
    with open(file_path, 'r', newline='') as file:
        reader = csv.DictReader(
            file,
            fieldnames=columns)
        if has_header:
            # Column names are passed explicitly, so the header row
            # is read as data and has to be skipped
            next(reader, None)
        for row in reader:
            # row is a dictionary. We spread dictionary key-value pairs
            # to constructor arguments
            data.append(type_to_read(**row))

    return data
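
The dictionary spreading step can be tried on its own with a short sketch. The Point class and the sample CSV text are hypothetical and exist only for this illustration; the text is read from an in-memory buffer rather than a file.

```python
import csv
import io
from typing import Optional


class Point:
    """Hypothetical model used only for this illustration."""

    def __init__(self, x: Optional[str] = None, label: Optional[str] = None):
        # CSV cells arrive as strings; empty cells become None
        self.x: Optional[int] = int(x) if x not in (None, "") else None
        self.label = label


# Header-less CSV text; the second row has an empty first cell
csv_text = "1,first\n,second\n"

# Column names are inferred from a default-constructed instance
columns = list(vars(Point()))

reader = csv.DictReader(io.StringIO(csv_text), fieldnames=columns)
# Each row is a dict such as {'x': '1', 'label': 'first'};
# ** spreads it into the named constructor arguments
points = [Point(**row) for row in reader]

for point in points:
    print(point.x, point.label)
```

This only works because the constructor argument names match the column names, which is why the model requirements listed earlier matter.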

Summary

The approach described here reads and writes CSV files in a generalized and typesafe way. It can be extended with more logic if necessary, or simplified if generalization is not needed. Feel free to adapt it to your needs. The GitHub repository PavelHudau/simple-safe-csv contains all the code and tests.
