2026-02-17

The Hidden BOM: Debugging Python Errors Parsing Excel CSV Files

I was batch converting TIFFs using a CSV manifest when my script failed with a missing column error:

ERROR CSV missing required columns: ['ark_id']. Found: ['\ufeffark_id', 'work_id', 'http_path', ...]

The required headers were right there in my text editor, but Python saw \ufeffark_id, not ark_id, as the ERROR message showed. That \ufeff is a BOM (Byte Order Mark), which Excel had silently added when saving the CSV file.

The CSV file looked fine in my text editor, and the code was straightforward:

import csv

with open('manifest_data.csv', encoding='utf-8') as f:
    reader = csv.DictReader(f, delimiter='\t')
    for row in reader:
        print(row['ark_id'])

The problem revealed itself as soon as I printed the actual keys:

dict_keys(['\ufeffark_id', 'work_id', 'http_path', 's8_path', 'img_index', 'collection_name'])

What is a BOM?

A BOM is a special Unicode character (U+FEFF) placed at the beginning of a text file to indicate its encoding. For UTF-8, it’s the three-byte sequence EF BB BF. Unlike UTF-16, UTF-8 doesn’t actually need a BOM, since there’s no byte order ambiguity, but Microsoft applications add one anyway as a signature. Crucially, the BOM is invisible in most text editors and in Excel itself.
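As a quick sanity check (not part of the original debugging session), you can confirm the byte sequence and the utf-8-sig behavior in a Python REPL:

```python
# U+FEFF encoded as UTF-8 is exactly the three-byte signature EF BB BF.
bom = '\ufeff'
print(bom.encode('utf-8'))  # b'\xef\xbb\xbf'

# The utf-8-sig codec treats a leading BOM as a signature and drops it:
print(b'\xef\xbb\xbfark_id'.decode('utf-8-sig'))  # ark_id
```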

Detecting BOM

Once I knew what to look for, checking was straightforward.

xxd -l 3 file.csv

If the output starts with efbb bf, the file has a BOM.

Alternatively, use file:

file file.csv

It will explicitly report UTF-8 Unicode (with BOM) text when a BOM is present.
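If you'd rather check from Python than the shell, here is a small sketch (the has_utf8_bom name is my own):

```python
import codecs

def has_utf8_bom(filepath):
    """Return True if the file starts with the UTF-8 BOM (EF BB BF)."""
    with open(filepath, 'rb') as f:
        return f.read(3) == codecs.BOM_UTF8  # b'\xef\xbb\xbf'
```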

Fix

Python has a built-in encoding specifically for this: utf-8-sig. It automatically strips the BOM on read:

import csv

with open('manifest_data.csv', encoding='utf-8-sig') as f:
    reader = csv.DictReader(f, delimiter='\t')
    for row in reader:
        print(row['ark_id'])
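If you can't control the open() call — say a caller hands you an already-open stream — another workaround is stripping the BOM from the parsed fieldnames. A sketch with simulated data (the sample rows are mine, not from the real manifest):

```python
import csv
import io

# Simulated Excel output: BOM glued onto the first header.
data = '\ufeffark_id\twork_id\nark:/12345/x1\tw1\n'

reader = csv.DictReader(io.StringIO(data), delimiter='\t')
# Accessing .fieldnames reads the header row; reassigning it is supported.
reader.fieldnames = [name.lstrip('\ufeff') for name in reader.fieldnames]
row = next(reader)
print(row['ark_id'])  # ark:/12345/x1
```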

Writing CSVs for Excel

The flip side: if you're generating CSVs that will be opened in Excel, you might actually want a BOM. Without it, Excel sometimes mangles non-ASCII characters. Use utf-8-sig when writing, too:

import csv

with open('output.csv', 'w', encoding='utf-8-sig', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['名前', 'date', 'identifier'])
    writer.writerow(['地図', '1920', 'ark:/12345/x1'])
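You can verify the signature actually lands in the output with a throwaway check against a temp file (the temp-file plumbing here is mine):

```python
import csv
import os
import tempfile

# Write a tiny CSV with utf-8-sig, then inspect the raw leading bytes.
fd, path = tempfile.mkstemp(suffix='.csv')
os.close(fd)
with open(path, 'w', encoding='utf-8-sig', newline='') as f:
    csv.writer(f).writerow(['名前', 'date'])
with open(path, 'rb') as f:
    head = f.read(3)
print(head)  # b'\xef\xbb\xbf'
os.remove(path)
```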

The BOM is one of those invisible gotchas that’s easy to miss in testing but breaks production code. It’s worth checking for whenever CSV parsing behaves unexpectedly.

For any CSV processing where the source might be Excel or another Windows application, I now default to utf-8-sig:

import csv

def read_csv(filepath):
    with open(filepath, encoding='utf-8-sig') as f:
        return list(csv.DictReader(f))