4 March, 2020

Using Blob storage GZip files in Python Azure Functions

I recently had a project where I needed to process a lot of GZipped files, and decided to give Azure Functions a whirl.

I had the data flowing into Azure blob storage, but when I tried to process the GZip files using the below, I kept getting an error: Exception: OSError: Not a gzipped file (b'\x1f\xef').

def main(msg: func.QueueMessage, inputblob: func.InputStream) -> None:
    filename = msg.get_body().decode('utf-8')
    logging.info(f"Working on queue item: {filename}")

    with gzip.open(inputblob, mode='rt', encoding="utf-8") as gzf:
        # Do stuff

By logging out with print(inputblob.read()), I saw something completely unexpected: the first few bytes of the input stream were \x1f\xef\xbf\xbd\x08\x00\x00\x00\x00\x00\x00\x00\xef\xbf\xbd, which had lots of UTF-8 BOM sequences (\xef\xbf\xbd).

Long story short, turns out Azure was trying to serve the GZip binary as a UTF-8 string, and there's a hidden option you can set in your function.json called dataType, which stops this sillyness.

{
    "type": "blob",
    "direction": "in",
    "name": "inputblob",
    "dataType": "binary",

    "path": "files/{queueTrigger}",
    "connection": "MyStorage"
},

Once I set the dataType to binary, gzip.open() worked fine.