{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Parquet\n", "\n", "When it comes to storing tabular data in Python, there are a lot of choices, many of which we've talked about before (HDF5, CSV, dta, etc.). However, the storage format I think is best today (October 2023) is `parquet`.\n", "\n", "`parquet` has a number of strengths:\n", "\n", "- **It preserves type information**: Unlike a CSV, `parquet` files remember which columns are numeric, which are categorical, etc., so when you re-load your data you can be assured it will look the same as it did when you saved it.\n", "- **It's fast:** That type information means that, when loading, Python doesn't have to try to infer the type of each column.\n", "- **It's small:** `parquet` compresses your data automatically (and no, that doesn't slow it down -- in fact, it makes loading faster. Reading data from disk is so slow compared to decompression that it's quicker to load compressed data into RAM and then decompress it than to transfer larger uncompressed files).\n", "- **It's portable:** `parquet` is not a Python-specific format -- it's an Apache Software Foundation standard.\n", "- **It's built for distributed computing:** `parquet` was actually invented to support Hadoop distributed computing.\n", "\n", "And `pyarrow`, one of the libraries that powers it, is slated to become a required dependency of pandas in pandas 3.0!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Using parquet\n", "\n", "Here are some code examples for saving and loading `parquet` files:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# Make a toy dataset\n", "\n", "import pandas as pd\n", "import numpy.random as npr\n", "import random\n", "import string\n", "\n", "size = 100000\n", "\n", "df = pd.DataFrame(\n", "    {\n", "        \"a\": npr.randint(0, 10000, size=size),\n", "        \"b\": [\n", "            \"\".join(random.choices(string.ascii_uppercase, k=10)) for _ in range(size)\n", "        ],\n", "    }\n", ")" ] },
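{ "cell_type": "markdown", "metadata": {}, "source": [ "A quick aside: pandas doesn't read or write `parquet` itself -- it delegates to an engine library, either `pyarrow` or `fastparquet` (it tries `pyarrow` first by default). Here's a small sketch to check which engines are available in your environment; the version numbers you see will differ from mine:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Check which parquet engine libraries are installed.\n", "# pandas tries pyarrow first, then falls back to fastparquet.\n", "for engine in [\"pyarrow\", \"fastparquet\"]:\n", "    try:\n", "        module = __import__(engine)\n", "        print(f\"{engine} {module.__version__} is available\")\n", "    except ImportError:\n", "        print(f\"{engine} is not installed\")" ] },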
{ "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "      a           b\n", "0  3576  CYPFOXEUOK\n", "1  8934  MTBQBRTSNG\n", "2  7428  HDOTAHEISG\n", "3  2046  DZVSMYDWBP\n", "4  7052  TBXYGKMYIP" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head()" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "a       int64\n", "b    category\n", "dtype: object" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# We can store \"b\" as a categorical (analogous to factors in R).\n", "# This is helpful, for reasons we'll discuss on Tuesday,\n", "# and parquet preserves the categorical type across a save and reload.\n", "df[\"b\"] = pd.Categorical(df[\"b\"])\n", "df.dtypes" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/Users/nce8/opt/miniconda3/lib/python3.11/site-packages/pyarrow/pandas_compat.py:373: FutureWarning: is_sparse is deprecated and will be removed in a future version. Check `isinstance(dtype, pd.SparseDtype)` instead.\n", "  if _pandas_api.is_sparse(col):\n" ] } ], "source": [ "# Save\n", "df.to_parquet(\"test.parquet\")" ] },
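{ "cell_type": "markdown", "metadata": {}, "source": [ "To put the \"it's small\" claim from the intro to a quick test, here's a sketch that saves the same data as a CSV and compares sizes on disk (`os.path.getsize` reports bytes). With this toy dataset of random strings the gap is modest -- compression shines on real data with repeated values -- but parquet still typically comes out ahead:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Compare the size on disk of the same data as CSV and as parquet.\n", "# Exact sizes will vary from run to run.\n", "import os\n", "\n", "df.to_csv(\"test.csv\", index=False)\n", "print(f\"CSV:     {os.path.getsize('test.csv'):,} bytes\")\n", "print(f\"parquet: {os.path.getsize('test.parquet'):,} bytes\")\n", "os.remove(\"test.csv\")" ] },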
{ "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "      a           b\n", "0  3576  CYPFOXEUOK\n", "1  8934  MTBQBRTSNG\n", "2  7428  HDOTAHEISG\n", "3  2046  DZVSMYDWBP\n", "4  7052  TBXYGKMYIP" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Reload\n", "df2 = pd.read_parquet(\"test.parquet\")\n", "df2.head()" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "a       int64\n", "b    category\n", "dtype: object" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Note that column b is still a categorical.\n", "\n", "df2.dtypes" ] },
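{ "cell_type": "markdown", "metadata": {}, "source": [ "Beyond eyeballing the dtypes, you can check that the whole round-trip was lossless. A minimal sketch using pandas' built-in testing helper, which raises an `AssertionError` if the two DataFrames differ in values, dtypes, or index (and is silent otherwise):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Verify the reloaded DataFrame matches the original exactly,\n", "# including dtypes.\n", "pd.testing.assert_frame_equal(df, df2)\n", "print(\"Round-trip preserved all values and dtypes!\")" ] },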
{ "cell_type": "markdown", "metadata": {}, "source": [ "You can also easily load subsets of columns:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "            b\n", "0  CYPFOXEUOK\n", "1  MTBQBRTSNG\n", "2  HDOTAHEISG\n", "3  DZVSMYDWBP\n", "4  TBXYGKMYIP" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2 = pd.read_parquet(\"test.parquet\", columns=[\"b\"])\n", "df2.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Chunking with parquet\n", "\n", "Parquet supports chunked reading, though not quite as easily as a CSV.\n", "\n", "First, you have to save your file in chunks of the size you want. Parquet always stores rows in chunks (called *row groups*), but by default each row group holds 50,000,000 rows, which (if you're chunking) may be too many:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/Users/nce8/opt/miniconda3/lib/python3.11/site-packages/pyarrow/pandas_compat.py:373: FutureWarning: is_sparse is deprecated and will be removed in a future version. Check `isinstance(dtype, pd.SparseDtype)` instead.\n", "  if _pandas_api.is_sparse(col):\n" ] } ], "source": [ "# Save in 50,000-row chunks, so our 100,000-row\n", "# DataFrame should be saved as two row groups.\n", "\n", "df.to_parquet(\"test.parquet\", row_group_size=50_000)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "          a           b\n", "0      3576  CYPFOXEUOK\n", "1      8934  MTBQBRTSNG\n", "2      7428  HDOTAHEISG\n", "3      2046  DZVSMYDWBP\n", "4      7052  TBXYGKMYIP\n", "...     ...         ...\n", "49995  2048  TQAKZQYDAW\n", "49996  2461  LAQXYKGFDK\n", "49997  9509  BHVESYGUML\n", "49998  3626  WGONPSEPYT\n", "49999   681  HKIDHJBLZQ\n", "\n", "[50000 rows x 2 columns]\n", "          a           b\n", "0      6435  XFOYVSEBHR\n", "1      3175  UNHTHMZUNA\n", "2      1083  EPVGFGZTSE\n", "3      7424  KWAHGCUPYJ\n", "4       296  RONMJTKGEA\n", "...     ...         ...\n", "49995  8489  BDWRLTZLSM\n", "49996  3943  WPRFOWTWLQ\n", "49997  9784  LUBQICKYQA\n", "49998  6256  EUITUAVZQE\n", "49999  9411  RPGOEHYMLW\n", "\n", "[50000 rows x 2 columns]\n" ] } ], "source": [ "# Then we have to read it in using the `fastparquet`\n", "# library itself (there's no way to do this directly from\n", "# pandas, I'm afraid):\n", "\n", "from fastparquet import ParquetFile\n", "\n", "pf = ParquetFile(\"test.parquet\")\n", "\n", "# Iterate over row groups; each is returned as a pandas DataFrame.\n", "for rg in pf.iter_row_groups():\n", "    print(rg)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "          a           b\n", "643    6558  EZLQUOMYBN\n", "31920  1278  CBGUFGTRKI\n", "44151  7382  QEEXGJDDHX\n", "35662  5396  JPSGFWQHTU\n", "15697  7468  GTTSBHPKLM\n", "25991  6310  GLTBRLPEKU\n", "7456   5502  KNBMNVMWWA\n", "49984  5758  BNTBTKXBDQ\n", "41328  8714  AVCGRMQVIN\n", "43772  8390  WPQYXPULAH" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# So you can filter and work with each row group if you want:\n", "all_rows_w_even_a = []\n", "for rg in pf.iter_row_groups():\n", "    rg = rg.loc[(rg[\"a\"] % 2) == 0]\n", "    all_rows_w_even_a.append(rg)\n", "\n", "even_rows = pd.concat(all_rows_w_even_a)\n", "even_rows.sample(10)" ] },
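{ "cell_type": "markdown", "metadata": {}, "source": [ "`fastparquet` isn't the only way to read in pieces: if you have `pyarrow` installed instead, it can also stream a parquet file chunk by chunk. Here's a minimal sketch using `pyarrow.parquet.ParquetFile.iter_batches`; note that `batch_size` counts rows and doesn't have to match the file's row group size:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# The same chunked read, but with pyarrow instead of fastparquet.\n", "# Each batch is a pyarrow RecordBatch; .to_pandas() converts it.\n", "import pyarrow.parquet as pq\n", "\n", "pa_file = pq.ParquetFile(\"test.parquet\")\n", "for batch in pa_file.iter_batches(batch_size=50_000):\n", "    print(batch.to_pandas().shape)" ] },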
{ "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "# Clean up\n", "import os\n", "\n", "os.remove(\"test.parquet\")" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.4" } }, "nbformat": 4, "nbformat_minor": 4 }