{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Parquet\n", "\n", "When it comes to storing tabular data in Python, there are a lot of choices, many of which we've talked about before (HDF5, CSV, dta, etc.). However, the storage format I think is best today (October 2023) is `parquet`.\n", "\n", "`parquet` has a number of strengths:\n", "\n", "- **It preserves type information**: Unlike a CSV, `parquet` files remember which columns are numeric, which are categorical, etc., so when you re-load your data you can be assured it will look the same as it did when you saved it.\n", "- **It's fast:** That type information means that, when loading, Python doesn't have to try to infer the type of each column.\n", "- **It's small:** `parquet` compresses your data automatically (and no, that doesn't slow it down -- in fact, it makes loading faster. Reading data from disk is so slow compared to decompression that it's quicker to load compressed data into RAM and then decompress it than to transfer larger uncompressed files).\n", "- **It's portable:** `parquet` is not a Python-specific format -- it's an Apache Software Foundation standard.\n", "- **It's built for distributed computing:** `parquet` was actually invented to support Hadoop distributed computing.\n", "\n", "And `pyarrow`, one of the libraries that powers it, is slated to become a required dependency of pandas in pandas 3.0!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Using parquet\n", "\n", "Here are some code examples for saving and loading `parquet` files:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# Make a toy dataset\n", "\n", "import pandas as pd\n", "import numpy.random as npr\n", "import random\n", "import string\n", "\n", "size = 100000\n", "\n", "df = pd.DataFrame(\n", "    {\n", "        \"a\": npr.randint(0, 10000, size=size),\n", "        \"b\": [\n", "            \"\".join(random.choices(string.ascii_uppercase, k=10)) for _ in range(size)\n", "        ],\n", "    }\n", ")" ] },
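{ "cell_type": "markdown", "metadata": {}, "source": [ "A quick aside: pandas doesn't read or write `parquet` itself -- it delegates to an engine library, either `pyarrow` or `fastparquet` (it tries `pyarrow` first by default). Here's a small sketch to check which engines are available in your environment; the version numbers you see will differ from mine:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Check which parquet engine libraries are installed.\n", "# pandas tries pyarrow first, then falls back to fastparquet.\n", "for engine in [\"pyarrow\", \"fastparquet\"]:\n", "    try:\n", "        module = __import__(engine)\n", "        print(f\"{engine} {module.__version__} is available\")\n", "    except ImportError:\n", "        print(f\"{engine} is not installed\")" ] },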
{ "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "      a           b\n", "0  3576  CYPFOXEUOK\n", "1  8934  MTBQBRTSNG\n", "2  7428  HDOTAHEISG\n", "3  2046  DZVSMYDWBP\n", "4  7052  TBXYGKMYIP" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head()" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "a       int64\n", "b    category\n", "dtype: object" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# We can store \"b\" as a categorical (analogous to factors in R).\n", "# This is helpful, for reasons we'll discuss on Tuesday,\n", "# and parquet preserves the categorical type across a save and reload.\n", "df[\"b\"] = pd.Categorical(df[\"b\"])\n", "df.dtypes" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/Users/nce8/opt/miniconda3/lib/python3.11/site-packages/pyarrow/pandas_compat.py:373: FutureWarning: is_sparse is deprecated and will be removed in a future version. Check `isinstance(dtype, pd.SparseDtype)` instead.\n", "  if _pandas_api.is_sparse(col):\n" ] } ], "source": [ "# Save\n", "df.to_parquet(\"test.parquet\")" ] },
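{ "cell_type": "markdown", "metadata": {}, "source": [ "To put the \"it's small\" claim from the intro to a quick test, here's a sketch that saves the same data as a CSV and compares sizes on disk (`os.path.getsize` reports bytes). With this toy dataset of random strings the gap is modest -- compression shines on real data with repeated values -- but parquet still typically comes out ahead:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Compare the size on disk of the same data as CSV and as parquet.\n", "# Exact sizes will vary from run to run.\n", "import os\n", "\n", "df.to_csv(\"test.csv\", index=False)\n", "print(f\"CSV:     {os.path.getsize('test.csv'):,} bytes\")\n", "print(f\"parquet: {os.path.getsize('test.parquet'):,} bytes\")\n", "os.remove(\"test.csv\")" ] },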
{ "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "      a           b\n", "0  3576  CYPFOXEUOK\n", "1  8934  MTBQBRTSNG\n", "2  7428  HDOTAHEISG\n", "3  2046  DZVSMYDWBP\n", "4  7052  TBXYGKMYIP" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Reload\n", "df2 = pd.read_parquet(\"test.parquet\")\n", "df2.head()" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "a       int64\n", "b    category\n", "dtype: object" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Note that column b is still a categorical.\n", "\n", "df2.dtypes" ] },
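{ "cell_type": "markdown", "metadata": {}, "source": [ "Beyond eyeballing the dtypes, you can check that the whole round-trip was lossless. A minimal sketch using pandas' built-in testing helper, which raises an `AssertionError` if the two DataFrames differ in values, dtypes, or index (and is silent otherwise):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Verify the reloaded DataFrame matches the original exactly,\n", "# including dtypes.\n", "pd.testing.assert_frame_equal(df, df2)\n", "print(\"Round-trip preserved all values and dtypes!\")" ] },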
{ "cell_type": "markdown", "metadata": {}, "source": [ "You can also easily load subsets of columns:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "            b\n", "0  CYPFOXEUOK\n", "1  MTBQBRTSNG\n", "2  HDOTAHEISG\n", "3  DZVSMYDWBP\n", "4  TBXYGKMYIP" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2 = pd.read_parquet(\"test.parquet\", columns=[\"b\"])\n", "df2.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Chunking with parquet\n", "\n", "Parquet supports chunked reading, though not quite as easily as a CSV.\n", "\n", "First, you have to save your file in chunks of the size you want. Parquet always stores rows in chunks (called *row groups*), but by default each row group holds 50,000,000 rows, which (if you're chunking) may be too many:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/Users/nce8/opt/miniconda3/lib/python3.11/site-packages/pyarrow/pandas_compat.py:373: FutureWarning: is_sparse is deprecated and will be removed in a future version. Check `isinstance(dtype, pd.SparseDtype)` instead.\n", "  if _pandas_api.is_sparse(col):\n" ] } ], "source": [ "# Save in 50,000-row chunks, so our 100,000-row\n", "# DataFrame should be saved as two row groups.\n", "\n", "df.to_parquet(\"test.parquet\", row_group_size=50_000)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "          a           b\n", "0      3576  CYPFOXEUOK\n", "1      8934  MTBQBRTSNG\n", "2      7428  HDOTAHEISG\n", "3      2046  DZVSMYDWBP\n", "4      7052  TBXYGKMYIP\n", "...     ...         ...\n", "49995  2048  TQAKZQYDAW\n", "49996  2461  LAQXYKGFDK\n", "49997  9509  BHVESYGUML\n", "49998  3626  WGONPSEPYT\n", "49999   681  HKIDHJBLZQ\n", "\n", "[50000 rows x 2 columns]\n", "          a           b\n", "0      6435  XFOYVSEBHR\n", "1      3175  UNHTHMZUNA\n", "2      1083  EPVGFGZTSE\n", "3      7424  KWAHGCUPYJ\n", "4       296  RONMJTKGEA\n", "...     ...         ...\n", "49995  8489  BDWRLTZLSM\n", "49996  3943  WPRFOWTWLQ\n", "49997  9784  LUBQICKYQA\n", "49998  6256  EUITUAVZQE\n", "49999  9411  RPGOEHYMLW\n", "\n", "[50000 rows x 2 columns]\n" ] } ], "source": [ "# Then we have to read it in using the `fastparquet`\n", "# library itself (there's no way to do this directly from\n", "# pandas, I'm afraid):\n", "\n", "from fastparquet import ParquetFile\n", "\n", "pf = ParquetFile(\"test.parquet\")\n", "\n", "# Iterate over row groups; each is returned as a pandas DataFrame.\n", "for rg in pf.iter_row_groups():\n", "    print(rg)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "          a           b\n", "643    6558  EZLQUOMYBN\n", "31920  1278  CBGUFGTRKI\n", "44151  7382  QEEXGJDDHX\n", "35662  5396  JPSGFWQHTU\n", "15697  7468  GTTSBHPKLM\n", "25991  6310  GLTBRLPEKU\n", "7456   5502  KNBMNVMWWA\n", "49984  5758  BNTBTKXBDQ\n", "41328  8714  AVCGRMQVIN\n", "43772  8390  WPQYXPULAH" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# So you can filter and work with each row group if you want:\n", "all_rows_w_even_a = []\n", "for rg in pf.iter_row_groups():\n", "    rg = rg.loc[(rg[\"a\"] % 2) == 0]\n", "    all_rows_w_even_a.append(rg)\n", "\n", "even_rows = pd.concat(all_rows_w_even_a)\n", "even_rows.sample(10)" ] },
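{ "cell_type": "markdown", "metadata": {}, "source": [ "`fastparquet` isn't the only way to read in pieces: if you have `pyarrow` installed instead, it can also stream a parquet file chunk by chunk. Here's a minimal sketch using `pyarrow.parquet.ParquetFile.iter_batches`; note that `batch_size` counts rows and doesn't have to match the file's row group size:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# The same chunked read, but with pyarrow instead of fastparquet.\n", "# Each batch is a pyarrow RecordBatch; .to_pandas() converts it.\n", "import pyarrow.parquet as pq\n", "\n", "pa_file = pq.ParquetFile(\"test.parquet\")\n", "for batch in pa_file.iter_batches(batch_size=50_000):\n", "    print(batch.to_pandas().shape)" ] },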
{ "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "# Clean up\n", "import os\n", "\n", "os.remove(\"test.parquet\")" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.4" } }, "nbformat": 4, "nbformat_minor": 4 }