{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Ways To Store and Read Data: Binary Files\n", "\n", "In our last reading we talked about plaintext files: files that store data in a human-readable format. In this reading, we will talk about the second type of file you are likely to come across in your career—binary files." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Binary files differ from plaintext files in that the 1s and 0s in the file are not meant to be interpreted using common text encodings—like ASCII or Unicode—in which, for example, the digit 1 is always represented by `00110001`, 2 by `00110010`, 3 by `00110011`, etc.\n", "\n", "Instead, a binary data file can only be interpreted by software specifically written to read its particular format, like Microsoft Excel. As a result, if you try to open one in a normal text editor, which will try to interpret the 1s and 0s as Unicode text, you'll see gibberish.\n", "\n", "To illustrate, let's save a version of our small world dataset in the binary `.dta` format, then try to open it in our VS Code text editor:\n" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "world = pd.read_csv(\"data/world-very-small.csv\")\n", "world.to_stata(\"data/world-very-small.dta\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now if I try to open that `.dta` file, the first thing that happens is that I see this warning:\n", "\n", "\n", "\n", "And if I ask it to open the file anyway, all I see is this:\n", "\n", "\n", "\n", "But that's not because the file is corrupt—indeed, if I ask pandas to open that file with the proper function, I get back our usual table:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "|   | index | country | region | gdppcap08 | polityIV |
|---|---|---|---|---|---|
| 0 | 0 | Brazil | S. America | 10296 | 18 |
| 1 | 1 | Germany | W. Europe | 35613 | 20 |
| 2 | 2 | Mexico | N. America | 14495 | 18 |
| 3 | 3 | Mozambique | Africa | 855 | 16 |
| 4 | 4 | Russia | C&E Europe | 16139 | 17 |
| 5 | 5 | Ukraine | C&E Europe | 7271 | 16 |\n", "