IBepyProgrammer

File Theory

In this article, we will learn how to manage and work with different file formats using the Python programming language. These include CSV, Pickle, JSON, and PDF files.

1. CSV files

CSV stands for comma-separated values. Each line in a CSV file is one row of data, and the values within a row are separated by commas.

CSV files are plain text files that are commonly used to store tabular data, including large tables.

We can open CSV files using Notepad and other text editors. Excel can be used to analyze data from CSV files.

We can also do advanced analytics and visualization of CSV data using Python and libraries such as NumPy, Seaborn, Matplotlib, and pandas.

Comma-separated values files end with the extension .csv.

In this example, we will learn how to create and read CSV files using the Python programming language. We will:

  • Generate random values and save them to a CSV file

  • Load data from the CSV file

  • Use libraries in Python to work with CSV files

Getting started with CSV files

  1. To generate a CSV file in Python, we first import the NumPy library. In my example, I will be using Jupyter Notebook.

  2. In the Jupyter Notebook cell, import numpy as follows:


import numpy as np

Note that if you do not have the NumPy library installed, you can install it by running pip install numpy in your terminal or command prompt.

  3. Let's start by generating data that contains 100 rows and 5 columns. We do this with the imported numpy library as follows:

generated_data = np.random.random((100, 5))

  4. We can then save the data in CSV format:

np.savetxt("table_data.csv", generated_data, fmt="%.2f", delimiter=",", header="title1,title2,title3,title4,title5")

  5. We now have a CSV file called table_data.csv with 100 rows and 5 columns.

  6. We can read the CSV file with either the numpy library or the pandas library. In this example, we will use numpy.


read_file = np.loadtxt("table_data.csv", delimiter=",")

read_file[:5, :]

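  • The output for the cell above will be the first five rows of the file we generated using numpy. np.savetxt writes the header as a comment line starting with "#", which np.loadtxt skips by default, so only the numeric rows are loaded.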
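As an alternative to the NumPy approach above, the same write-then-read round trip can be sketched with Python's built-in csv module, which needs no extra installation. This is only a minimal sketch: the file name table_data_stdlib.csv and the temporary directory are assumptions made for the illustration, and the header is written as an ordinary first row rather than as a comment line.

```python
import csv
import os
import random
import tempfile

# Hypothetical column names and file location, chosen only for this sketch.
header = ["title1", "title2", "title3", "title4", "title5"]
csv_path = os.path.join(tempfile.mkdtemp(), "table_data_stdlib.csv")

# Build 100 rows of 5 random floats, rounded to 2 decimal places.
rows = [[round(random.random(), 2) for _ in header] for _ in range(100)]

# newline="" lets the csv module control line endings itself.
with open(csv_path, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(header)   # first line: column titles
    writer.writerows(rows)    # remaining lines: the data

# csv.reader yields each row as a list of strings, so convert back to float.
with open(csv_path, newline="") as f:
    reader = csv.reader(f)
    read_header = next(reader)
    read_rows = [[float(value) for value in row] for row in reader]

print(read_header)
print(read_rows[:5])  # first five rows, like read_file[:5, :] above
```

The csv module returns plain lists of strings, so it suits small files and simple scripts; for numeric work on large tables, NumPy or pandas is usually more convenient.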

I have provided a link to the Jupyter Notebook of the working code above. https://github.com/IBepyProgrammer/File-theory/blob/main/CSV file theory.ipynb

2. Pickle files

Pickle files in Python are files that store serialized objects created with the "pickle" module, which provides a way to serialize and deserialize objects. Serialization converts a Python object into a byte stream, and deserialization is the reverse process: reconstructing the original object from a byte stream.

Here's a brief overview of how pickle files work:

  1. Serialization: Pickle serializes Python objects into a binary format, which can be stored in a file or transmitted over a network. The serialized data is a byte stream that represents the state of the original object.

  2. Deserialization: Pickle can also deserialize the byte stream back into a Python object, allowing you to reconstruct the original object with its state intact.
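The two steps above can be sketched in memory, without touching a file, using pickle.dumps and pickle.loads. The inventory dictionary is a made-up example object used only for this illustration.

```python
import pickle

# Hypothetical example object; the tuple shows that Python-specific types survive.
inventory = {"apples": 5, "bananas": 10, "tags": ("fresh", "local")}

# Serialization: Python object -> byte stream.
byte_stream = pickle.dumps(inventory)

# Deserialization: byte stream -> new Python object with the same state.
restored = pickle.loads(byte_stream)

print(type(byte_stream))      # the serialized form is a bytes object
print(restored == inventory)  # same contents
print(restored is inventory)  # but a separate object in memory
```

pickle.dump and pickle.load do the same work but write to and read from an open binary file instead of a bytes object.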

The following steps show how to use pickle to write an object to a file and read it back:

  1. We first import the pickle module in Python.

import pickle

In this example, a dictionary (my_grocery) is serialized and written to a file named "grocery.plk". Later, the file is read, and the data is deserialized back into a Python object (read_grocery), which is then printed as follows:


my_grocery = {"apples":5, "bananas":10, "cabbages":3, "mangoes":7, "tomatoes":10}

with open("grocery.plk", "wb") as f:
    pickle.dump(my_grocery, f)

  2. We can then work with the data stored in the pickle file "grocery.plk". To do this, we load the data from the file using the "load" function from the pickle module.

with open("grocery.plk", "rb") as f:
    read_grocery = pickle.load(f)

print(read_grocery)

Keep in mind the following points about pickle files:

  • Pickle is specific to Python, and the serialized data is not human-readable or compatible with other programming languages.

  • Pickle should be used with caution when loading data from untrusted sources, as it can execute arbitrary code during deserialization, leading to security risks.

  • For more human-readable and cross-language compatible data interchange, consider using formats like JSON or XML.
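To make the security point above concrete, here is a small, harmless sketch of how a pickled object can make the loader call a function. The class name Sneaky is hypothetical, and the built-in len stands in for whatever function an attacker might choose; a real malicious payload could name a far more dangerous callable.

```python
import pickle

class Sneaky:
    """Hypothetical class that tells pickle how to 'rebuild' it."""

    def __reduce__(self):
        # pickle stores this (callable, arguments) pair and calls it at load time.
        # A harmless built-in is used here purely for illustration.
        return (len, ("abc",))

payload = pickle.dumps(Sneaky())

# Loading does not give back a Sneaky instance: it runs len("abc") instead.
result = pickle.loads(payload)
print(result)
```

Because the function call happens inside pickle.loads itself, inspecting the result afterwards is too late, which is why pickle files from unknown sources should not be loaded at all.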

In summary, pickle files are a convenient way to serialize and deserialize Python objects, providing a means of storing and retrieving complex data structures.

I have provided a link to the Jupyter Notebook of the working code above. https://github.com/IBepyProgrammer/File-theory/blob/main/Pickle file theory.ipynb

3. JSON files (JavaScript Object Notation)

JavaScript Object Notation, or JSON, is a lightweight data-interchange format that is easy for humans to read and write and easy for machines to parse and generate. In Python, the built-in json module is used to work with JSON data.

In this example we will demonstrate how you can work with JSON files in Python:

  1. Encoding: Convert a Python object into JSON-formatted text. We begin by importing the json module and creating a sample dictionary, which is then encoded as JSON and written to a file.

import json

school = {

    "school_of": "Science_Technology",

    "faculty": "Applied_science",

    "departments": {

        "physics": "Optics",

        "Chemistry": "industrial_chemistry"

    },

    "years":[

        "freshman",

        "sophomore",

        "Junior",

        "Senior"

    ],

    "numbers":[1,2,3,4],

    "id":[123,234,456,566]

}

with open("university.json", "w") as f:
    json.dump(school, f)

  2. Decoding: Convert JSON-formatted text back into a Python object by loading the JSON file using the "load" function in the json module.

with open("university.json", "r") as f:
    read_school = json.load(f)

print(read_school)

In the example above:

  • json.dump() encodes a Python object as JSON and writes it to a file.

  • json.load() reads JSON from a file and decodes it into a Python object.
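The json module also has string-based counterparts, json.dumps() and json.loads(), which are handy when no file is involved. The sketch below uses a made-up record dictionary to show the round trip and how Python values map onto JSON types.

```python
import json

# Hypothetical record used only for this illustration.
record = {
    "faculty": "Applied_science",
    "numbers": (1, 2, 3, 4),  # a tuple, which JSON has no equivalent for
    "active": True,
    "dean": None,
}

# Encoding: Python object -> JSON-formatted string.
json_text = json.dumps(record, indent=2, sort_keys=True)
print(json_text)

# Decoding: JSON-formatted string -> Python object.
decoded = json.loads(json_text)
print(decoded)

# The tuple was written as a JSON array, so it comes back as a list.
print(type(decoded["numbers"]))
```

In the JSON text, True appears as true and None as null. Unlike pickle, the round trip does not always return the exact original types, which is the price of a format that other languages can read.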

JSON is a widely used data interchange format and is not tied to any specific programming language. It's readable and easy to understand, making it a popular choice for data exchange between different systems and platforms. JSON is often used as an alternative to pickle when working with Python and external systems due to its cross-language compatibility and simplicity.

I have provided a link to the Jupyter Notebook of the working code above. https://github.com/IBepyProgrammer/File-theory/blob/main/JSON file theory.ipynb

4. Managing PDF files using Python

PDF stands for portable document format.

PDF files have the extension .pdf, so a file name consists of the document name followed by .pdf.

To work with PDF files in Python we use the PyPDF2 library.

We will use this library to:

  • Extract document information

  • Split and Merge documents page by page

  • Crop pages

  • Merge multiple pages into a single page

  • Encrypt and decrypt PDF files

To install the module, we can use pip. We then import it to begin using it.


pip install PyPDF2


import PyPDF2

We use the PyPDF2 module to open and read the PDF document.


my_pdf = open("sample.pdf", "rb")

read_document = PyPDF2.PdfReader(my_pdf)

We can display the number of pages in the PDF document.


print(len(read_document.pages))

We can extract text from specific pages and manipulate it using Python.


page = read_document.pages[0]

print(page.extract_text())

To close the document, you can use the method demonstrated in the cell below.


my_pdf.close()

Conclusion

In this short article, we learned how to work with CSV, JSON, Pickle, and PDF files in Python. In future articles, we will dive deeper into file theory and learn how to manipulate data from these files using different Python libraries such as Matplotlib, pandas, and Seaborn.

If you found this article helpful, consider subscribing to Ibepyprogrammer and sharing.

Thank you.
