Performing Exploratory Data Analysis on the OpenADMET ExpansionRx Blind Challenge Dataset


When working with a new dataset, many people jump straight into building a machine learning model. I prefer to start with exploratory data analysis (EDA) to gain a deeper understanding of the data. To that end, I created a notebook that performs initial EDA on the OpenADMET ExpansionRx Blind Challenge Dataset.

Instead of using Jupyter for this analysis, I’m using marimo, a new open source data science notebook environment that enables the creation of interactive data apps with minimal code. I think of marimo as a “better Jupyter” because it offers several features that simplify building interactive data apps, including built-in support for Altair charts, an enhanced table view, and interactive widgets.

I’m working on a repository titled “Practical Cheminformatics with marimo,” which demonstrates some ways to use marimo for cheminformatics tasks. This code should be ready in a couple of weeks. Please consider this notebook a preview of what’s to come. For those interested in learning more about marimo, I recommend starting with the following resources.

There are a few aspects of marimo that might confuse some long-time Jupyter users (including me):

  1. The output of a code cell in marimo appears above the cell in the notebook instead of below it, as in Jupyter.
  2. A marimo notebook is reactive, meaning that when you change the value of a variable, any code cells that depend on that variable will automatically update. This differs from Jupyter, where you need to manually re-run code cells to see the updated output. It also means that each global variable can be defined in only one cell of a marimo notebook; otherwise marimo could not tell which cell the dependent cells should follow.
  3. When you run a marimo notebook, it first checks if you have the necessary libraries installed. If not, marimo will ask if you’d like to install them and will install them for you. This makes it easy to share marimo notebooks with others without worrying about dependencies. This notebook has the dependencies inlined; if you run it using the --sandbox flag, marimo will create a sandboxed environment and automatically install the dependencies.
  4. When you open an existing marimo notebook, it runs the code in all the cells, which can take some time to start. Wait for the spinning hourglass in the upper left corner to disappear.
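
To make the second point more concrete, here is a toy sketch in plain Python of how a reactive notebook can decide which cells to re-run. This is not marimo's implementation. The cell names, the cell contents, and the helpers `defs_and_refs`, `build_graph`, and `cells_to_rerun` are all made up for illustration, and the name analysis is deliberately naive (for example, it treats comprehension variables as definitions).

```python
import ast


def defs_and_refs(source: str) -> tuple[set[str], set[str]]:
    """Return (names assigned, names read but not assigned) for one cell."""
    defs, refs = set(), set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Store):
                defs.add(node.id)
            else:
                refs.add(node.id)
    return defs, refs - defs


def build_graph(cells: dict[str, str]) -> dict[str, set[str]]:
    """Map each cell to the cells that read a variable it defines."""
    owner: dict[str, str] = {}  # variable name -> cell that defines it
    info = {}
    for name, source in cells.items():
        defs, refs = defs_and_refs(source)
        info[name] = (defs, refs)
        for var in defs:
            if var in owner:
                # Two defining cells would make the dependency ambiguous.
                raise ValueError(
                    f"{var!r} is defined in both {owner[var]!r} and {name!r}"
                )
            owner[var] = name
    children: dict[str, set[str]] = {name: set() for name in cells}
    for name, (_, refs) in info.items():
        for var in refs:
            if var in owner:  # skip builtins and other unknown names
                children[owner[var]].add(name)
    return children


def cells_to_rerun(children: dict[str, set[str]], changed: str) -> set[str]:
    """Collect every cell downstream of the changed cell."""
    stale: set[str] = set()
    stack = [changed]
    while stack:
        for child in children[stack.pop()]:
            if child not in stale:
                stale.add(child)
                stack.append(child)
    return stale


# Hypothetical notebook with four cells.
cells = {
    "load": "threshold = 5",
    "filter": "kept = max(threshold, 3)",
    "report": "summary = kept * 2",
    "other": "unrelated = 1",
}
graph = build_graph(cells)
print(sorted(cells_to_rerun(graph, "load")))  # ['filter', 'report']
```

Editing the `load` cell marks `filter` and `report` as stale but leaves `other` alone. Defining the same variable in two cells raises an error in this sketch, which mirrors why marimo insists on a single defining cell.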

Using the marimo notebook is easy. Just follow these steps.

1. Download the notebook from GitHub. Note that marimo notebooks are simply Python files with a .py extension. You can download the file from the command line with this command:


wget https://raw.githubusercontent.com/PatWalters/practical_cheminformatics_posts/refs/heads/main/expansion_data_exploration/openadmet_expansion_exploration.py

2. Install marimo and uv using the following command:


pip install uv marimo

3. Use the marimo command to run the notebook. This command installs all the dependencies and launches the marimo notebook in a sandboxed environment.


marimo edit openadmet_expansion_exploration.py --sandbox

4. Enjoy!

If you’re not feeling very adventurous, you can view an HTML version of the notebook here.

Where’s the Code?

The code and notebook for this post can be found in this GitHub repo: https://github.com/PatWalters/practical_cheminformatics_posts/tree/main/expansion_data_exploration

Acknowledgements

Thanks to Hugo MacDermott-Opeskin for testing the notebook. I wouldn’t have tried marimo if it weren’t for blogs by Eric Ma and Srijit Seal. Those guys are a constant source of inspiration. Thanks to Ramon Miranda-Quintana, Ignacio Pickering, and Kenneth Lopez Perez for their help with the BitBIRCH and BBLean clustering methods I used in the notebook. Special thanks to the marimo team. They’ve created an amazing tool, their support is fantastic, and Vincent’s videos are the best!