Podcast JSON Files

Summary

The files listed below were created from the podcast feeds listed in the Podcast Index database, which is available as a SQLite3 DB file download on the Podcast Index homepage. They cover podcasts that were updated between July 7, 2021 and October 7, 2021. I created these files to see what it would take to build a podcast search engine using the Typesense search engine software.

JSON File Downloads

The files are provided in bzip2 archives with approximately 200,000 files per archive.
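As a rough sketch of how an archive could be read without unpacking all of its files to disk, the snippet below streams members out one at a time. It assumes the archives are tar files compressed with bzip2 (.tar.bz2), which is not stated above; the function name iter_archive_members is hypothetical.

```python
import tarfile


def iter_archive_members(archive_path):
    """Yield (member_name, raw_bytes) for each regular file in the archive.

    Assumption: the archive is a bzip2-compressed tar file.
    """
    with tarfile.open(archive_path, mode="r:bz2") as tar:
        for member in tar:
            # Skip directories, links, and other non-file entries.
            if not member.isfile():
                continue
            fh = tar.extractfile(member)
            if fh is None:
                continue
            yield member.name, fh.read()
```

Reading members lazily like this keeps memory use bounded by the largest single file rather than by the whole archive.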

File Structure

There are two JSON files per podcast. The files were generated using the Python feedparser library and keep most of the structure feedparser provides "out of the box". Only minimal data manipulation was done to make the data more usable for search. Many fields are duplicated, mostly because of how messy the world of RSS can be and how feedparser parses some attributes. For real-world use, these files would ideally be leaner, with the unnecessary duplicate attributes removed.

The podcast file contains a single JSON object with the podcast information found within the main RSS channel tags.
The episodes file contains JSON objects for the podcast's episodes in the JSON Lines format, with one JSON object per episode on each line. The two files are linked together via the feed_link attribute.
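To illustrate how the two files fit together, here is a minimal sketch that loads a podcast file and collects its episodes from a JSON Lines file. It assumes feed_link appears as a top-level key in both the podcast object and each episode object; the function names and file paths are hypothetical.

```python
import json


def load_podcast(podcast_path):
    """Read the single JSON object stored in a podcast file."""
    with open(podcast_path, encoding="utf-8") as fh:
        return json.load(fh)


def iter_episodes(episodes_path):
    """Yield one episode dict per non-empty line of a JSON Lines file."""
    with open(episodes_path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                yield json.loads(line)


def episodes_for(podcast, episodes_path):
    """Return the episodes whose feed_link matches the podcast's feed_link.

    Assumption: feed_link is a top-level key in both kinds of object.
    """
    link = podcast.get("feed_link")
    return [ep for ep in iter_episodes(episodes_path) if ep.get("feed_link") == link]
```

Because the episodes file is read line by line, a podcast with thousands of episodes never has to be parsed as one large JSON document.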

Typesense Collections

In case it's helpful, below is a simple Python script that defines the Typesense collections I've been testing with. These collections will demand a lot of RAM if you import all the files above, as I'm indexing summary fields that can be relatively large. Some podcasts include full transcripts in their summary fields along with notes, links, etc. Note the variety of archive file sizes. Even though the archives all contain similar numbers of files, their sizes can differ drastically because of the number of episodes per podcast and the amount of content some podcasts provide compared with others.

typesense_exec.py