I started the project in the autumn of 2017 when I could not find a good source of podcasts... or rather, I could find some podcasts but could not search and sift …
So I started to index podcast networks and podcast providers to create a searchable database of what was out there. In the first round I got some 30000 unique feeds indexed, then stumbled over more sources and added them, reaching 50000 and later 80000 feeds. The equipment I was running on was a few old desktop machines from the early 2000s: a 1 GHz single-core machine with 1 GB of memory for scraping and a similar machine for the database. With that I was able to index some 15000-25000 feeds a day.
Now, in August 2019, I’m at 560000 feeds and have upgraded my equipment to a slightly more modern setup: 3.2 GHz with 4 GB for the scraper and 3.2 GHz with 8 GB for the database. I index about 80000 to 120000 feeds a day, so the whole index is refreshed in just a few days. Update: This blog post has been marinating for a few months, and after constant updates, code changes, optimizations and tweaks I’m currently at over 830000 active feeds and able to index some 100000+ a day. The system is still not as fully developed as I want it to be, but with the workload distributed over all my servers and better indexing methods, I’m slowly getting there.
The typical way of listening to podcasts is to use the manufacturer-provided podcast application on a smartphone. There you are presented with a curated set of popular podcasts and, if you explore a little, are suggested other podcasts arranged in categories. There are other means of consuming podcasts as well: separate apps on your smartphone or home/car stereo equipment provide access to their own curated directories, also sectioned off into categories.
One thing these ways of consuming podcasts have in common is that they never or rarely expose the podcast source; the feed is hidden away from the user/listener. On top of that, if you want to listen to some fringe, non-mainstream podcast that is not part of their curated subset of all available podcasts, you cannot add it either. Walled gardens.
Concentration of power. Apple iTunes is a walled garden. Google Podcasts is a walled garden. Stitcher is a walled garden. Spotify is a walled garden. Anchor.fm is a walled garden. None of these are available to you unless you “buy in”: get a subscription and/or install their apps or own a certain type of hardware to consume the provided content. Other sources of podcasts require you to register an account and log in to be able to search or browse their directory/catalog. I consider these to be walled gardens as well, as they won’t allow any access unless you are logged in: no drive-by subscription to a podcast or podcasts.
If any one of these would go away, whether it goes tits-up, gets bought or just disappears, you would have trouble finding the podcasts you have subscribed to, as some podcast producers focus on one single outlet, a walled garden. Even though the actual feed is hosted somewhere publicly available, it is often just presented as “find us on iTunes” or some other walled garden.
Sure, you can export your selection of podcast feeds if you really want to, but again, these are only the ones you have already found in the subsets they present. The exports are either completely proprietary formats or open OPML, where OPML is the most portable option.
What I want to achieve with project “Podmix” is to open up the podcast sphere to everyone, with a focus on audio podcasts. No walls and no filters: you should be able to search and sift, and no suggestions should pull you in any direction. If you are looking for “murders” and “mysteries”, you get search results related only to what you actually searched for. Opening up and exposing all available podcasts is a gain for everyone, producers and consumers.
Producers need to get indexed only once, and as long as they post at least one episode every 12 months, they remain part of the index.
Consumers can search, pick and collect channels/feeds as they please and, when done, get an OPML file to pull these feeds into any podcast catcher application capable of importing OPML.
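To make the OPML step a bit more concrete, here is a minimal sketch of what such an export could look like. It is not the actual Podmix code: the function name `build_opml`, the default title and the example.org feed URLs are made up for illustration, and it only assumes the common convention of one `outline` element per feed with `text` and `xmlUrl` attributes.

```python
import xml.etree.ElementTree as ET


def build_opml(feeds, title="Podcast subscriptions"):
    """Build an OPML document from (name, feed_url) pairs.

    Hypothetical helper for illustration; not taken from Podmix.
    """
    opml = ET.Element("opml", version="2.0")
    head = ET.SubElement(opml, "head")
    ET.SubElement(head, "title").text = title
    body = ET.SubElement(opml, "body")
    for name, url in feeds:
        # One <outline> per feed; ElementTree escapes special characters.
        ET.SubElement(body, "outline", type="rss", text=name, xmlUrl=url)
    # Prepend the XML declaration by hand so the output does not depend
    # on the local default encoding.
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + ET.tostring(
        opml, encoding="unicode"
    )


if __name__ == "__main__":
    # Placeholder feeds, not real podcasts.
    picked = [
        ("Example Show A", "https://example.org/a/feed.xml"),
        ("Example Show B", "https://example.org/b/feed.xml"),
    ]
    print(build_opml(picked))
```

Saving that output as a file with an .opml extension should be enough for most podcast catchers that offer an OPML import, although individual applications may expect extra attributes.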
At the moment there are podcasts in 112 distinct languages across about 830000 indexed “active” podcasts. These include: Abkhazian, Afar, Afrikaans, Akan, Albanian, Amharic, Arabic, Armenian, Assamese, Bambara, Basque, Belarusian, Bengali, Bosnian, Breton, Bulgarian, Burmese, Catalan, Central Khmer, Chamorro, Chinese, Croatian, Czech, Danish, Dutch, English, Esperanto, Estonian, Faroese, Finnish, French, Gaelic, Galician, Georgian, German, Greek, Guarani, Haitian, Hausa, Hebrew, Hindi, Hungarian, Icelandic, Indonesian, Interlingue, Irish, Italian, Japanese, Javanese, Kannada, Kashmiri, Kinyarwanda, Korean, Kurdish, Lao, Latin, Latvian, Lithuanian, Luba-Katanga, Luxembourgish, Macedonian, Malagasy, Malay, Malayalam, Marathi, Mongolian, Nepali, Northern Sami, Norwegian Bokmål, Norwegian Nynorsk, Norwegian, Oromo, Pashto, Persian, Polish, Portuguese, Punjabi, Quechua, Romanian, Russian, Sanskrit, Serbian, Sinhala, Slovak, Slovenian, Somali, South Ndebele, Spanish, Swahili, Swati, Swedish, Tagalog, Tajik, Tamil, Telugu, Thai, Tibetan, Tigrinya, Tsonga, Tswana, Turkish, Ukrainian, Urdu, Uzbek, Venda, Vietnamese, Welsh, Wolof, Xhosa, Yiddish, Yoruba, Zulu. By “active” I mean “a podcast feed that has produced at least one episode within the last 12 months with a MIME type that indicates it is audio”.
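The definition of “active” can be written down as a small check. The sketch below is an illustration only and not the indexer's real code: the name `is_active`, the shape of the episode data (pairs of publish time and MIME type) and the approximation of 12 months as 365 days are all assumptions.

```python
from datetime import datetime, timedelta, timezone

# Assumption: "12 months" is approximated as 365 days.
ACTIVE_WINDOW = timedelta(days=365)


def is_active(episodes, now=None):
    """Return True if any episode is recent enough and is audio.

    episodes: iterable of (published, mime_type) pairs, where published
    is a timezone-aware datetime and mime_type is the enclosure type,
    for example "audio/mpeg". Hypothetical helper for illustration.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    cutoff = now - ACTIVE_WINDOW
    return any(
        published >= cutoff and mime_type.lower().startswith("audio/")
        for published, mime_type in episodes
    )


if __name__ == "__main__":
    # A made-up episode published 30 days ago with an audio enclosure.
    recent = datetime.now(timezone.utc) - timedelta(days=30)
    print(is_active([(recent, "audio/mpeg")]))
```

A feed whose only recent episodes are video, or whose last audio episode is older than the window, would fall outside the index under this reading.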
Podcasts have no borders
There is constant work to correct and fix the indexed data, as producers mis-categorize their podcasts and/or do not set the language properly. There is also work to find safe ways of fetching feeds: while HTTPS is often available, many feeds are not declared or referenced with HTTPS, so I test and validate the use of HTTPS as far as possible. This is to keep your listening and subscriptions safe and hidden away from your telephony provider, ISP or government.
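One possible shape of such an HTTPS test, sketched under assumptions: the names `https_candidate` and `prefer_https` are invented here, and the actual check (fetching the HTTPS address and confirming it returns the same feed) is left to a caller-supplied `probe` function, since how Podmix validates a feed is not described in this post.

```python
from urllib.parse import urlsplit


def https_candidate(url):
    """Return the https:// variant of an http:// feed URL, or None.

    URLs with an explicit port are skipped, because the HTTPS port
    cannot be guessed from the HTTP one. Hypothetical helper.
    """
    parts = urlsplit(url)
    if parts.scheme != "http" or parts.port is not None:
        return None
    return parts._replace(scheme="https").geturl()


def prefer_https(url, probe):
    """Use the HTTPS variant only if probe(candidate) confirms it works.

    probe is any callable taking a URL and returning True when the feed
    could be fetched and validated over HTTPS (an assumption of this
    sketch; the real validation steps are not shown).
    """
    candidate = https_candidate(url)
    if candidate is not None and probe(candidate):
        return candidate
    return url


if __name__ == "__main__":
    # Stand-in probe that accepts everything; a real one would fetch the feed.
    print(prefer_https("http://example.org/feed.xml", lambda u: True))
```

Keeping the original address when the probe fails means a feed is never lost from the index just because its host does not serve HTTPS.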
There is some spin-off data available here: b19.se/data/
The difference between project “Podmix” and other podcast directories is that “Podmix” is focused exclusively on audio podcasts that are fresh and have published an episode within the last 12 months. Other directories have stale feeds that have been orphaned, have vanished or have just stopped, as in the feed is no longer reachable or last produced an episode in 2006.
There are of course exceptions to the rule: podcast feeds that do not meet the freshness requirement but are instead audio books in audio podcast shape. There are feeds available from LoyalBooks and LibriVox, and music from Magnatune.