Perun: Difference between revisions
Latest revision as of 07:00, 6 December 2023
LXC container, connected to the internet and used for downloading everything. It has readwrite access to relevant mountpoints and has wget, yt-dlp, ffmpeg, rsync and lftp as downloader/converter software installed.
/home/user/rsseth.py
Completely automatic ETHZ video lecture recording downloader. It reads dir;url lines from /mnt/video/lectures/urls.csv, where dir is a directory under /mnt/video/lectures and url is an RSS feed of an ETHZ video lecture series. It parses the XML and extracts the enclosure:links to the .mp4 files and their pubDate. Each video is then downloaded to /mnt/video/lectures/{dir}/%Y-%m-%d.mp4, named after its pubDate. It prints a last line reporting the work done: the number of feeds fetched, files detected (mentioned in the feeds but already downloaded) and new files downloaded.
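The feed-parsing step can be sketched roughly as follows. This is a minimal illustration, not the script's actual code: the function name parse_feed is hypothetical, and it assumes standard RSS 2.0 items where the file link is the url attribute of <enclosure> and pubDate uses the usual RFC 822 date format.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime


def parse_feed(xml_text):
    """Return (target_filename, video_url) pairs for every <item> with an enclosure.

    Hypothetical helper: assumes RSS 2.0 items with <enclosure url=...> and <pubDate>.
    """
    root = ET.fromstring(xml_text)
    results = []
    for item in root.iter("item"):
        enclosure = item.find("enclosure")
        pub_date = item.findtext("pubDate")
        if enclosure is None or pub_date is None:
            continue  # skip items without a downloadable file or without a date
        published = parsedate_to_datetime(pub_date.strip())
        # The target file name follows the %Y-%m-%d.mp4 pattern described above.
        results.append((published.strftime("%Y-%m-%d.mp4"), enclosure.get("url")))
    return results
```

A file whose target name already exists under /mnt/video/lectures/{dir} would then count as detected, and any other as a new download.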
Some lectures require user authentication. For that, a my_cookie variable holds the login. When requesting the json urls, it adds the Cookie: and User-Agent: <some browser> headers. Those lectures are sometimes pathological and will not work over RSS. The fallback is to fetch a json endpoint that seems to give similar information, though on a per-episode basis. Those responses are then cached in /mnt/video/lectures/{dir}/json/{episode_id}.json, and they contain the pubDate and url of the actual mp4 file.
/home/user/git.sh
Fetches archived git repositories, runs automatically. Contains the urls of the repos to fetch inline in a heredoc. If a repo is new, git clone it. Else git pull it. To then display those in Lada#Gitea, see there for the slightly clumsy import process.
/home/user/repos.sh
Fetches software repositories to /mnt/software/{repo_class} for offline (or pre-fetched) availability on a higher speed local connection:
- Through rsync: fetches the entire archlinux and artixlinux package repositories. Also fetches database files allowing completely functional package mirror operations, see Lada#Arch and artix package mirror.
- Through wget: fetches the archzfs repo.
/home/user/osmdl.py
Fetches the newest .osc.gz daily changefiles from openstreetmap for the osm container into /mnt/maps/tmp/, to be processed by Osm#Data_updates. The latest already downloaded changefile is identified by its sequence number, which is written to /mnt/maps/state.txt. The state number therefore refers to which changefiles have already been downloaded. Whether they are in the database is indicated by whether they have been removed from /mnt/maps/tmp. Because this file presence or absence shows whether a changefile was successfully imported to the database, the maps dataset is a subdataset of dbp, which hosts the entire database: a recursive snapshot always contains a self-consistent state of the data. See Osm#Filesystem for details.
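As a rough illustration of working with sequence numbers, the sketch below maps a sequence number to the relative path of its changefile. It assumes the layout used by the public OpenStreetMap replication server, where the zero-padded nine-digit number is split into three directory levels. The function names are hypothetical and the script's real logic may differ.

```python
def changefile_path(sequence):
    """Map a replication sequence number to its relative path.

    Assumed layout: nine zero-padded digits split as AAA/BBB/CCC.osc.gz.
    """
    padded = f"{sequence:09d}"
    return f"{padded[0:3]}/{padded[3:6]}/{padded[6:9]}.osc.gz"


def pending_changefiles(local_sequence, remote_sequence):
    """List the paths of all changefiles newer than the locally recorded sequence number."""
    return [changefile_path(n) for n in range(local_sequence + 1, remote_sequence + 1)]
```

Here local_sequence stands for the number recorded in /mnt/maps/state.txt and remote_sequence for the newest one published upstream.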
/home/user/yt.py
For regularly downloading youtube videos to be displayed in Lada#Youtube. Because the Perun container has passwordless network readwrite access to Lada#db_videos, this script is entirely self-contained for adding or refreshing information on youtube videos, channels or playlists. Uses the python module yt_dlp and ffmpeg as main dependencies. The regularly scheduled task is as follows:
1. Download newly uploaded videos to /mnt/youtube/{channel_id}/{video_id}.mkv. On network errors, keep retrying.
2. Move thumbnails to /mnt/icache/youtube/{channel_id}/{video_id}.{img_fmt}.
3. Clean up temporary download files like .part and pre-transcoded files like .webm. Those were kept by yt-dlp to allow for resuming downloads instead of restarting them when the internet is very slow or unreliable. By this point, they will not be useful anymore.
4. Trigger a database rescan:
   - Look at all video files in /mnt/youtube that are not in the database yet.
   - Add them by reading from the deduced .info.json, and deduce the thumbnail from the corresponding .{img_fmt}.
   - Generate same-size and same-aspect-ratio (640x360p) thumbnails to /mnt/icache/youtube/{channel_id}/{video_id}.360p.{img_fmt}, with ffmpeg.
   - For any special video files like .1080p.mkv, add them to their parent video as altvideos.
5. Fetch sponsorblock segments of all recently added videos.
6. Download .html pages of videos which do not yet have them, then parse that html to extract cards information and load it into the cards and cards_ranges tables in the db.
7. Check if 1080p versions are missing: download to /mnt/youtube/{channel_id}/{video_id}.1080p.mkv if the video necessitates a smaller 1080p version. See Lada#Youtube for these database specifics.
8. Rescan the database to add those .1080p videos.
9. Again, like step 3, remove .part and similar temporary files.
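The file classification behind the rescan and cleanup steps could look roughly like the sketch below. It is only an illustration under assumptions: the function name scan_channel_dir, the known_ids set (standing in for a database lookup) and the exact list of temporary suffixes are all hypothetical.

```python
from pathlib import Path

# Assumed suffixes of leftovers that the cleanup steps remove.
TEMP_SUFFIXES = (".part", ".webm")
ALT_SUFFIX = ".1080p.mkv"


def scan_channel_dir(channel_dir, known_ids):
    """Classify the files of one /mnt/youtube/{channel_id} directory.

    Returns (new_video_ids, altvideo_parent_ids, leftover_file_names).
    known_ids is a hypothetical stand-in for "video ids already in the database".
    """
    new_videos, alt_videos, leftovers = [], [], []
    for path in sorted(Path(channel_dir).iterdir()):
        name = path.name
        if name.endswith(TEMP_SUFFIXES):
            leftovers.append(name)  # candidates for the cleanup steps
        elif name.endswith(ALT_SUFFIX):
            alt_videos.append(name[: -len(ALT_SUFFIX)])  # belongs to a parent video
        elif name.endswith(".mkv"):
            video_id = name[: -len(".mkv")]
            if video_id not in known_ids:
                new_videos.append(video_id)  # to be added from its .info.json
    return new_videos, alt_videos, leftovers
```

The alternative-resolution suffix is checked before the plain .mkv suffix, since every .1080p.mkv file also ends in .mkv.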
This main script also allows for maintenance tasks to be run, like:
- re-downloading all .info.json files of any video present in the database and updating that info.
- downloading playlists of channels and adding those to the database.
/home/user/podcasts.py
Multistep process to update podcast mirrors. Has database access on Lada#db audio. Performs the following fully automated steps:
- Fetch url, which it gets from dbname=audio podcasts:url.
- XML-parse the <item>s that are not yet in the database.
- Import those to the database and wget-download them to the correct location in /mnt/audio/{podcasts:isradio?"radio":"podcasts"}/{podcasts:directory}/{filename}.
- Deduce the youtube_ids of newly downloaded episodes, iff podcasts:yt_channel IS NOT NULL.
- For any recent youtube_ids, re-fetch the sponsorblock segments and load them into the sponsorblock database.
- Check the publication statuses of episodes: if they have not yet been published but have already been downloaded for some threshold (48h), publish them. This is to allow for sponsorblock segments to be submitted in that interval.
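Two of the steps above, choosing the download location and applying the 48h publication delay, can be sketched as follows. The function names and argument shapes are hypothetical, and the real script reads these values from the podcasts table rather than taking them as arguments.

```python
from datetime import datetime, timedelta
from pathlib import PurePosixPath

PUBLISH_DELAY = timedelta(hours=48)  # threshold mentioned in the text


def episode_path(is_radio, directory, filename):
    """Build /mnt/audio/{radio|podcasts}/{directory}/{filename}."""
    category = "radio" if is_radio else "podcasts"
    return str(PurePosixPath("/mnt/audio") / category / directory / filename)


def should_publish(published, downloaded_at, now):
    """True for an unpublished episode that was downloaded at least 48h ago."""
    return (not published) and (now - downloaded_at >= PUBLISH_DELAY)
```

For example, episode_path(False, "someshow", "ep1.mp3") gives /mnt/audio/podcasts/someshow/ep1.mp3, where someshow and ep1.mp3 are made-up values.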
This also requires the given podcast's data format quirks to be specified/hardcoded in /home/user/audio_podcasts_parse.py, which defines how XML keys should be translated or converted into data for the database: <summary> and <description> items are sometimes used differently by various podcasts. It also checks database constraints like string lengths or filesize in advance.
These quirks are not all trivial, for example:
- /home/user/sanitize_eschtml.py is used to parse HTML, remove attributes of some tags and remove some tags (while preserving children of those tags). It can be used to compress the descriptions as it removes cluttering attributes and <span> tags that have no attributes. This script takes escaped (& ⇔ &amp;, < ⇔ &lt;, > ⇔ &gt;) HTML as both input and output (that's how it's stored in the database).
- Filenames are not always unique if you just take basename(url). The workaround is to detect a hash-like parent directory in the url, hoping that is unique (it has always been), and then adding that to the filename: e.g. {hash}_{basename(url)=default_128.mp3}.
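The filename workaround might be sketched like this. What counts as "hash-like" is not specified in the text, so the pattern below (at least 16 hexadecimal characters) is purely an assumption, as is the function name.

```python
import re
from posixpath import basename, dirname
from urllib.parse import urlparse

# Assumption: "hash-like" means a directory name of 16 or more hex characters.
HASH_LIKE = re.compile(r"^[0-9a-fA-F]{16,}$")


def unique_filename(url):
    """Prefix basename(url) with its parent directory when that directory looks like a hash."""
    path = urlparse(url).path  # drops any query string or fragment
    name = basename(path)
    parent = basename(dirname(path))
    if HASH_LIKE.match(parent):
        return f"{parent}_{name}"
    return name
```

Urls without a hash-like parent directory keep their plain basename, so existing file names stay unchanged.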
Maybe planned is to move the parser's per-podcast storage to the database to make it cleaner (a podcasts_parse child_of(podcasts) table), but it is only for ~5 podcasts anyway, and storing python source code as strings would be clumsier than it is now; a custom format for key->function mappings is also a no-go.
/home/user/nasaimg.py
Python script to download NASA's astronomy picture of the day (apod). The production process at NASA takes full advantage of the html format with the following:
- description text with hyperlinks, to the rest of the internet and to past apods as well
- hyperlink to a high resolution version, on the image
- sometimes an <iframe> video instead of an image
- javascript that, on mouse hover over the image, changes the src=... to another copy of the image, with annotations
- javascript tracking link
The initial download of the html is done with wget, with settings to download dependencies as well. But this is not enough and a python-based html parser does the rest, checking href=... links and iframes. It then downloads those locally and rewrites the link in the .html, and also removes the javascript tracking.
A regex match also tries to find any of the onhover javascript .src=... changes and downloads those locally if they are images.
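A possible shape for that regex match is sketched below. The pattern, the list of image extensions and the function name are assumptions made for illustration, and the expression actually used in nasaimg.py may differ.

```python
import re

# Matches javascript assignments such as: document.imagename1.src='image/x.jpg'
SRC_ASSIGNMENT = re.compile(r"""\.src\s*=\s*["']([^"']+)["']""")
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif")  # assumed set of image types


def hover_images(html):
    """Return the distinct image urls assigned to .src in inline javascript, in order."""
    found = []
    for target in SRC_ASSIGNMENT.findall(html):
        if target.lower().endswith(IMAGE_EXTENSIONS) and target not in found:
            found.append(target)
    return found
```

Because the pattern requires a literal dot before src, ordinary <img src="..."> attributes are not matched, only property assignments made from javascript.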
Because the apods link to each other, a recursive download is also implemented. By default, a recursion depth of 4 starting from all apods after 2023-01-01 is run regularly. This means any apod link is locally available, up to a depth of 4.
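The depth-limited recursion can be illustrated with a breadth-first traversal. This is a generic sketch, not the script's code: crawl and get_links are hypothetical names, with get_links standing in for "download a page and return the apods it links to".

```python
from collections import deque


def crawl(start_pages, get_links, max_depth=4):
    """Visit pages breadth-first and return {page: depth} for everything within max_depth.

    get_links is a hypothetical callback returning the pages linked from a given page.
    """
    depth_of = {page: 0 for page in start_pages}
    queue = deque(start_pages)
    while queue:
        page = queue.popleft()
        if depth_of[page] >= max_depth:
            continue  # do not follow links beyond the depth limit
        for linked in get_links(page):
            if linked not in depth_of:  # each page is fetched only once
                depth_of[linked] = depth_of[page] + 1
                queue.append(linked)
    return depth_of
```

Tracking visited pages matters here because apods frequently link back to each other, which would otherwise cause repeated downloads.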