Perun: Difference between revisions

From Personal wiki
(→‎/home/user/audio_podcasts_fetch2db.py: add new changes and remove deprecated features)
(→‎/software/git_clone.sh: rename and explain a bit more)

Revision as of 13:57, 7 January 2023

LXC container connected to the internet and used for all downloading. It has read-write access to the relevant mountpoints and has wget, yt-dlp, rsync and lftp installed as download software.

/video/lectures/updatefeeds.sh

Fully automatic downloader for ETHZ video lecture recordings. It reads dir;url lines from urls.csv, where dir is a directory under /video/lectures and url is the RSS feed of an ETHZ video lecture series. It fetches each RSS feed through the feed2exec python module, which converts every item to a csv line. Each line is then parsed to extract the published date in %Y-%m-%d format, and the video url is downloaded to {dir}/%Y-%m-%d.mp4. The last line of output reports the work done: the number of feeds fetched, files detected (mentioned in the feeds but already downloaded) and new files downloaded.
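
The mapping from a feed entry to its target file can be sketched roughly as below. This is not the actual script: the function names are hypothetical, the feed2exec step is left out, and it is assumed that the published date arrives in the usual RSS (RFC 822 style) format.

```python
import csv
import io
from email.utils import parsedate_to_datetime
from pathlib import PurePosixPath

LECTURE_ROOT = PurePosixPath("/video/lectures")  # root directory named in the text


def read_feed_list(text):
    """Parse `dir;url` lines into (dir, url) pairs, skipping blank lines."""
    reader = csv.reader(io.StringIO(text), delimiter=";")
    return [(row[0].strip(), row[1].strip()) for row in reader if len(row) >= 2]


def target_path(directory, pub_date):
    """Map a published date (assumed RFC 822 style) to {dir}/%Y-%m-%d.mp4."""
    day = parsedate_to_datetime(pub_date).strftime("%Y-%m-%d")
    return str(LECTURE_ROOT / directory / f"{day}.mp4")
```

A file counts as "detected" when target_path() already exists on disk, and as "new" when it has to be downloaded.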

/home/user/git.sh

Fetches archived git repositories and runs automatically. The urls of the repos to fetch are listed inline in a heredoc. If a repo is new, it is fetched with git clone; otherwise it is updated with git pull. To display the repos in Lada#Gitea, see that section for the slightly clumsy import process.
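
The clone-or-pull decision is simple enough to sketch. The real script is a shell script; the Python version below only illustrates the logic, and the function name and the way the directory name is derived from the url are assumptions.

```python
import os


def git_command(url, base_dir):
    """Return the git argv for one repository: clone if absent, else pull."""
    # Assumption: the local directory is named after the last url component.
    name = url.rstrip("/").rsplit("/", 1)[-1]
    if name.endswith(".git"):
        name = name[:-4]
    dest = os.path.join(base_dir, name)
    if os.path.isdir(dest):
        return ["git", "-C", dest, "pull"]
    return ["git", "clone", url, dest]
```

The returned list can be passed to subprocess.run(..., check=True) once per url.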

/software/syncrepo-template.sh

Fetches entire archlinux or artixlinux package repositories (selected by $1) through rsync. It also fetches the database files, which allows fully functional package mirror operation; see Lada#Arch and artix package mirror.

/home/user/osmdl.py

Fetches the newest .osc.gz daily changefiles from openstreetmap for the osm container into /mnt/maps/tmp/, to be processed by Osm#Data_updates. The latest already downloaded changefile is identified by its sequence number, which is written to /mnt/maps/state.txt. The state number therefore records which changefiles have already been downloaded. Whether they are in the database is indicated by whether they have been removed from /mnt/maps/tmp. Because the presence or absence of a file shows whether a changefile was successfully imported into the database, the maps dataset is a subdataset of dbp, which hosts the entire database: a recursive snapshot always contains a self-consistent state of the data. See Osm#Filesystem for details.
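
How a sequence number translates into changefiles to fetch can be sketched as follows. It is assumed here that the local state.txt uses the same sequenceNumber=... line as the upstream replication state files and that the official planet server is the source; the function names are hypothetical.

```python
OSM_DAY_BASE = "https://planet.openstreetmap.org/replication/day"  # assumed source


def parse_sequence(state_text):
    """Extract the sequence number from the contents of a state.txt file."""
    for line in state_text.splitlines():
        if line.startswith("sequenceNumber="):
            return int(line.split("=", 1)[1])
    raise ValueError("no sequenceNumber line found")


def changefile_url(seq):
    """Zero-pad to 9 digits and split into three directory levels."""
    digits = f"{seq:09d}"
    return f"{OSM_DAY_BASE}/{digits[0:3]}/{digits[3:6]}/{digits[6:9]}.osc.gz"


def pending_urls(local_seq, remote_seq):
    """Urls of all changefiles newer than local_seq, up to remote_seq."""
    return [changefile_url(s) for s in range(local_seq + 1, remote_seq + 1)]
```

After a successful download of the last url, remote_seq would be written back as the new local state.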

/home/user/yt.py

Regularly downloads youtube videos to be displayed in Lada#Youtube. Because the Perun container has passwordless read-write network access to Lada#db_videos, this script is entirely self-contained for adding or refreshing information on youtube videos, channels or playlists. Its main dependencies are the python module yt_dlp and ffmpeg. The regularly scheduled task is as follows:

  1. Download newly uploaded videos to /mnt/youtube/{channel_id}/{video_id}.mkv.
  2. Move thumbnails to /mnt/icache/youtube/{channel_id}/{video_id}.{img_fmt}.
  3. Clean up temporary download files like .part and pre-transcoded files like .webm. Those were kept by yt-dlp to allow for resuming downloads instead of restarting them when the internet is very slow or unreliable. By this point, they will not be useful anymore.
  4. Trigger a database rescan:
    1. Look at all video files in /mnt/youtube that are not in the database yet.
    2. Add them by reading the corresponding .info.json, and find the thumbnail from the corresponding .{img_fmt}.
    3. Generate uniformly sized, same-aspect-ratio (640x360) thumbnails to /mnt/icache/youtube/{channel_id}/{video_id}.360p.{img_fmt}, with ffmpeg.
    4. For any special video files like .1080p.mkv, add them to their parent video as altvideos.
  5. Check if 1080p versions are missing: download to /mnt/youtube/{channel_id}/{video_id}.1080p.mkv if the video needs a smaller 1080p version (that is, if it is natively larger than 1080p).
  6. Rescan database to add those .1080p videos.
  7. As in step 3, remove .part and similar temporary files.
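
The rescan in steps 4 and 6 relies on the file naming scheme alone. The sketch below shows how files could be split into main videos and altvideos; it leaves out yt_dlp, ffmpeg and the database, and the function names as well as the idea of passing the already known paths as a set are assumptions.

```python
from pathlib import PurePosixPath


def classify(path):
    """Split .../{channel_id}/{video_id}[.variant].mkv into its parts.

    Returns (channel_id, video_id, variant); variant is None for the main
    file and e.g. "1080p" for an alternative video.
    """
    p = PurePosixPath(path)
    stem = p.name[: -len(".mkv")]
    video_id, _, variant = stem.partition(".")
    return p.parent.name, video_id, variant or None


def plan_rescan(paths, known_paths):
    """Return (new main videos, new altvideos) for files not yet known."""
    new_videos, altvideos = [], []
    for path in sorted(set(paths) - set(known_paths)):
        if not path.endswith(".mkv"):
            continue  # thumbnails, .info.json and temporary files
        channel_id, video_id, variant = classify(path)
        if variant is None:
            new_videos.append((channel_id, video_id, path))
        else:
            altvideos.append((video_id, variant, path))
    return new_videos, altvideos
```

This depends on video ids never containing a dot, which holds for youtube ids.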

This main script also allows for maintenance tasks to be run, like:

  • re-downloading all .info.json files of any video present in the database and updating that info.
  • downloading playlists of channels and adding those to the database.

/home/user/audio_podcasts_fetch2db.py

Multistep process to update a podcast. Has database access to Lada#db audio. Fully automated, it:

  • Fetches url={dbname=audio:podcasts:url}.
  • XML-parses the <item>s that are not yet in the database.
  • Imports those into the database and wget-downloads them to the correct location in /mnt/audio/{dbname=audio:podcasts:isradio ? radio:podcasts}/{dbname=audio:podcasts:directory}/{filename}.
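
The "not yet in database" filter from the steps above might look like the sketch below. It assumes a standard RSS layout and uses the guid as the key for known episodes; the real script may identify episodes differently, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET


def new_items(rss_xml, known_guids):
    """Return dicts for <item>s whose guid is not in known_guids."""
    root = ET.fromstring(rss_xml)
    items = []
    for item in root.iter("item"):
        guid = item.findtext("guid")
        enclosure = item.find("enclosure")
        if guid is None or guid in known_guids or enclosure is None:
            continue  # already imported, or nothing to download
        items.append({
            "guid": guid,
            "title": item.findtext("title", default=""),
            "url": enclosure.get("url"),
        })
    return items
```

Each returned dict would then go through the per-podcast parser before the database insert and the download.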

This also requires the quirks of the given podcast's data format to be specified (hardcoded) in /home/user/audio_podcasts_parse.py, which defines how each XML key is translated or converted into data for the database. It also checks database constraints like length in advance.

These quirks are not all trivial, for example:

  • /home/user/sanitize_eschtml.py can parse HTML, remove attributes of some tags and remove some tags entirely (while preserving the children of those tags). It can be used to compress the descriptions, as it removes cluttering attributes and <span> tags that have no attributes. This script takes escaped (& as &amp;, < as &lt;, > as &gt;) HTML as both input and output (that's how it's stored in the database).
  • Filenames are not always unique if you take basename(url). The workaround is to detect a hash-like parent directory in the url, hoping that it is unique (it always has been), and then to add it to the filename: e.g. {hash}_{default_128.mp3}.
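
The filename workaround can be sketched like this. What counts as "hash-like" is an assumption here (a path component of at least 16 hexadecimal characters), and the function name is hypothetical.

```python
import re
from urllib.parse import urlparse

# Assumed definition of "hash-like": 16 or more hex characters.
HASH_LIKE = re.compile(r"^[0-9a-fA-F]{16,}$")


def unique_filename(url):
    """Return basename(url), prefixed with a hash-like parent directory if any."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    if not parts:
        raise ValueError("url has no path component")
    basename = parts[-1]
    # Search from the closest parent directory outwards.
    for parent in reversed(parts[:-1]):
        if HASH_LIKE.match(parent):
            return f"{parent}_{basename}"
    return basename
```

Urls without such a directory fall back to the plain basename, so podcasts with unique filenames are unaffected.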

This script is called in podcasts.sh to regularly fetch known podcasts (those with a specified parser) automatically. A possible plan is to move the parser's per-podcast storage to the database to make it cleaner (a podcasts_parse child_of(podcasts) table), but it concerns only ~5 podcasts anyway, and storing python source code as strings would be clumsier than the current setup; a custom format for key->function mappings is also a no-go.