Bill West 128b3cf3df Update README.md		2025-03-23 17:06:54 -06:00
common	Release links for remaining desired courses	2025-03-23 16:34:10 -06:00
infra	Sort infra stuff	2025-03-22 18:14:26 -06:00
usip-conflict-sensitivity-peacebuilding	Update scrape schema	2025-03-23 16:05:13 -06:00
usip-consolidation-paix	Update scrape schema	2025-03-23 16:05:13 -06:00
usip-cr-i	Update scrapes based on new sitemap	2025-03-22 18:14:47 -06:00
usip-cr-ii	Update scrape schema	2025-03-23 16:05:13 -06:00
usip-cr-iii	Update scrape schema	2025-03-23 16:05:13 -06:00
usip-cultural-synergy	Update scrape schema	2025-03-23 16:05:13 -06:00
usip-design-monitor-eval	Update scrape schema	2025-03-23 16:05:13 -06:00
usip-designing-community-dialogue	Update scrape schema	2025-03-23 16:05:13 -06:00
usip-dialogo-basado-comunidad	Update scrape schema	2025-03-23 16:05:13 -06:00
usip-diseno-implementacion-estrategica	Update scrape schema	2025-03-23 16:05:13 -06:00
usip-engagement-religieux-consolida	Update scrape schema	2025-03-23 16:05:13 -06:00
usip-gov-2	Update scrape schema	2025-03-23 16:05:13 -06:00
usip-governance-2	Update scrape schema	2025-03-23 16:05:13 -06:00
usip-governance-guiding-principles	Update scrape schema	2025-03-23 16:05:13 -06:00
usip-intro-peacebuilding	Update scrape schema	2025-03-23 16:05:13 -06:00
usip-intro-peacebuilding-armenian	Update scrape schema	2025-03-23 16:05:13 -06:00
usip-intro-peacebuilding-turkish	Update scrape schema	2025-03-23 16:05:13 -06:00
usip-intro-reconciliation	Update scrape schema	2025-03-23 16:05:13 -06:00
usip-media-arts-peace	Update scrape schema	2025-03-23 16:05:13 -06:00
usip-mediacion-para-paz	Update scrape schema	2025-03-23 16:05:13 -06:00
usip-negotiation-conflict-landscape	Update scrape schema	2025-03-23 16:05:13 -06:00
usip-nonviolent-action	Update scrape schema	2025-03-23 16:05:13 -06:00
usip-prev-election-violence	Update scrape schema	2025-03-23 16:05:13 -06:00
usip-psychosocial-responsive-devel	Update scrape schema	2025-03-23 16:05:13 -06:00
usip-religion-peacebuilding-intro	Update scrape schema	2025-03-23 16:05:13 -06:00
usip-religions-beliefs-human-rights	Update scrape schema	2025-03-23 16:05:13 -06:00
usip-religious-engagement-peacebuilding	Update scrape schema	2025-03-23 16:05:13 -06:00
usip-rule-of-law	Update scrape schema	2025-03-23 16:05:13 -06:00
usip-security-sector-governance-reform	Update scrape schema	2025-03-23 16:05:13 -06:00
usip-snap	Update scrape schema	2025-03-23 16:05:13 -06:00
usip-snap-espanol	Update scrape schema	2025-03-23 16:05:13 -06:00
usip-ssgr-assessment-design-eval	Update scrape schema	2025-03-23 16:05:13 -06:00
usip-systems-thinking	Update scrape schema	2025-03-23 16:05:13 -06:00
usip-un-civil-mil-coordination	Update scrape schema	2025-03-23 16:05:13 -06:00
usip-youth-led-peacebuilding	Update scrape schema	2025-03-23 16:05:13 -06:00
.gitignore	Update README and deps	2025-03-23 16:33:37 -06:00
assemble.py	Update assembly indexing/link construction	2025-03-23 16:05:49 -06:00
media-scrape.py	Make minor tweaks to scrape/upload	2025-03-23 16:33:48 -06:00
README.md	Update README.md	2025-03-23 17:06:54 -06:00
requirements.txt	Update README and deps	2025-03-23 16:33:37 -06:00
sitemap.json	Update scrape schema	2025-03-23 16:05:13 -06:00
styles.css	Update assembly indexing/link construction	2025-03-23 16:05:49 -06:00
upload.py	Make minor tweaks to scrape/upload	2025-03-23 16:33:48 -06:00

README.md

usip-scrape

Scrapes courses from the USIP Gandhi-King Global Academy. To be used in concert with Open Web Scraper (a Chrome plugin).

Setup

Install Chrome or Chromium Browser

The required browser extension only works with Chrome

Install the Web Scraper - Free Extension

Get it from the Chrome Web Store. Optionally, watch the intro video to get the gist of the workflow.

Install python3 and venv/dependencies

Make sure python3 is installed on your system
Make sure python3-pip is installed: python3 -m ensurepip --upgrade
Ensure python3-venv is installed: python3 -m pip install --user virtualenv
Create a virtualenv: python3 -m venv ./venv
Activate the virtualenv: source venv/bin/activate
Install the dependencies: pip install -r requirements.txt

Install ffmpeg for Video Conversion

ffmpeg is a command-line tool that acts as a Swiss Army knife for media conversion. It is used as a post-processor after media scraping, so make sure it is installed.
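
A quick way to confirm ffmpeg is reachable before starting a long scrape is to look it up on PATH. This is a minimal sketch, not part of the repo's scripts; the helper name find_ffmpeg is hypothetical.

```python
import shutil


def find_ffmpeg(name="ffmpeg"):
    """Return the full path of the executable, or None if it is not on PATH."""
    return shutil.which(name)


# Report the result without stopping, so this is safe to run anywhere.
path = find_ffmpeg()
if path is None:
    print("ffmpeg not found on PATH; install it before converting media")
else:
    print(f"ffmpeg found at {path}")
```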

Course Scrape Procedure

Navigate to your USIP GK Academy Course

Make a note of the root URL of the course content (for example, https://usip-global-campus.mn.co/spaces/3223718/content); you will need it later.

Open the Chrome Developer Console

Right-click / Inspect

Open the Web Scraper Tab

In the developer console, if the plugin is installed there should be a tab/view for "Web Scraper". Click on it.

Open the submenu "Create new sitemap / Import Sitemap"

In the menu, be sure to select "Import Sitemap"

Paste the contents of sitemap.json from this repo

Paste the sitemap.json contents into "Sitemap JSON", ensure the "Sitemap name" is usip-sections, and click the "Import Sitemap" button.

Edit the sitemap metadata

Select the imported sitemap, then click on the menu "Sitemap usip-sections / Edit metadata". Paste the course URL you saved earlier into "Start URL 1" and then click "Save Sitemap".
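
As a possible alternative to editing the metadata in the browser, the start URL could be set in the JSON before pasting it in. The sketch below assumes the sitemap stores its name under "_id" and its start URLs as a list under "startUrl"; check those key names against the actual sitemap.json before relying on it. The function with_start_url and the stand-in sitemap are hypothetical.

```python
import json


def with_start_url(sitemap, course_url):
    """Return a copy of the sitemap dict with its start URL replaced."""
    updated = dict(sitemap)
    updated["startUrl"] = [course_url]  # assumed key name; verify in sitemap.json
    return updated


# Tiny stand-in for the real sitemap.json contents (not the repo's file).
example = {"_id": "usip-sections", "startUrl": ["https://example.invalid/"], "selectors": []}
course_url = "https://usip-global-campus.mn.co/spaces/3223718/content"
print(json.dumps(with_start_url(example, course_url), indent=2))
```

To apply this to the real file, load it with json.load, pass the result through with_start_url, and paste the printed JSON into "Sitemap JSON".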

Run the scrape

Go to the submenu "Sitemap usip-sections / Scrape". Select longer intervals (e.g. 3000 or 5000 ms) to limit the impact on the site during scraping; this will make the overall scrape take longer. When ready, click the "Start Scraping" button. Another window will open. Let it run without disturbing it. Chrome will send a notification when the scrape finishes successfully.

Export the scrape data

The post-processing scripts will expect a CSV copy of the scrape data. Once the scrape has finished (not before) go to the "Sitemap usip-sections / Export data" submenu. Select "CSV" to download the file. Keep track of the resulting download; it will be needed for post-processing (pulling down copies of embedded media and reassembling the page as static content).
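
Before moving on to post-processing, it can help to confirm that the export is not empty or truncated. The following is a minimal sketch using only the standard library; summarize_csv is a hypothetical helper and the sample columns are placeholders, since the real column names come from the selectors in sitemap.json.

```python
import csv
import io


def summarize_csv(handle):
    """Return (column_names, row_count) for a CSV file-like object."""
    reader = csv.DictReader(handle)
    rows = sum(1 for _ in reader)  # consume the data rows to count them
    return list(reader.fieldnames or []), rows


# Stand-in data so the sketch runs without a real export.
sample = io.StringIO("col_a,col_b\n1,2\n3,4\n")
columns, count = summarize_csv(sample)
print(columns, count)
```

For a real export, open the downloaded file with open("usip-sections.csv", newline="", encoding="utf-8") and pass the handle to summarize_csv.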

Media Download Procedure

The first post-processing script, media-scrape.py, tries to pull copies of embedded/linked media from YouTube, SoundCloud, and Google Drive.

Go to the course directory (e.g. ./usip-cultural-synergy) and run the script from there:

cd ./usip-cultural-synergy
../media-scrape.py usip-sections.csv

This will take a while since it's done slowly and sequentially to prevent getting rate-limited by the content platforms. Files will end up in static/posts/downloads.
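
The slow, sequential approach described above can be illustrated with a short loop that pauses between requests and keeps going when one item fails. This is a generic sketch, not the code in media-scrape.py; fetch_sequentially, its parameters, and the stand-in fetch function are all hypothetical.

```python
import time


def fetch_sequentially(urls, fetch, delay_seconds=5.0, sleep=time.sleep):
    """Call fetch(url) for each URL in order, pausing between calls.

    Returns a list of (url, error) pairs for the URLs that failed.
    """
    failures = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # pause between requests, not before the first
        try:
            fetch(url)
        except Exception as exc:  # keep going so one bad link does not stop the run
            failures.append((url, exc))
    return failures


# Demonstration with a fake fetch and a fake sleep, so nothing touches the network.
fetched, pauses = [], []


def fake_fetch(url):
    if url == "bad":
        raise ValueError("download failed")
    fetched.append(url)


failed = fetch_sequentially(["a", "bad", "c"], fake_fetch, delay_seconds=2, sleep=pauses.append)
print(fetched, pauses, [url for url, _ in failed])
```

Passing sleep as a parameter keeps the loop testable without real delays.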

Video Conversion

Run this after media-scrape.py finishes to convert mkv files to mp4 and then delete the mkv files:

# Convert every mkv to mp4 (-n never overwrites an existing mp4)
( cd static/posts/downloads; for f in *.mkv; do ffmpeg -n -i "./$f" -c:v libx264 -crf 23 -preset fast -c:a aac -b:a 128k "./${f%.mkv}.mp4"; done )
# Delete the original mkv files
find static/posts/downloads -name "*.mkv" -exec rm {} \;
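
The cleanup step above removes every mkv file regardless of whether its conversion worked. A more cautious variant, sketched here in Python, deletes each mkv only after ffmpeg reports success and the mp4 exists. It reuses the same ffmpeg options as the shell loop; ffmpeg_command and convert_all are hypothetical names and are not part of the repo.

```python
import subprocess
from pathlib import Path


def ffmpeg_command(source, target):
    """Build the same ffmpeg invocation as the shell loop above."""
    return [
        "ffmpeg", "-n", "-i", str(source),
        "-c:v", "libx264", "-crf", "23", "-preset", "fast",
        "-c:a", "aac", "-b:a", "128k", str(target),
    ]


def convert_all(download_dir):
    """Convert each .mkv to .mp4, deleting the .mkv only if ffmpeg succeeded."""
    for source in sorted(Path(download_dir).glob("*.mkv")):
        target = source.with_suffix(".mp4")
        result = subprocess.run(ffmpeg_command(source, target))
        if result.returncode == 0 and target.exists():
            source.unlink()
        else:
            print(f"kept {source.name}: conversion did not succeed")


# Show the command for one file without running ffmpeg.
print(" ".join(ffmpeg_command("lecture.mkv", "lecture.mp4")))
```

From a course directory, this would be invoked as convert_all("static/posts/downloads").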

Static Page Assembly

This step:

creates complete static HTML files for each course section
replaces embedded media references with scraped media references (e.g. an mp4 file instead of YouTube, a PDF instead of Google Drive)
replaces linked media references with scraped media references
replaces USIP links with mirror links
generates a course index that links to each section

../assemble.py usip-sections.csv

There should be few or no errors about missing media. If there are some, the media may have failed to scrape, may be of a currently unsupported type, or may have been missing from the original page. An error is not necessarily a show-stopper, but each one should be investigated. Most fixes are made either manually or by modifying media-scrape.py or assemble.py to handle the outlier gracefully.
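
To make the "replace references" idea concrete, here is a minimal sketch that swaps remote URLs for local paths using plain string replacement. It is only an illustration: assemble.py may work quite differently (for example by parsing the HTML or replacing whole embed elements), and the relink function, the example URL, and the local path are all hypothetical.

```python
def relink(html, replacements):
    """Replace each remote URL in html with its local path.

    Longer URLs are replaced first so a URL that is a prefix of another
    does not clobber the longer match.
    """
    for remote in sorted(replacements, key=len, reverse=True):
        html = html.replace(remote, replacements[remote])
    return html


page = '<a href="https://drive.example/file/abc">Syllabus</a>'
mapping = {"https://drive.example/file/abc": "downloads/abc.pdf"}
print(relink(page, mapping))
```

Any URL left unchanged after such a pass is a candidate for the missing-media errors described above.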

Static Page Upload

Assembled pages can be uploaded to the storage bucket; this is essentially a content deployment. The upload covers everything in static/posts/ (including downloads).

../upload.py

Index Modification

The root common/index.html is maintained by hand; add new course links there when they are ready. The TTL on the index file is 5-60 minutes, so if you update it and want to see the results immediately, use an Incognito window or a hard refresh (usually Ctrl-F5) in your browser.

TODO

Most of the important stuff (video, audio, transcripts, course structure) is covered already but there are some improvements that could be made:

document infra (Cloudflare CDN, DNS, R2, Worker, secrets, and manual bucket upload/delete maintenance)
capture imgix content (USIP's chosen CDN), put it in the bucket, and re-link
deal with Google Books links
replicate surveys
replicate exams