common Release links for remaining desired courses 2025-03-23 16:34:10 -06:00
infra Sort infra stuff 2025-03-22 18:14:26 -06:00
usip-conflict-sensitivity-peacebuilding Update scrape schema 2025-03-23 16:05:13 -06:00
usip-consolidation-paix Update scrape schema 2025-03-23 16:05:13 -06:00
usip-cr-i Update scrapes based on new sitemap 2025-03-22 18:14:47 -06:00
usip-cr-ii Update scrape schema 2025-03-23 16:05:13 -06:00
usip-cr-iii Update scrape schema 2025-03-23 16:05:13 -06:00
usip-cultural-synergy Update scrape schema 2025-03-23 16:05:13 -06:00
usip-design-monitor-eval Update scrape schema 2025-03-23 16:05:13 -06:00
usip-designing-community-dialogue Update scrape schema 2025-03-23 16:05:13 -06:00
usip-dialogo-basado-comunidad Update scrape schema 2025-03-23 16:05:13 -06:00
usip-diseno-implementacion-estrategica Update scrape schema 2025-03-23 16:05:13 -06:00
usip-engagement-religieux-consolida Update scrape schema 2025-03-23 16:05:13 -06:00
usip-gov-2 Update scrape schema 2025-03-23 16:05:13 -06:00
usip-governance-2 Update scrape schema 2025-03-23 16:05:13 -06:00
usip-governance-guiding-principles Update scrape schema 2025-03-23 16:05:13 -06:00
usip-intro-peacebuilding Update scrape schema 2025-03-23 16:05:13 -06:00
usip-intro-peacebuilding-armenian Update scrape schema 2025-03-23 16:05:13 -06:00
usip-intro-peacebuilding-turkish Update scrape schema 2025-03-23 16:05:13 -06:00
usip-intro-reconciliation Update scrape schema 2025-03-23 16:05:13 -06:00
usip-media-arts-peace Update scrape schema 2025-03-23 16:05:13 -06:00
usip-mediacion-para-paz Update scrape schema 2025-03-23 16:05:13 -06:00
usip-negotiation-conflict-landscape Update scrape schema 2025-03-23 16:05:13 -06:00
usip-nonviolent-action Update scrape schema 2025-03-23 16:05:13 -06:00
usip-prev-election-violence Update scrape schema 2025-03-23 16:05:13 -06:00
usip-psychosocial-responsive-devel Update scrape schema 2025-03-23 16:05:13 -06:00
usip-religion-peacebuilding-intro Update scrape schema 2025-03-23 16:05:13 -06:00
usip-religions-beliefs-human-rights Update scrape schema 2025-03-23 16:05:13 -06:00
usip-religious-engagement-peacebuilding Update scrape schema 2025-03-23 16:05:13 -06:00
usip-rule-of-law Update scrape schema 2025-03-23 16:05:13 -06:00
usip-security-sector-governance-reform Update scrape schema 2025-03-23 16:05:13 -06:00
usip-snap Update scrape schema 2025-03-23 16:05:13 -06:00
usip-snap-espanol Update scrape schema 2025-03-23 16:05:13 -06:00
usip-ssgr-assessment-design-eval Update scrape schema 2025-03-23 16:05:13 -06:00
usip-systems-thinking Update scrape schema 2025-03-23 16:05:13 -06:00
usip-un-civil-mil-coordination Update scrape schema 2025-03-23 16:05:13 -06:00
usip-youth-led-peacebuilding Update scrape schema 2025-03-23 16:05:13 -06:00
.gitignore Update README and deps 2025-03-23 16:33:37 -06:00
assemble.py Update assembly indexing/link construction 2025-03-23 16:05:49 -06:00
media-scrape.py Make minor tweaks to scrape/upload 2025-03-23 16:33:48 -06:00
README.md Update README.md 2025-03-23 17:06:54 -06:00
requirements.txt Update README and deps 2025-03-23 16:33:37 -06:00
sitemap.json Update scrape schema 2025-03-23 16:05:13 -06:00
styles.css Update assembly indexing/link construction 2025-03-23 16:05:49 -06:00
upload.py Make minor tweaks to scrape/upload 2025-03-23 16:33:48 -06:00

usip-scrape

Scrapes courses from the USIP Gandhi-King Global Academy. To be used in concert with Open Web Scraper (a Chrome plugin).

Setup

Install Chrome or Chromium Browser

The required browser extension only works with Chrome

Install the Web Scraper - Free Extension

Get it from the Chrome Web Store. Optionally, watch the intro video to get the gist of the workflow.

Install python3 and venv/dependencies

  • Make sure python3 is installed on your system
  • Make sure python3-pip is installed: python3 -m ensurepip --upgrade
  • Ensure python3-venv is installed: python -m pip install --user virtualenv
  • Create a virtualenv: python -m venv ./venv
  • Activate the virtualenv: source venv/bin/activate
  • Install the dependencies: pip install -r requirements.txt
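
To confirm that the virtualenv created in the steps above is the one actually in use, the interpreter can be asked directly. This is a minimal sketch; the helper name in_virtualenv is hypothetical and not part of this repo.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the running interpreter belongs to a venv/virtualenv."""
    # Inside a venv, sys.prefix points at the venv directory while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


if __name__ == "__main__":
    print("virtualenv active:", in_virtualenv())
    print("python version:", sys.version.split()[0])
```

If this reports False after running source venv/bin/activate, the shell is probably still picking up a different python.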

Install ffmpeg for Video Conversion

ffmpeg is a command-line tool that acts as a Swiss Army knife for media conversion. It is used as a post-processor after media scraping, so be sure it is installed.
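
A quick check that ffmpeg is reachable before starting a long scrape can save a failed post-processing run. The sketch below only looks the tool up on PATH; the helper name find_tool is hypothetical.

```python
import shutil
from typing import Optional


def find_tool(name: str = "ffmpeg") -> Optional[str]:
    """Return the path of an executable found on PATH, or None if absent."""
    return shutil.which(name)


if __name__ == "__main__":
    path = find_tool()
    if path:
        print(f"ffmpeg found at {path}")
    else:
        print("ffmpeg not found on PATH; install it before converting media")
```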

Course Scrape Procedure

Navigate to your USIP GK Academy Course

Make a note of the root URL of the course content, for example https://usip-global-campus.mn.co/spaces/3223718/content; you will need it later.
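
To catch a mistyped course URL before it goes into the sitemap, it can be checked against the shape of the example above. This is a sketch only: the /spaces/<number>/content pattern is inferred from that single example, and the function name parse_course_url is hypothetical.

```python
import re
from typing import Tuple
from urllib.parse import urlparse


def parse_course_url(url: str) -> Tuple[str, str]:
    """Split a course content URL into (host, space_id).

    Assumes the path looks like /spaces/<digits>/content, as in the
    example URL in this README; other layouts raise ValueError.
    """
    parts = urlparse(url)
    match = re.fullmatch(r"/spaces/(\d+)/content/?", parts.path)
    if parts.scheme != "https" or not match:
        raise ValueError(f"not a course content URL: {url!r}")
    return parts.netloc, match.group(1)


if __name__ == "__main__":
    print(parse_course_url("https://usip-global-campus.mn.co/spaces/3223718/content"))
```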

Open the Chrome Developer Console

Right-click / Inspect

Open the Web Scraper Tab

In the developer console, if the plugin is installed there should be a tab/view for "Web Scraper". Click on it.

Open the submenu "Create new sitemap / Import Sitemap"

In the menu, be sure to select "Import Sitemap"

Paste the contents of sitemap.json from this repo

Paste the sitemap.json contents into "Sitemap JSON"; ensure the "Sitemap name" is usip-sections and click the "Import Sitemap" button

Edit the sitemap metadata

Select the imported sitemap, then click on the menu "Sitemap usip-sections / Edit metadata". Paste the course URL you saved earlier into "Start URL 1" and then click "Save Sitemap"
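
As an alternative to editing the metadata by hand, the start URL could be set in the JSON before it is pasted into the import dialog. The sketch below assumes sitemap.json uses the "_id" and "startUrl" keys commonly seen in Web Scraper exports; check the file in this repo before relying on that. The function name with_start_url is hypothetical.

```python
import copy


def with_start_url(sitemap: dict, course_url: str, name: str = "usip-sections") -> dict:
    """Return a copy of the sitemap with its name and start URL set."""
    updated = copy.deepcopy(sitemap)      # leave the caller's dict untouched
    updated["_id"] = name                 # assumed key for the sitemap name
    updated["startUrl"] = [course_url]    # assumed key; a list of start URLs
    return updated


# Possible usage (paths are illustrative):
#   import json
#   with open("sitemap.json", encoding="utf-8") as fh:
#       patched = with_start_url(json.load(fh), "<your course URL>")
#   print(json.dumps(patched))
```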

Run the scrape

Go to the submenu "Sitemap usip-sections / Scrape". Select longer intervals (e.g. 3000 or 5000 ms) to limit the impact on the site during scraping; this makes the overall scrape take longer. When ready, click the "Start Scraping" button. Another window will open. Let it run without disturbing it. Chrome will send a notification when the scrape finishes successfully.

Export the scrape data

The post-processing scripts will expect a CSV copy of the scrape data. Once the scrape has finished (not before) go to the "Sitemap usip-sections / Export data" submenu. Select "CSV" to download the file. Keep track of the resulting download; it will be needed for post-processing (pulling down copies of embedded media and reassembling the page as static content).
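
Before moving on to post-processing, it can help to sanity-check the exported CSV: how many rows it has, which columns are present, and how many cells are empty. This is a generic sketch using the standard csv module; it does not assume any particular column names, and summarize_export is a hypothetical helper, not part of this repo.

```python
import csv
from typing import Dict, Iterable


def summarize_export(lines: Iterable[str]) -> Dict[str, object]:
    """Summarize a CSV export: row count, column names, empty cells per column."""
    reader = csv.DictReader(lines)
    rows = list(reader)
    columns = list(reader.fieldnames or [])
    empty_cells = {
        column: sum(1 for row in rows if not (row.get(column) or "").strip())
        for column in columns
    }
    return {"rows": len(rows), "columns": columns, "empty_cells": empty_cells}


# Possible usage; utf-8-sig strips a byte-order mark if the export has one:
#   with open("usip-sections.csv", newline="", encoding="utf-8-sig") as fh:
#       print(summarize_export(fh))
```

A row count of zero, or a column that is empty in every row, usually means the scrape should be rerun before any media is downloaded.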

Media Download Procedure

The first post-processing script, media-scrape.py, tries to pull copies of embedded/linked media from YouTube, SoundCloud, and Google Drive.

Go to the course directory (e.g. ./usip-cultural-synergy) and run the script from there:

cd ./usip-cultural-synergy
../media-scrape.py usip-sections.csv

This will take a while, since downloads run slowly and sequentially to avoid being rate-limited by the content platforms. Files end up in static/posts/downloads.
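
The slow, sequential approach described above can be illustrated with a small loop that pauses between downloads and collects failures instead of stopping. This is a sketch of the general pattern, not the actual logic of media-scrape.py; the function name, the fetch callback, and the 5-second default delay are all assumptions.

```python
import time
from typing import Callable, Iterable, List, Tuple


def run_sequentially(
    urls: Iterable[str],
    fetch: Callable[[str], None],
    delay_seconds: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[Tuple[str, str]]:
    """Call fetch(url) one URL at a time, pausing between calls.

    Returns a list of (url, error message) pairs for downloads that raised.
    """
    failures = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)          # pause before every call except the first
        try:
            fetch(url)
        except Exception as exc:          # keep going; report failures at the end
            failures.append((url, str(exc)))
    return failures
```

Passing sleep as a parameter keeps the loop easy to test without real waiting.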

Video Conversion

Run this after a media-scrape.py run to convert mkv files to mp4 and then remove the mkv files:

( cd static/posts/downloads; for f in *.mkv; do  ffmpeg -n -i "./$f" -c:v libx264 -crf 23 -preset fast -c:a aac -b:a 128k "./${f%.mkv}.mp4"; done )
find static/posts/downloads -name "*.mkv" -exec rm {} \;
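
The same conversion can be sketched in Python, which makes it easy to delete each mkv only after ffmpeg exits cleanly for that file. The ffmpeg options mirror the shell loop above; the function names are hypothetical, and deleting only on success is a design choice of this sketch rather than the behaviour of the shell commands.

```python
import subprocess
from pathlib import Path
from typing import List


def ffmpeg_command(source: Path) -> List[str]:
    """Build the ffmpeg invocation for one .mkv file, matching the shell loop."""
    target = source.with_suffix(".mp4")
    return [
        "ffmpeg", "-n",                   # -n: never overwrite an existing output
        "-i", str(source),
        "-c:v", "libx264", "-crf", "23", "-preset", "fast",
        "-c:a", "aac", "-b:a", "128k",
        str(target),
    ]


def convert_all(downloads: Path) -> List[Path]:
    """Convert every .mkv in `downloads`; return the sources that failed."""
    failed = []
    for source in sorted(downloads.glob("*.mkv")):
        result = subprocess.run(ffmpeg_command(source))
        if result.returncode == 0:
            source.unlink()               # remove the .mkv only after a clean exit
        else:
            failed.append(source)
    return failed


# Possible usage from a course directory:
#   print(convert_all(Path("static/posts/downloads")))
```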

Static Page Assembly

This step:

  • creates complete static html files for each course section
  • replaces embedded media references with scraped media references (e.g. use mp4 file instead of youtube, pdf instead of google drive)
  • replaces linked media references with scraped media references
  • replaces usip links with mirror links
  • generates a course index that links to each section

Run it from the course directory:

../assemble.py usip-sections.csv

There should be few or no errors about missing media. If there are some, the media may have failed to scrape, may be of a currently unsupported type, or may have been missing from the original page. An error is not necessarily a show-stopper, but each one should be investigated. Most fixes are made either manually or by modifying media-scrape.py or assemble.py to handle the outlier gracefully.
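
The replacement steps listed above, and the missing-media errors they can produce, follow a simple pattern: look at each src/href value, swap it when a local copy is known, and report it when it points at a media host with no local copy. The sketch below illustrates that pattern only; it is not how assemble.py is implemented, and the host list, function names, and file paths are assumptions.

```python
import re
from typing import Dict, List, Tuple
from urllib.parse import urlparse

# Assumed list of hosts whose links should have local copies.
MEDIA_HOSTS = ("youtube.com", "youtu.be", "soundcloud.com", "drive.google.com")
ATTRIBUTE = re.compile(r'\b(src|href)="([^"]*)"')


def is_media_url(url: str) -> bool:
    """Return True when the URL's host is one of MEDIA_HOSTS or a subdomain."""
    host = urlparse(url).netloc.lower()
    return any(host == h or host.endswith("." + h) for h in MEDIA_HOSTS)


def rewrite_media_links(html: str, local_copies: Dict[str, str]) -> Tuple[str, List[str]]:
    """Swap known remote media URLs for local paths.

    Returns the rewritten HTML and the media URLs that had no local copy.
    """
    missing: List[str] = []

    def swap(match):
        attribute, url = match.group(1), match.group(2)
        if url in local_copies:
            return f'{attribute}="{local_copies[url]}"'
        if is_media_url(url):
            missing.append(url)           # referenced on the page, but not scraped
        return match.group(0)             # leave everything else unchanged

    return ATTRIBUTE.sub(swap, html), missing
```

A real assembly step would likely also change the surrounding tag (for example an iframe into a video element); this sketch only rewrites the attribute value.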

Static Page Upload

Assembled pages can be uploaded to the storage bucket; this is essentially a content deployment. This will upload everything in static/posts/ (including downloads) to the storage bucket.

../upload.py
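
To preview what a deployment would send before running the upload, the files under static/posts/ can be listed with the object key and content type each would get. This sketch only builds the list and performs no upload; it does not reflect how upload.py works, and the function names and key layout are assumptions.

```python
import mimetypes
from pathlib import Path
from typing import List, Tuple


def content_type_for(name: str) -> str:
    """Guess a Content-Type from the file name, with a binary fallback."""
    guessed, _encoding = mimetypes.guess_type(name)
    return guessed or "application/octet-stream"


def plan_uploads(root: Path) -> List[Tuple[str, str]]:
    """Return (object key, content type) for every file under `root`."""
    plan = []
    for path in sorted(root.rglob("*")):
        if path.is_file():
            key = path.relative_to(root).as_posix()   # forward slashes for keys
            plan.append((key, content_type_for(path.name)))
    return plan


# Possible usage from a course directory:
#   for key, content_type in plan_uploads(Path("static/posts")):
#       print(content_type, key)
```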

Index Modification

The root common/index.html is updated manually; add new course links there when they are ready. The TTL on the index file is 5-60 minutes, so if you update it and want to see the results immediately, use an Incognito window or a hard refresh (usually Ctrl-F5) in your browser.

TODO

Most of the important stuff (video, audio, transcripts, course structure) is covered already but there are some improvements that could be made:

  • document infra (Cloudflare CDN, DNS, R2, Worker, secrets and manual bucket upload/delete maintenance)
  • capture imgix content (USIP's chosen CDN), put it in the bucket, and re-link
  • deal with Google Books links
  • replicate surveys
  • replicate exams