usip-scrape
Scrapes courses from the USIP Gandhi-King Global Academy. To be used in concert with Open Web Scraper (a Chrome plugin).
Setup
Install Chrome or Chromium Browser
The required browser extension only works with Chrome
Install the Web Scraper - Free Extension
Get it from the Chrome Web Store. Optionally, watch the intro video to get the gist of the workflow.
Install python3 and venv/dependencies
- Make sure python3 is installed on your system
- Make sure python3-pip is installed
  python3 -m ensurepip --upgrade
- Ensure python3-venv is installed
  python3 -m pip install --user virtualenv
- Create a virtualenv
  python3 -m venv ./venv
- Activate the virtualenv
  source venv/bin/activate
- Install the dependencies
  pip install -r requirements.txt
Install ffmpeg for Video Conversion
ffmpeg is a command-line tool that acts as a Swiss Army knife for media conversion. It is used as a post-processor after media scraping, so make sure it is installed.
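A quick way to confirm that ffmpeg is reachable before starting a long scrape is to look it up on PATH. This is a minimal sketch; the helper name `missing_tools` is hypothetical and not part of this repo.

```python
import shutil


def missing_tools(tools=("ffmpeg",)):
    """Return the names from `tools` that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


missing = missing_tools()
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("ffmpeg is available")
```

If ffmpeg is reported as missing, install it with your system's package manager before running the video conversion step.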
Course Scrape Procedure
Navigate to your USIP GK Academy Course
Make a note of the root URL of the course content, for example https://usip-global-campus.mn.co/spaces/3223718/content. You will need it later.
Open the Chrome Developer Console
Right-click / Inspect
Open the Web Scraper Tab
In the developer console, if the plugin is installed there should be a tab/view for "Web Scraper". Click on it.
Open the submenu "Create new sitemap / Import Sitemap"
In the menu, be sure to select "Import Sitemap"
Paste the contents of sitemap.json from this repo
Paste the sitemap.json contents into "Sitemap JSON"; ensure the "Sitemap name" is usip-sections and click the "Import Sitemap" button
Edit the sitemap metadata
Select the imported sitemap, then click on the menu "Sitemap usip-sections / Edit metadata". Paste the course URL you saved earlier into "Start URL 1" and then click "Save Sitemap"
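As an alternative to editing the metadata in the extension, the start URL could be set in the JSON text before it is pasted into the import dialog. The sketch below assumes the sitemap stores its start URL under a `startUrl` key holding a list, which is how Web Scraper exports are usually laid out; check this against the actual sitemap.json. The function name `with_start_url` is hypothetical.

```python
import json


def with_start_url(sitemap_text, course_url):
    """Return sitemap JSON text with its start URL replaced by course_url.

    Assumption: the sitemap keeps its start URL(s) in a "startUrl" list.
    All other keys are passed through unchanged.
    """
    sitemap = json.loads(sitemap_text)
    sitemap["startUrl"] = [course_url]
    return json.dumps(sitemap)
```

To use it, read sitemap.json, call `with_start_url(text, "https://usip-global-campus.mn.co/spaces/3223718/content")`, and paste the result into "Sitemap JSON".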
Run the scrape
Go to the submenu "Sitemap usip-sections / Scrape". Select longer intervals (e.g. 3000 or 5000 ms) to limit the load on the site during scraping; this makes the overall scrape take longer. When ready, click the "Start Scraping" button. Another window will open. Let it run without disturbing it. Chrome will send a notification when the scrape finishes successfully.
Export the scrape data
The post-processing scripts will expect a CSV copy of the scrape data. Once the scrape has finished (not before) go to the "Sitemap usip-sections / Export data" submenu. Select "CSV" to download the file. Keep track of the resulting download; it will be needed for post-processing (pulling down copies of embedded media and reassembling the page as static content).
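Before post-processing, it can help to sanity-check the exported CSV. The sketch below makes no assumption about the column names the sitemap produces; it only reports the row count, the header, and how many cells are empty per column. The function name `summarize_export` is hypothetical.

```python
import csv


def summarize_export(csv_file):
    """Summarize an open CSV file: row count, columns, empty cells per column."""
    reader = csv.DictReader(csv_file)
    columns = list(reader.fieldnames or [])
    empty = {name: 0 for name in columns}
    rows = 0
    for row in reader:
        rows += 1
        for name in columns:
            # Treat missing values and whitespace-only cells as empty.
            if not (row.get(name) or "").strip():
                empty[name] += 1
    return {"rows": rows, "columns": columns, "empty_cells": empty}
```

A typical call would open the download with `open("usip-sections.csv", newline="", encoding="utf-8-sig")` and pass the file object in; the `utf-8-sig` codec is a precaution that tolerates a byte-order mark if the export contains one. A row count of zero suggests the export was taken before the scrape finished.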
Media Download Procedure
The first post-processing script, media-scrape.py, tries to pull copies of embedded or linked media from YouTube, SoundCloud, and Google Drive.
Go to the course directory (e.g. ./usip-cultural-synergy) and run the script from there:
cd ./usip-cultural-synergy
../media-scrape.py usip-sections.csv
This takes a while: downloads run slowly and sequentially to avoid being rate-limited by the content platforms. Files end up in static/posts/downloads.
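The slow, sequential pattern described above can be sketched as a loop that pauses between items and records failures instead of stopping. This illustrates the general technique only, not the code in media-scrape.py; `process_sequentially`, `handler`, and the 5-second default are hypothetical.

```python
import time


def process_sequentially(items, handler, delay_seconds=5.0, sleep=time.sleep):
    """Call handler(item) one item at a time, pausing between calls.

    Returns a list of (item, result, error) tuples so that a single failed
    download does not abort the rest of the run.
    """
    results = []
    for index, item in enumerate(items):
        if index > 0:
            sleep(delay_seconds)  # pause between items, not before the first
        try:
            results.append((item, handler(item), None))
        except Exception as exc:
            results.append((item, None, exc))
    return results
```

The `sleep` parameter exists so the pause can be replaced in tests; in normal use it defaults to `time.sleep`.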
Video Conversion
Run this after media-scrape.py finishes to convert the mkv files to mp4 and then delete the mkv originals:
( cd static/posts/downloads; for f in *.mkv; do ffmpeg -n -i "./$f" -c:v libx264 -crf 23 -preset fast -c:a aac -b:a 128k "./${f%.mkv}.mp4"; done )
find static/posts/downloads -name "*.mkv" -exec rm {} \;
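The same conversion can be sketched in Python, which makes it easier to delete an mkv file only after its conversion succeeded. The ffmpeg options are copied from the shell loop above; the function names are hypothetical and this is not part of the repo.

```python
import subprocess
from pathlib import Path


def ffmpeg_command(mkv_path):
    """Build the ffmpeg argument list that converts one mkv file to mp4."""
    mkv_path = Path(mkv_path)
    mp4_path = mkv_path.with_suffix(".mp4")
    return [
        "ffmpeg", "-n",            # -n: never overwrite an existing output
        "-i", str(mkv_path),
        "-c:v", "libx264", "-crf", "23", "-preset", "fast",
        "-c:a", "aac", "-b:a", "128k",
        str(mp4_path),
    ]


def convert_all(download_dir="static/posts/downloads"):
    """Convert every mkv in download_dir; remove the mkv only on success."""
    for mkv in sorted(Path(download_dir).glob("*.mkv")):
        result = subprocess.run(ffmpeg_command(mkv))
        if result.returncode == 0:
            mkv.unlink()
```

Unlike the shell version, this keeps any mkv whose conversion failed, so the conversion can be retried later.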
Static Page Assembly
This step:
- creates complete static HTML files for each course section
- replaces embedded media references with scraped media references (e.g. use an mp4 file instead of YouTube, a PDF instead of Google Drive)
- replaces linked media references with scraped media references
- replaces USIP links with mirror links
- generates a course index that links to each section
../assemble.py usip-sections.csv
There should be few or no errors about missing media. If some appear, the media may have failed to scrape, may be of a currently unsupported type, or may have been missing from the original page. An error is not necessarily a show-stopper, but each one should be investigated. Most fixes are made either manually or by modifying media-scrape.py or assemble.py to handle the outlier gracefully.
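To illustrate how embedded references can be swapped for scraped copies while missing media is reported, here is a minimal sketch for YouTube embeds only. It is not the logic in assemble.py: the `<video id>.mp4` file naming, the `downloads/` prefix, and the function name are all assumptions made for the example.

```python
import re

# Matches standard YouTube embed URLs; video IDs are 11 characters long.
YOUTUBE_EMBED = re.compile(r"https://www\.youtube\.com/embed/([A-Za-z0-9_-]{11})")


def rewrite_youtube_embeds(html, available_files, prefix="downloads/"):
    """Point YouTube embed URLs at local mp4 files where one exists.

    Returns (new_html, missing_ids). URLs with no local copy are left
    untouched and their video IDs are collected in missing_ids.
    """
    missing = []

    def replace(match):
        filename = match.group(1) + ".mp4"  # assumed naming scheme
        if filename in available_files:
            return prefix + filename
        missing.append(match.group(1))
        return match.group(0)

    return YOUTUBE_EMBED.sub(replace, html), missing
```

This only swaps the URL; a full assembly step would also need to replace the surrounding iframe with a suitable video element. The returned `missing_ids` list corresponds to the kind of missing-media errors described above.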
Static Page Upload
Uploading the assembled pages to the storage bucket is essentially a content deployment. The script uploads everything in static/posts/ (including downloads).
../upload.py
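Because the upload is a deployment, it can be useful to preview what would be sent first. The sketch below only lists files and guesses content types; it does not talk to any bucket and is not how upload.py works. Keys relative to static/posts are an assumption, and `upload_manifest` is a hypothetical name.

```python
import mimetypes
from pathlib import Path


def upload_manifest(root="static/posts"):
    """List (key, content_type, size_in_bytes) for every file under root."""
    root = Path(root)
    manifest = []
    for path in sorted(root.rglob("*")):
        if path.is_file():
            key = path.relative_to(root).as_posix()  # assumed key layout
            content_type = mimetypes.guess_type(path.name)[0] or "application/octet-stream"
            manifest.append((key, content_type, path.stat().st_size))
    return manifest
```

Running `upload_manifest()` from a course directory and scanning the result for leftover mkv files or unexpectedly large entries is a cheap check before deploying.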
Index Modification
The root common/index.html is maintained by hand; add new course links there when the courses are ready. The TTL on the index file is 5-60 minutes, so if you update it and want to see the results immediately, use an Incognito window or a hard refresh (usually Ctrl-F5) in your browser.
TODO
The most important content (video, audio, transcripts, course structure) is already covered, but some improvements could still be made:
- document infra (Cloudflare CDN, DNS, R2, Worker, secrets and manual bucket upload/delete maintenance)
- capture imgix content (USIP's chosen CDN) and put it in the bucket, then re-link
- deal with Google Books links
- replicate surveys
- replicate exams