Nova Resource:Copypatrol


Project Name copypatrol

CopyPatrol

Description

CopyPatrol is a tool that allows you to see recent Wikipedia edits that are flagged as possible copyright violations.

Project status

currently running

Contact address

This page documents how to create a VPS instance of CopyPatrol. The setup consists of four components: the frontend web server (Apache), the frontend application (PHP/Symfony), the backend web server (nginx), and the backend application (Python).

Frontend

Web server

Install Apache, PHP, and the required PHP extensions:

sudo apt -y install php php-common php-cli php-fpm php-json php-xml php-intl php-curl php-apcu php-mysql apache2 libapache2-mod-php cron

Create the web server configuration file at /etc/apache2/sites-available/copypatrol.conf with the following:

<VirtualHost *:80>
        DocumentRoot /var/www/public
        ServerName copypatrol.wmcloud.org

        # Requests with these user agents are denied.
        SetEnvIfNoCase User-Agent "(uCrawler|Baiduspider|CCBot|scrapy\.org|kinshoobot|YisouSpider|Sogou web spider|yandex\.com\/bots|twitterbot|TweetmemeBot|SeznamBot|datasift\.com\/bot|Googlebot|Yahoo! Slurp|Python-urllib|BehloolBot|MJ13bot|SemrushBot|facebookexternalhit|rcdtokyo\.com|Pcore-HTTP|yacybot|ltx71|RyteBot|bingbot|python-requests|Cloudflare-AMP|Mr\.4x3|MSIE 7\.0; AOL 9\.5|Acoo Browser|AcooBrowser|MSIE 6\.0; Windows NT 5\.1; SV1; QQDownload|\.NET CLR 2\.0\.50727|MSIE 7\.0; Windows NT 5\.1; Trident\/4\.0; SV1; QQDownload|Frontera|tigerbot|Slackbot|Discordbot|LinkedInBot|BLEXBot|filterdb\.iss\.net|SemanticScholarBot|FemtosearchBot|BrandVerity|Zuuk crawler|archive\.org_bot|mediawords bot|Qwantify\/Bleriot|Pinterestbot|EarwigBot|Citoid \(Wikimedia|GuzzleHttp|PageFreezer|Java\/|SiteCheckerBot|Re\-re Studio|^R \(|GoogleDocs|WinHTTP|cis455crawler|WhatsApp|Archive\-It|lua\-resty\-http|crawler4j|libcurl|dygg\-robot|GarlikCrawler|Gluten Free Crawler|WordPress|Paracrawl|7Siters|Microsoft Office Excel|MJ12bot|AhrefsBot|dotbot|amp-cloud|naver\.me\/spd|Adsbot|linkfluence|coccocbot|sqlmap)" bad_bot=yes

        CustomLog ${APACHE_LOG_DIR}/access.log combined expr=!(reqenv('bad_bot')=='yes'||reqenv('dontlog')=='yes')
        CustomLog ${APACHE_LOG_DIR}/denied.log combined expr=(reqenv('bad_bot')=='yes')
        ErrorLog ${APACHE_LOG_DIR}/error.log

        <Directory /var/www/public/>
             Options Indexes FollowSymLinks
             AllowOverride All
             Require all granted
             DirectoryIndex index.php
             RewriteEngine On
             RewriteRule ^index\.php$ - [L]
             RewriteCond %{REQUEST_FILENAME} !-f
             RewriteCond %{REQUEST_FILENAME} !-d
             RewriteRule . /index.php [L]
        </Directory>

        <Directory /var/www/>
                Options Indexes FollowSymLinks
                AllowOverride None
                Require all granted
                Deny from env=bad_bot
        </Directory>

        ErrorDocument 403 "Access denied"
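        # Enable the rewrite engine before declaring any rules.
        RewriteEngine On

        # Deny requests whose Referer is a page served from 127.0.0.1 on port 5500 or 8002.
        RewriteCond "%{HTTP_REFERER}" "^http://127\.0\.0\.1:(5500|8002)/index\.html" [NC]
        RewriteRule .* - [R=403,L]
        # Deny requests whose User-Agent starts with "wget" or "Wget".
        RewriteCond "%{HTTP_USER_AGENT}" "^[Ww]get"
        RewriteRule .* - [R=403,L]

        # Redirect to HTTPS unless the X-Forwarded-Proto header already reports https.
        RewriteCond %{HTTP:X-Forwarded-Proto} !https
        RewriteRule ^/?(.*) https://%{SERVER_NAME}/$1 [R=301,L]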
</VirtualHost>
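
To make the bad_bot rule easier to reason about, here is a minimal shell sketch that mimics how the SetEnvIfNoCase directive in the configuration above tests the User-Agent header: a case-insensitive, unanchored regular-expression match. It uses only a small subset of the alternatives from the configuration, and grep -E instead of Apache's own regex engine, so treat it as an approximation. The names is_bad_bot and BAD_BOT_SUBSET are hypothetical and exist only for this illustration.

```shell
#!/bin/sh
# Subset of the User-Agent alternatives from the vhost config (illustration only).
BAD_BOT_SUBSET='(Googlebot|bingbot|python-requests|AhrefsBot|sqlmap|Java/)'

# Return 0 if the given User-Agent string matches the subset, 1 otherwise.
# -E: extended regex, -i: case-insensitive (like SetEnvIfNoCase), -q: no output.
is_bad_bot() {
    printf '%s\n' "$1" | grep -Eiq -- "$BAD_BOT_SUBSET"
}

# Try a few sample User-Agent strings (made up for this demo).
for ua in \
    'Mozilla/5.0 (compatible; Googlebot/2.1)' \
    'python-requests/2.31.0' \
    'Mozilla/5.0 (X11; Linux x86_64) Firefox/128.0'
do
    if is_bad_bot "$ua"; then
        echo "denied:  $ua"
    else
        echo "allowed: $ua"
    fi
done
```

Because the match is unanchored, a pattern such as Googlebot is flagged wherever it appears in the header, not only at the start.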

Enable the PHP and rewrite Apache modules and the new site configuration, and disable the default site, which isn't used:

sudo a2enmod php8.2 rewrite
sudo a2ensite copypatrol
sudo a2dissite 000-default
sudo apache2ctl graceful

Application

Install dependencies:

sudo apt install composer

Remove the html/ directory created by Apache, then clone the repository into /var/www:

cd /var/www && sudo rm -rf html
sudo git clone https://github.com/wikimedia/CopyPatrol.git .

Copy .env to .env.local and adjust the values as needed, making sure to set APP_ENV to prod. You can also remove every line beginning with REPLICAS_HOST_ or REPLICAS_PORT_, as well as the TOOLSDB_HOST and TOOLSDB_PORT lines, because these are already properly defined in .env.prod.
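
The edits described above can be sketched as a small shell function. This is only one way to do it, under some assumptions: make_env_local is a hypothetical helper name, and the sample .env written by the demo is made up, so the real file in the repository will contain different keys and values. Any remaining settings still need to be reviewed by hand.

```shell
#!/bin/sh
# Write <dir>/.env.local from <dir>/.env:
#   - force APP_ENV to prod
#   - drop the settings that .env.prod already defines
make_env_local() {
    dir="$1"
    sed -e 's/^APP_ENV=.*/APP_ENV=prod/' \
        -e '/^REPLICAS_HOST_/d' \
        -e '/^REPLICAS_PORT_/d' \
        -e '/^TOOLSDB_HOST/d' \
        -e '/^TOOLSDB_PORT/d' \
        "$dir/.env" > "$dir/.env.local"
}

# Demo in a throwaway directory with a made-up .env (hypothetical values).
demo_dir=$(mktemp -d)
cat > "$demo_dir/.env" <<'EOF'
APP_ENV=dev
APP_SECRET=changeme
REPLICAS_HOST_S1=127.0.0.1
REPLICAS_PORT_S1=4711
TOOLSDB_HOST=127.0.0.1
TOOLSDB_PORT=4720
EOF
make_env_local "$demo_dir"
cat "$demo_dir/.env.local"
```

On the server itself the function would be pointed at /var/www and run with sufficient privileges, for example before the ownership of the files is handed back to www-data.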

Install the Composer packages and restore ownership of the files to www-data (the Apache user):

sudo composer install --no-dev -o
sudo chown -R www-data:www-data .

Add a cron job that updates the app whenever there is a new tagged release. Run sudo crontab -e -u www-data and add:

MAILTO=copypatrol@toolforge.org
*/10 * * * * /var/www/vendor/wikimedia/toolforge-bundle/bin/deploy.sh prod /var/www/
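
The cron entry above runs the deploy script every ten minutes. The actual logic lives in deploy.sh and is not reproduced here. As a rough illustration of the idea of "deploy only when a newer tag exists", the following sketch picks the highest tag from a list and compares it with the deployed one. The function names latest_tag and needs_update and the tag values are hypothetical, and the script assumes GNU sort for the -V (version sort) option.

```shell
#!/bin/sh
# Pick the highest version-like tag from a newline-separated list on stdin.
# Relies on GNU sort's -V (version sort), so 1.10.0 sorts after 1.2.0.
latest_tag() {
    sort -V | tail -n 1
}

# Return 0 (update needed) when a newest tag exists and differs from the deployed one.
needs_update() {
    deployed="$1"
    newest="$2"
    [ -n "$newest" ] && [ "$deployed" != "$newest" ]
}

# Hypothetical tag names; on a real checkout they could come from `git tag`.
tags='1.0.0
1.2.0
1.10.0'
newest=$(printf '%s\n' "$tags" | latest_tag)

if needs_update '1.2.0' "$newest"; then
    echo "would deploy $newest"
else
    echo "already up to date"
fi
```

A plain lexical sort would rank 1.2.0 above 1.10.0, which is why the sketch uses version sort.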

Backend

See https://github.com/JJMC89/copypatrol-backend/tree/main/.vps.

Server admin log

2024-10-31

  • 06:50 JJMC89: copypatrol-backend-prod-01 sudo /var/www/.venv/bin/pip install --upgrade --upgrade-strategy eager pip setuptools wheel /var/www/copypatrol-backend # apply security updates

2024-08-15

  • 17:01 JJMC89: copypatrol-backend-prod-01 deploy ed212c4..81ec727

2024-07-22

  • 16:58 andrewbogott: upgrading db servers copypatrol-prod-db-01 and copypatrol-dev-db-01 to latest Trove guest image

2024-06-23

  • 00:31 JJMC89: copypatrol-backend-prod-01 deploy f61d2c0..ed212c4

2024-06-21

  • 03:03 andrew@cloudcumin1001: END (PASS) - Cookbook wmcs.o...