Cleanup Wiki from Mass Spam

Recently I noticed that the wiki had been filled with many (>2000) new pages, all containing spam. I also noticed that about 20,000 users were registered, all of them spammers or accounts without contributions.

I decided to allow editing only for registered users and to deactivate automatic account creation. So I put this in LocalSettings.php:

$wgGroupPermissions['*']['edit'] = false;          # anonymous visitors can no longer edit
$wgGroupPermissions['*']['createaccount'] = false; # no self-service account creation

Since I found no useful extension to remove that many users and pages, I did some research at the database table level. MediaWiki provides a useful [database schema] for this purpose.

My MediaWiki database is the MySQL database wikidb, accessed as user root. For ease of use of the mysql command I created this .my.cnf file:

[mysql]
password=my_password
user=root
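
The mysql client picks this file up from the home directory; since it contains the root password it should not be readable by others. A quick sketch to lock it down and confirm the passwordless connection works (the count query is just an illustration):

chmod 600 ~/.my.cnf
mysql -D wikidb -e "select count(*) from user"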

Before any modification, do a backup (and check that the created file contains the expected SQL commands):

mkdir -p /backup/mediawiki
cd /backup/mediawiki
mysqldump -u root -p wikidb >mediawiki-2023-01-11-before-cleanup.sql
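
To spot-check that the dump really contains the expected SQL (a rough check; the counts will differ per installation):

grep -c "INSERT INTO" mediawiki-2023-01-11-before-cleanup.sql
tail -n 1 mediawiki-2023-01-11-before-cleanup.sql   # should read "-- Dump completed on ..."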

Since I am the only user with valuable content and my user id is 1, I came up with the SQL commands below to clean up. They worked well as far as I can tell.
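
Note that most of the commands filter on actor IDs (rev_actor, log_actor, ...) rather than user IDs (wl_user, up_user, ...), so it is worth verifying first that user id 1 really maps to actor id 1 (a quick check against the standard user and actor tables):

mysql -D wikidb -e "select user_id, user_name from user where user_id = 1"
mysql -D wikidb -e "select actor_id, actor_user, actor_name from actor where actor_user = 1"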

mysql -D wikidb -e "delete from watchlist where wl_user <> 1"
mysql -D wikidb -e "delete from ipblocks"
mysql -D wikidb -e "delete from logging where log_actor <> 1"
mysql -D wikidb -e "delete from recentchanges where rc_actor <> 1"
mysql -D wikidb -e "delete from user_properties where up_user <> 1"
mysql -D wikidb -e "delete from user where user_id <> 1"
mysql -D wikidb -e "delete from user_groups where ug_user <> 1"
mysql -D wikidb -e "delete from user_former_groups where ufg_user <> 1"
mysql -D wikidb -e "delete from actor where actor_user > 1"
mysql -D wikidb -e "delete from comment where comment_id in (select log_comment_id from logging where log_actor > 1)"
mysql -D wikidb -e "delete from watchlist where wl_user <> 1"
# mysql -D wikidb -e "delete from text where old_id = (select ar_text_id from archive where ar_actor <> 1)"
mysql -D wikidb -e "delete from archive where ar_actor <> 1"
mysql -D wikidb -e "delete from uploadstash where us_user <> 1"
mysql -D wikidb -e "delete from filearchive where fa_actor <> 1"
mysql -D wikidb -e "delete from image where img_actor <> 1"
mysql -D wikidb -e "delete from oldimage where oi_actor <> 1"
# mysql -D wikidb -e "delete from text where old_id = (select rev_text_id from revision where rev_actor <> 1)"
mysql -D wikidb -e "delete from revision where rev_actor <> 1"
mysql -D wikidb -e "delete from page where not exists (select 1 from revision r where page.page_id = r.rev_page)"
mysql -D wikidb -e "delete from page where page_id in (select rev_page from revision where rev_parent_id = 0 and rev_actor <> 1)"

This still needs fine-tuning: remove entries from the text table that have no internal use and no pages associated with them. The following query extracts page text for which no content ID of any revision slot belonging to a known good user exists.

mysql -D wikidb -e "select old_id, substring(old_text, 1, 100) from text where old_id > 1637 and old_id not in (select cast(substring(content_address, 4) as unsigned) from revision, slots, content where rev_id = slot_revision_id and slot_content_id = content_id and content_address like 'tt:%' and rev_actor in (select distinct rev_actor from revision, page where page_id = rev_page) )" | head

Once this returns just spam (also try it without the trailing head), convert it into a delete statement.
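
The converted delete could look like this (a sketch only; the old_id > 1637 cutoff is specific to my wiki, so double-check the select output before running it):

mysql -D wikidb -e "delete from text where old_id > 1637 and old_id not in (select cast(substring(content_address, 4) as unsigned) from revision, slots, content where rev_id = slot_revision_id and slot_content_id = content_id and content_address like 'tt:%' and rev_actor in (select distinct rev_actor from revision, page where page_id = rev_page))"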

The only traces left of the spammers are log entries from my earlier manual cleanup attempts. I can live with that :)

Sometimes single pages get lost after this procedure, so check all pages after the cleanup. If pages are lost: restore the backup (see below), save the edit content of the affected pages somewhere, repeat the cleanup, and then restore the content by hand.

mysql -e "drop database wikidb"
mysql -e "create database wikidb"
cat /backup/mediawiki/mediawiki-2023-01-11-before-cleanup.sql | mysql wikidb