Cleanup Wiki from Mass Spam
Recently I noticed that the wiki had been filled with many (>2000) new pages, all containing spam. I also noticed that about 20,000 users were registered, all of them spammers or without any contributions.
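To get a feel for the scale first, counting rows is a quick check (a minimal sketch, assuming the wikidb database and root user described below):

mysql -u root -p -D wikidb -e "select count(*) from page"
mysql -u root -p -D wikidb -e "select count(*) from user"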
I decided to allow editing only for registered users and to deactivate automatic account registration. So I put this in LocalSettings.php:
$wgGroupPermissions['*']['edit'] = false;
$wgGroupPermissions['*']['createaccount'] = false;
Since I found no useful plugin to remove that many users and pages, I did some research at the database table level. MediaWiki provides a useful database schema (http://upload.wikimedia.org/wikipedia/commons/thumb/3/33/MediaWiki_database_schema_1-19_%28r102798%29.svg/2193px-MediaWiki_database_schema_1-19_%28r102798%29.svg.png) for this purpose.
My MediaWiki database is the MySQL database wikidb, and the database user is root. For ease of use of the mysql command I created this .my.cnf file:
[mysql]
password=my_password
user=root
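With that in place, the plain mysql command can also be used to look at the tables directly (a quick sketch; the schema diagram linked above remains the authoritative reference):

mysql -D wikidb -e "show tables"
mysql -D wikidb -e "describe user"
mysql -D wikidb -e "describe revision"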
Before any modification, do a backup (and check that the created file contains the expected SQL commands):
mkdir -p /backup/mediawiki
cd /backup/mediawiki
mysqldump -u root -p wikidb >mediawiki-2023-01-11-before-cleanup.sql
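A quick plausibility check on the dump might look like this (a sketch; the exact size and counts will of course vary):

ls -lh mediawiki-2023-01-11-before-cleanup.sql
grep -c "INSERT INTO" mediawiki-2023-01-11-before-cleanup.sql
head -n 20 mediawiki-2023-01-11-before-cleanup.sql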
Since I am the only user with valuable content and my user id is 1, I came up with these SQL commands to clean up. They worked well as far as I can tell:
mysql -D wikidb -e "delete from watchlist where wl_user <> 1"
mysql -D wikidb -e "delete from ipblocks"
# delete the spammers' comments while the spam log entries still reference them
mysql -D wikidb -e "delete from comment where comment_id in (select log_comment_id from logging where log_actor > 1)"
mysql -D wikidb -e "delete from logging where log_actor <> 1"
mysql -D wikidb -e "delete from recentchanges where rc_actor <> 1"
mysql -D wikidb -e "delete from user_properties where up_user <> 1"
mysql -D wikidb -e "delete from user where user_id <> 1"
mysql -D wikidb -e "delete from user_groups where ug_user <> 1"
mysql -D wikidb -e "delete from user_former_groups where ufg_user <> 1"
mysql -D wikidb -e "delete from actor where actor_user > 1"
mysql -D wikidb -e "delete from watchlist where wl_user <> 1"
# mysql -D wikidb -e "delete from text where old_id = (select ar_text_id from archive where ar_actor <> 1)"
mysql -D wikidb -e "delete from archive where ar_actor <> 1"
mysql -D wikidb -e "delete from uploadstash where us_user <> 1"
mysql -D wikidb -e "delete from filearchive where fa_actor <> 1"
mysql -D wikidb -e "delete from image where img_actor <> 1"
mysql -D wikidb -e "delete from oldimage where oi_actor <> 1"
# mysql -D wikidb -e "delete from text where old_id = (select rev_text_id from revision where rev_actor <> 1)"
mysql -D wikidb -e "delete from revision where rev_actor <> 1"
mysql -D wikidb -e "delete from page where not exists (select 1 from revision r where page.page_id = r.rev_page)"
mysql -D wikidb -e "delete from page where page_id in (select rev_page from revision where rev_parent_id = 0 and rev_actor <> 1)"
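A quick sanity check afterwards (a sketch): the page count should match the good content, and the two spam-related counts should come back as zero.

mysql -D wikidb -e "select count(*) from page"
mysql -D wikidb -e "select count(*) from user where user_id <> 1"
mysql -D wikidb -e "select count(*) from revision where rev_actor <> 1"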
This still needs fine-tuning: the following query lists entries from the text table that have no internal use and no good pages associated with them. It selects page text whose content id does not appear in any revision slot of a known good user, and it keeps the internal messages (those with id <= 1637).
mysql -D wikidb -e "select old_id, substring(old_text, 1, 100) from text where old_id > 1637 and old_id not in (select cast(substring(content_address, 4) as unsigned) from revision, slots, content where rev_id = slot_revision_id and slot_content_id = content_id and content_address like 'tt:%' and rev_actor in (select distinct rev_actor from revision, page where page_id = rev_page) )" | head
Once this returns only spam (maybe also try it without head), convert it into a delete statement.
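The converted delete might look like this (a sketch only, derived directly from the select above; untested, so take a fresh dump first):

mysql -D wikidb -e "delete from text where old_id > 1637 and old_id not in (select cast(substring(content_address, 4) as unsigned) from revision, slots, content where rev_id = slot_revision_id and slot_content_id = content_id and content_address like 'tt:%' and rev_actor in (select distinct rev_actor from revision, page where page_id = rev_page))"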
The only traces left of the spammers are log entries from my earlier manual cleanup attempts. I can live with that :)
Sometimes single pages get lost after this procedure, so check all pages after the cleanup. If pages are lost: restore the backup, save the edit content of the relevant pages somewhere, repeat the cleanup, and restore the content manually.
mysql -e "drop database wikidb"
mysql -e "create database wikidb"
cat /backup/mediawiki/mediawiki-2023-01-11-before-cleanup.sql | mysql wikidb