Page MenuHomePhabricator

Sanitize and ingest all event tables into the event_sanitized database
Closed, ResolvedPublic

Description

We are slowly creating more streams in Event Platform, but we don't currently sanitize and save them in the event_sanitized database like we do for legacy EventLogging data.

For example: mediawiki_mediasearch_interaction

Acceptance criteria:

  • Allowlist system – file(s) where users can specify retention policies for streams
  • Sanitization job that sanitizes event data from Event Platform streams according to retention policies specified with the allowlist system

TODO:

  • Refactor EventLoggingSanitization job to something more generic: RefineSanitize
  • Move all event_sanitized partitions to lowercased directory names to avoid re-refining data
  • Apply backwards compatible usage of RefineSanitize for eventlogging
  • Create new generic RefineSanitize job to be able to sanitize any data
  • Copy all existent data that generic RefineSanitize job targets into event_sanitized
  • Create data purge job to remove data after 90 days from all tables in event

Details

SubjectRepoBranchLines +/-
operations/puppetproduction+4 -4
operations/puppetproduction+16 -2
operations/puppetproduction+1 -1
operations/puppetproduction+1 -1
operations/puppetproduction+0 -13
operations/puppetproduction+2 -0
operations/puppetproduction+1 -0
operations/puppetproduction+23 -24
analytics/refinerymaster+1 -1
analytics/refinerymaster+3 -2
operations/puppetproduction+3 -3
operations/puppetproduction+2 -4
analytics/refinery/sourcemaster+3 -2
operations/puppetproduction+14 -1
analytics/refinerymaster+16 -0
analytics/refinerymaster+21 -0
operations/puppetproduction+51 -24
operations/puppetproduction+13 -64
operations/puppetproduction+6 -2
operations/puppetproduction+184 -100
operations/puppetproduction+69 -30
analytics/refinerymaster+1 -0
operations/puppetproduction+42 -4
analytics/refinery/sourcemaster+12 -5
operations/puppetproduction+21 -0
operations/puppetproduction+3 -0
operations/puppetproduction+1 -1
operations/puppetproduction+149 -45
operations/puppetproduction+5 -5
operations/puppetproduction+2 -2
operations/puppetproduction+1 -1
operations/puppetproduction+12 -1
operations/puppetproduction+5 -0
operations/puppetproduction+2 -2
operations/puppetproduction+96 -0
operations/puppetproduction+1 -1
operations/puppetproduction+20 -20
operations/puppetproduction+84 -126
operations/puppetproduction+10 -0
analytics/refinery/sourcemaster+487 -523
analytics/refinery/sourcemaster+1 K -805
analytics/refinery/sourcemaster+194 -175
Show related patches Customize query in gerrit

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Change 678608 merged by Ottomata:

[operations/puppet@production] refine - use refinery 0.1.4

https://gerrit.wikimedia.org/r/678608

Change 678836 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] Refine - fix typo in job_config

https://gerrit.wikimedia.org/r/678836

Change 678836 merged by Ottomata:

[operations/puppet@production] Refine - fix typo in job_config

https://gerrit.wikimedia.org/r/678836

Change 676380 merged by Ottomata:

[operations/puppet@production] Set up refine_sanitize jobs in analytics test cluster.

https://gerrit.wikimedia.org/r/676380

Change 678857 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] test/refine_sanitize - use proper refinery path

https://gerrit.wikimedia.org/r/678857

Change 678857 merged by Ottomata:

[operations/puppet@production] test/refine_sanitize - use proper refinery path

https://gerrit.wikimedia.org/r/678857

Change 678863 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] Ensure refinery/python on PYTHONPATH for refinery-eventlogging-saltrotate in test

https://gerrit.wikimedia.org/r/678863

Change 678863 merged by Ottomata:

[operations/puppet@production] Ensure refinery/python on PYTHONPATH for refinery-eventlogging-saltrotate

https://gerrit.wikimedia.org/r/678863

Change 678867 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] test/refine_sanitize - ensure hdfs salts dir exists, and use -f when running saltrotate rm

https://gerrit.wikimedia.org/r/678867

Change 678867 merged by Ottomata:

[operations/puppet@production] test/refine_sanitize - ensure hdfs salts dir exists

https://gerrit.wikimedia.org/r/678867

Change 678876 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] test/refine_santize - Use normalized lowercase table name

https://gerrit.wikimedia.org/r/678876

Change 678876 merged by Ottomata:

[operations/puppet@production] test/refine_santize - Use normalized lowercase table name

https://gerrit.wikimedia.org/r/678876

Change 678941 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] Refactor EventLoggingSanitization using RefineSanitize

https://gerrit.wikimedia.org/r/678941

Change 679376 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] refine - lowercase eventlogging legeacy table names in include/exclude regexes

https://gerrit.wikimedia.org/r/679376

Change 679846 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] test/refine - use 0.1.5 and lowercase table regexes

https://gerrit.wikimedia.org/r/679846

Change 679846 merged by Ottomata:

[operations/puppet@production] test/refine - use 0.1.5 and lowercase table regexes

https://gerrit.wikimedia.org/r/679846

Change 679376 merged by Ottomata:

[operations/puppet@production] refine - use 0.1.5 and lowercase table names in include/exclude regexes

https://gerrit.wikimedia.org/r/679376

Change 679939 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] test/refine_sanitize - make salt rotation more generic

https://gerrit.wikimedia.org/r/679939

Change 679939 merged by Ottomata:

[operations/puppet@production] test/refine_sanitize - make salt rotation more generic

https://gerrit.wikimedia.org/r/679939

Change 679941 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] refine_sanitize_salt_rotate - use local_salts_prefix properly

https://gerrit.wikimedia.org/r/679941

Change 679941 merged by Ottomata:

[operations/puppet@production] refine_sanitize_salt_rotate - use local_salts_prefix properly

https://gerrit.wikimedia.org/r/679941

Change 680382 had a related patch set uploaded (by Ottomata; author: Ottomata):

[analytics/refinery/source@master] SanitizeTransformation - Just some simple logging improvements.

https://gerrit.wikimedia.org/r/680382

Change 681069 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] test/refine_sanitize - absent sanitize_eventlogging_analytics_delayed_test until June

https://gerrit.wikimedia.org/r/681069

Change 681069 merged by Ottomata:

[operations/puppet@production] test/refine_sanitize - absent sanitize delayed_test until June

https://gerrit.wikimedia.org/r/681069

Change 681092 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] analytics test - import mediawiki.page-data with camus and refine it

https://gerrit.wikimedia.org/r/681092

Change 681092 merged by Ottomata:

[operations/puppet@production] analytics test - import mediawiki.page-delete with camus and refine it

https://gerrit.wikimedia.org/r/681092

Change 681099 had a related patch set uploaded (by Ottomata; author: Ottomata):

[analytics/refinery@master] event_sanitized_allowlist - mediawiki_page_delete: keep_all

https://gerrit.wikimedia.org/r/681099

Change 680382 merged by Ottomata:

[analytics/refinery/source@master] SanitizeTransformation - Just some simple logging improvements.

https://gerrit.wikimedia.org/r/680382

Change 679961 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] test/refine_sanitized - add a general purpose event_sanitized job

https://gerrit.wikimedia.org/r/679961

Change 681105 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] test/refine_sanitized - add a general purpose event_sanitized job

https://gerrit.wikimedia.org/r/681105

Change 679961 abandoned by Ottomata:

[operations/puppet@production] test/refine_sanitized - add a general purpose event_sanitized job

Reason:

Iae3a76a37169a852e9698157b6b06e5b1e99f47a

https://gerrit.wikimedia.org/r/679961

Change 681099 merged by Ottomata:

[analytics/refinery@master] event_sanitized_main_allowlist - mediawiki_page_delete: keep_all

https://gerrit.wikimedia.org/r/681099

Change 681105 merged by Ottomata:

[operations/puppet@production] test/refine_sanitized - add a event_sanitized_main job

https://gerrit.wikimedia.org/r/681105

Change 678941 merged by Ottomata:

[operations/puppet@production] Move sanitize_eventlogging_analytics jobs from data_purge to refine_santizie

https://gerrit.wikimedia.org/r/678941

Change 681784 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] sanitize_eventlogging_analytics_immediate - ensure absent during switchover

https://gerrit.wikimedia.org/r/681784

Change 681784 merged by Ottomata:

[operations/puppet@production] sanitize_eventlogging_analytics_immediate - ensure absent during switchover

https://gerrit.wikimedia.org/r/681784

Change 681991 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] refine_sanitize - use refinery 0.1.6 with RefineSanitize job

https://gerrit.wikimedia.org/r/681991

Change 681991 merged by Ottomata:

[operations/puppet@production] refine_sanitize - use refinery 0.1.6 with RefineSanitize job

https://gerrit.wikimedia.org/r/681991

Change 682175 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] refine - allow configuring RefineMonitors since and until params

https://gerrit.wikimedia.org/r/682175

Change 682175 merged by Ottomata:

[operations/puppet@production] refine - allow configuring RefineMonitors since and until params

https://gerrit.wikimedia.org/r/682175

Change 682986 had a related patch set uploaded (by Ottomata; author: Ottomata):

[analytics/refinery@master] Rename and symlink sanitization eventlogging/whitelist.yaml

https://gerrit.wikimedia.org/r/682986

Change 683043 had a related patch set uploaded (by Ottomata; author: Ottomata):

[analytics/refinery@master] Add more tables to sanitize in event_sanitized_main_allowlist

https://gerrit.wikimedia.org/r/683043

Change 683053 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] test/data_purge - add drop_event job

https://gerrit.wikimedia.org/r/683053

Change 682986 merged by Ottomata:

[analytics/refinery@master] Rename and symlink sanitization eventlogging/whitelist.yaml

https://gerrit.wikimedia.org/r/682986

Change 683043 merged by Ottomata:

[analytics/refinery@master] Add more tables to sanitize in event_sanitized_main_allowlist

https://gerrit.wikimedia.org/r/683043

Change 683053 merged by Ottomata:

[operations/puppet@production] test/data_purge - add drop_event job

https://gerrit.wikimedia.org/r/683053

Mentioned in SAL (#wikimedia-analytics) [2021-04-28T12:50:06Z] <ottomata> applied data_purge jobs in analytics test cluster; old data will now be dropped there - T273789

Change 683275 had a related patch set uploaded (by Ottomata; author: Ottomata):

[analytics/refinery/source@master] RefineSanitizeMonitor - pass keep_all_enabled to SanitizeTransformation.loadAllowlist

https://gerrit.wikimedia.org/r/683275

Change 683275 merged by Ottomata:

[analytics/refinery/source@master] RefineSanitizeMonitor - pass keep_all_enabled to SanitizeTransformation.loadAllowlist

https://gerrit.wikimedia.org/r/683275

To prep for moving historical event data to event_sanitized, we need to copy the tables we'll be keeping forever there. Procedure:

# On an-launcher1002 in /home/otto
mkdir event_sanitized_main_T280813
cd event_sanitized_main_T280813

tables="
mediawiki_centralnotice_campaign_change \
mediawiki_centralnotice_campaign_create \
mediawiki_page_create \
mediawiki_page_delete \
mediawiki_page_links_change \
mediawiki_page_move \
mediawiki_page_restrictions_change \
mediawiki_page_suppress \
mediawiki_page_undelete \
mediawiki_revision_create \
mediawiki_revision_recommendation_create \
mediawiki_revision_score \
mediawiki_revision_tags_change \
mediawiki_revision_visibility_change \
mediawiki_user_blocks_change"

distcp_source_file=./distcp_source.txt
truncate -s 0 $distcp_source_file

for table in $tables; do
    path_orig="hdfs://analytics-hadoop/wmf/data/event/$table"
    path_new="hdfs://analytics-hadoop/wmf/data/event_sanitized/$table"

    # append the distcp source_file with path_orig.
    echo $path_orig >> $distcp_source_file

    # Build Hive scripts to create and add partitions to the event_sanitized table
    hive_sql_file=./${table}_create.hql
    truncate -s 0 $hive_sql_file

    # Build Hive scripts to recreate the tables with new lowercased partitions and then repair the tables to recreate the partitions.
    echo "SET mapred.child.java.opts=-Xmx8G -XX:+UseConcMarkSweepGC  -XX:-UseGCOverheadLimit;" >> $hive_sql_file
    echo "SET hive.msck.repair.batch.size=5000;" >> $hive_sql_file;
    echo "CREATE EXTERNAL TABLE IF NOT EXISTS event_sanitized.${table} LIKE event.${table};" >> $hive_sql_file
    echo "ALTER TABLE event_sanitized.${table} SET LOCATION '$path_new';" >> $hive_sql_file
    echo "MSCK REPAIR TABLE event_sanitized.${table};" >> $hive_sql_file
done


# do it!
mkdir ./done

# First run the distcp to copy all table directories into /wmf/data/event_sanitized
sudo -u analytics kerberos-run-command analytics hdfs dfs -put $distcp_source_file /tmp/distcp_event_sanitized_main_T280813.txt
time sudo -u analytics kerberos-run-command analytics hadoop distcp -m 60 -p -f /tmp/distcp_event_sanitized_main_T280813.txt hdfs://analytics-hadoop/wmf/data/event_sanitized
mv $distcp_source_file ./done/

for table in $tables; do
    echo "Creating and repairing event_sanitized.$table"
    time sudo -u analytics kerberos-run-command analytics hive -f ./${table}_create.hql && mv -v ./${table}_create.hql ./done/
    echo ""
done

Change 683375 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] Update refine_sanitize job to use event_sanitized_analytics_allowlist

https://gerrit.wikimedia.org/r/683375

Change 683376 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] canary_events and refine_sanitize - use refinery 0.1.9

https://gerrit.wikimedia.org/r/683376

Change 683375 merged by Ottomata:

[operations/puppet@production] Update refine_sanitize job to use event_sanitized_analytics_allowlist

https://gerrit.wikimedia.org/r/683375

Change 683376 merged by Ottomata:

[operations/puppet@production] canary_events and refine_sanitize - use refinery 0.1.9

https://gerrit.wikimedia.org/r/683376

Change 683421 had a related patch set uploaded (by Ottomata; author: Ottomata):

[analytics/refinery@master] Update event_sanitized_main_allowlist

https://gerrit.wikimedia.org/r/683421

For reference, the current sizes of the tables to move to event_sanitized are:

9.3 M  /wmf/data/event/mediawiki_centralnotice_campaign_change
1.4 M  /wmf/data/event/mediawiki_centralnotice_campaign_create
39.7 G  /wmf/data/event/mediawiki_page_create
4.4 G  /wmf/data/event/mediawiki_page_delete
309.5 G  /wmf/data/event/mediawiki_page_links_change
2.5 G  /wmf/data/event/mediawiki_page_move
2.5 K  /wmf/data/event/mediawiki_page_properties_change
428.5 M  /wmf/data/event/mediawiki_page_restrictions_change
10.4 M  /wmf/data/event/mediawiki_page_suppress
375.9 M  /wmf/data/event/mediawiki_page_undelete
392.1 G  /wmf/data/event/mediawiki_revision_create
13.9 M  /wmf/data/event/mediawiki_revision_recommendation_create
212.6 G  /wmf/data/event/mediawiki_revision_score
123.5 G  /wmf/data/event/mediawiki_revision_tags_change
1.1 G  /wmf/data/event/mediawiki_revision_visibility_change
3.8 G  /wmf/data/event/mediawiki_user_blocks_change
4.1 T  /wmf/data/event/resource_change

FYI: We are not going to sanitize and keep resource_change

distcp complete, application_id: application_1619507802557_6586

DistCp Counters
        Bytes Copied=1170743820170
        Bytes Expected=1170743820170
        Files Copied=1955833
        DIR_COPY=331299

Moving on to creating and repairing hive tables.

Change 683421 merged by Ottomata:

[analytics/refinery@master] Update event_sanitized_main_allowlist

https://gerrit.wikimedia.org/r/683421

Mentioned in SAL (#wikimedia-operations) [2021-04-29T13:37:36Z] <otto@deploy1002> Started deploy [analytics/refinery@b3c5820]: update event_sanitized_main allowlst on an-launcher1002 - T273789

Mentioned in SAL (#wikimedia-operations) [2021-04-29T13:40:35Z] <otto@deploy1002> Finished deploy [analytics/refinery@b3c5820]: update event_sanitized_main allowlst on an-launcher1002 - T273789 (duration: 02m 59s)

Change 683614 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] Enable event_sanitized_main_immediate job

https://gerrit.wikimedia.org/r/683614

Change 683620 had a related patch set uploaded (by Ottomata; author: Ottomata):

[analytics/refinery@master] Add comment in allowlist about mediawiki_page_properties_change

https://gerrit.wikimedia.org/r/683620

Change 683620 merged by Ottomata:

[analytics/refinery@master] Add comment in allowlist about mediawiki_page_properties_change

https://gerrit.wikimedia.org/r/683620

Change 683622 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] Add comment in refine about mediawiki_page_properties_change

https://gerrit.wikimedia.org/r/683622

Change 683614 merged by Ottomata:

[operations/puppet@production] Enable event_sanitized_main_immediate job

https://gerrit.wikimedia.org/r/683614

Change 683622 merged by Ottomata:

[operations/puppet@production] Add comment in refine about mediawiki_page_properties_change

https://gerrit.wikimedia.org/r/683622

Mentioned in SAL (#wikimedia-analytics) [2021-04-29T15:38:10Z] <ottomata> enabling event_sanitized_main jobs - T273789

Mentioned in SAL (#wikimedia-operations) [2021-04-29T16:15:37Z] <otto@deploy1002> Started deploy [analytics/refinery@b3c5820] (hadoop-test): update event_sanitized_main allowlst on an-launcher1002 - T273789

Mentioned in SAL (#wikimedia-operations) [2021-04-29T16:18:17Z] <otto@deploy1002> Finished deploy [analytics/refinery@b3c5820] (hadoop-test): update event_sanitized_main allowlst on an-launcher1002 - T273789 (duration: 02m 39s)

Backfilling since yesterday's distcp was started.

sudo -u analytics kerberos-run-command analytics /usr/local/bin/refine_event_sanitized_main_immediate --since=72

Change 683698 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] Remove refinery-drop-webrequest-sampled-druid job from test cluster

https://gerrit.wikimedia.org/r/683698

Change 683698 merged by Ottomata:

[operations/puppet@production] Remove refinery-drop-webrequest-sampled-druid job from test cluster

https://gerrit.wikimedia.org/r/683698

Change 683890 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] Remove absented druid data drop job from test cluster

https://gerrit.wikimedia.org/r/683890

Change 683890 merged by Ottomata:

[operations/puppet@production] Remove absented druid data drop job from test cluster

https://gerrit.wikimedia.org/r/683890

ah, I need to reset the _REFINED flag mtimes for the hours I distcp-ed over, just like I did in https://phabricator.wikimedia.org/T280813#7028014. There's more data this time. Here's what I'm running now:


import org.wikimedia.analytics.refinery.job.refine._
import com.github.nscala_time.time.Imports._
import org.joda.time.format.ISODateTimeFormat
import org.apache.hadoop.fs.Path
import org.joda.time.Days

val isoDateTimeFormatter = ISODateTimeFormat.dateTimeParser()

// Have to split this into smaller chunks; default non yarn spark shell couldn't handle all partitions in this range at once.
val begin = DateTime.parse("2021-03-14T00:00:00Z", isoDateTimeFormatter)
val end = DateTime.parse("2021-05-01T00:00:00Z", isoDateTimeFormatter)


def daysInBetween(d1: DateTime, d2: DateTime): Seq[DateTime] = {
    val oldestDay = new DateTime(d1, DateTimeZone.UTC).hourOfDay.roundCeilingCopy
    val youngestDay = new DateTime(d2, DateTimeZone.UTC).hourOfDay.roundFloorCopy

    for (d <- 0 until Days.daysBetween(oldestDay, youngestDay).getDays) yield {
        oldestDay + d.days
    }
}

val days = daysInBetween(begin, end)
val outputPath = new Path("/wmf/data/event_sanitized")

for (i <- 0 until days.size - 1) yield {
    val since = days(i)
    val until = days(i + 1)

    println("==== ", since, until, " ====")

    val rts = RefineTarget.find(
        spark,
        "event",
        outputPath,
        "event_sanitized",
        since,
        until,
        None,
        None
    )

    val doneRts = rts.filter(_.doneFlagExists).par
    doneRts.foreach(_.writeDoneFlag)
    println("==== DONE ", since, until, doneRts.size, " ====")
    println("")
}

Change 684427 had a related patch set uploaded (by Mforns; author: Mforns):

[operations/puppet@production] analytics:refinery:job:test:data_purge: remove -skipTrash from drop_event

https://gerrit.wikimedia.org/r/684427

Change 684427 merged by Ottomata:

[operations/puppet@production] analytics:refinery:job:test:data_purge: remove -skipTrash from drop_event

https://gerrit.wikimedia.org/r/684427

Change 689925 had a related patch set uploaded (by Mforns; author: Mforns):

[operations/puppet@production] analytics:refinery:job:data_purge: remove -skipTrash from drop_event

https://gerrit.wikimedia.org/r/689925

Mentioned in SAL (#wikimedia-analytics) [2021-05-12T15:17:39Z] <ottomata> dropped event.mediawiki_job_* tables and data directories with mforns - T273789

Change 689925 merged by Ottomata:

[operations/puppet@production] analytics:refinery:job:data_purge: improve drop-el-unsanitized-events

https://gerrit.wikimedia.org/r/689925

Change 690038 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] data_purge - Refactor drop-el-unsanitized-events in to drop_event job

https://gerrit.wikimedia.org/r/690038

Change 690038 merged by Ottomata:

[operations/puppet@production] data_purge - Refactor drop-el-unsanitized-events in to drop_event job

https://gerrit.wikimedia.org/r/690038

Mentioned in SAL (#wikimedia-operations) [2021-05-12T20:57:13Z] <ottomata> starting new drop_event data purge job to drop all event data older than 90 days in the Hive event database - T273789

Change 675936 abandoned by Ottomata:

[operations/puppet@production] refine: rename EventLoggingSanitization to RefineSanitize

Reason:

Fixed elsewhere

https://gerrit.wikimedia.org/r/675936