User Details
- User Since: Feb 11 2015, 6:02 PM (476 w, 14 h)
- Availability: Available
- IRC Nick: joal
- LDAP User: Unknown
- MediaWiki User: JAllemandou (WMF)
Yesterday
Thanks a lot @nshahquinn-wmf :)
Tue, Mar 26
We currently have use cases that do exactly this, and they work; there must have been an issue other than the one described here. I think this ticket is invalid.
Done using airflow variable mechanism.
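For reference, the Airflow Variable mechanism used here looks roughly like this (the variable name and default below are just examples, not the actual job config):
```python
# Hypothetical example: the real variable name and default differ per job.
from airflow.models import Variable

# Read a runtime setting from an Airflow Variable instead of hardcoding it,
# falling back to a default when the variable is not defined.
refinery_jar_version = Variable.get("refinery_jar_version", default_var="0.2.13")

# The same value can also be used in templated fields:
# bash_command = "echo {{ var.value.refinery_jar_version }}"
```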
Thu, Mar 21
I have run our script to list user content in our various machines, the result is below.
@AndrewTavis_WMDE , I let you review and let us know when you have copied stuff you wish to keep, so that we can delete the rest.
Thu, Feb 29
I think we're gonna use this ticket: https://phabricator.wikimedia.org/T262201
Nothing done on my end - possibly one of the 2 jobs failed for real?
Indeed, the job will not be affected by next month's changes:
https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/blame/main/analytics/dags/clickstream/clickstream_monthly_dag.py#L66
We'll need to keep looking for when those change though :)
Wed, Feb 28
We're gonna build a quick fix so that next month's sqoop succeeds (null values in dropped fields for some projects).
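The quick fix amounts to selecting literal NULLs in place of the dropped columns, so the sqoop output schema stays identical across wikis. A rough sketch of the idea (column and table names are examples, not the actual job code):
```python
# Rough sketch only, not the actual sqoop job code. Column names are examples.
DROPPED_COLUMNS = {"pl_namespace", "pl_title"}
ALL_COLUMNS = ["pl_from", "pl_from_namespace", "pl_namespace", "pl_title", "pl_target_id"]

def pagelinks_select(wiki_already_migrated: bool) -> str:
    """Build a SELECT clause that substitutes NULL for columns a wiki has dropped."""
    parts = [
        f"NULL AS {col}" if (wiki_already_migrated and col in DROPPED_COLUMNS) else col
        for col in ALL_COLUMNS
    ]
    return "SELECT " + ", ".join(parts) + " FROM pagelinks"

print(pagelinks_select(wiki_already_migrated=True))
```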
Feb 22 2024
I don't think this change would really affect query performance, but I'm in favor of doing it for the benefit of relieving some pressure from Kerberos.
Have we looked around to see if there are existing 'dataset' config formats/specs we can already use?
Feb 21 2024
Implementation plan:
- Add a new skip option in https://github.com/wikimedia/analytics-refinery/blob/master/bin/import-mediawiki-dumps#L29 to skip wikis from the wiki-list file the job reads (a sketch follows the list below).
- Use this new option to skip wikidatawiki in the puppet setup systemdtimer:
- https://github.com/wikimedia/operations-puppet/blob/7707b14401ffc97e0adc136850f670c826552049/modules/profile/templates/analytics/refinery/job/refinery-import-mediawiki-dumps.sh.erb
- https://github.com/wikimedia/operations-puppet/blob/7707b14401ffc97e0adc136850f670c826552049/modules/profile/manifests/analytics/refinery/job/import_mediawiki_dumps_config.pp
- https://github.com/wikimedia/operations-puppet/blob/production/modules/profile/manifests/analytics/refinery/job/import_mediawiki_dumps.pp
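For illustration, a hypothetical sketch of what the skip option for the first step could look like; the real script's language and interface may differ:
```python
#!/usr/bin/env python3
# Hypothetical sketch only: the real import-mediawiki-dumps script may be
# structured differently. Shows how a --skip-wikis option could filter the
# wiki-list file before importing.
import argparse

def read_wiki_list(path):
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--wiki-file", required=True, help="File listing one wiki per line")
    parser.add_argument("--skip-wikis", default="",
                        help="Comma-separated wikis to exclude, e.g. wikidatawiki")
    args = parser.parse_args()

    skipped = {w.strip() for w in args.skip_wikis.split(",") if w.strip()}
    for wiki in read_wiki_list(args.wiki_file):
        if wiki in skipped:
            continue
        print(f"Importing dumps for {wiki}")  # actual import logic goes here

if __name__ == "__main__":
    main()
```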
Feb 12 2024
Thank you so much @Ladsgroup for the recap.
Feb 8 2024
Hi @Ladsgroup,
I have a question for you: have all the projects been migrated to using the new linktarget table for the pagelinks table, even if their columns have not been removed?
I'm asking so that we can adapt our sqoop jobs, as we're starting to experience issues (only testwiki this month).
Feb 7 2024
This has started: the testwiki schema has changed.
I'd also like to talk about https://github.com/wikimedia/analytics-refinery/blob/master/python/refinery/sqoop.py#L622 as the linktarget table is considered private now.
Feb 1 2024
The Data Engineering team has written some code for our cassandra-loading jobs to be able to read a password from a file on HDFS:
https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-spark/src/main/scala/org/wikimedia/analytics/refinery/spark/utils/WmfCassandraAuthConfFactory.scala
While that could be useful, the spark-thrift server doesn't support user impersonation. The StackOverflow ticket I have read points to https://github.com/apache/kyuubi. We could investigate this.
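For context, a minimal pyspark-flavoured sketch of the password-from-HDFS idea mentioned above; this is not the actual WmfCassandraAuthConfFactory implementation, and the path and option names are placeholders:
```python
# Minimal sketch, assuming a single-line secret file on HDFS; not the actual
# WmfCassandraAuthConfFactory implementation.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Go through the JVM Hadoop FileSystem API so the secret never needs to be
# copied to local disk or passed on the command line.
hadoop_conf = sc._jsc.hadoopConfiguration()
Path = sc._jvm.org.apache.hadoop.fs.Path
fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(hadoop_conf)

stream = fs.open(Path("hdfs:///user/analytics/cassandra_password.txt"))  # placeholder path
try:
    password = stream.readLine().strip()
finally:
    stream.close()

# The password can then be handed to the Cassandra connector, e.g. as the
# "spark.cassandra.auth.password" option on the write.
```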
Jan 30 2024
Go for it :)
Jan 29 2024
Thanks a lot for not forgetting about this ticket @mfossati :)
The Data Engineering team is on the road toward providing you with (hopefully) an easy enough way to configure data deletion for your datasets.
In the meantime, manual deletion every now and then should be enough.
I don't think it's worth investing time on this before the new system comes in (probably a few months).
Is that ok for you?
Does your tooling let you control size and throughput?
Jan 26 2024
I think this is a good idea :)
The smaller size shouldn't be an issue, as we're testing functionality rather than scalability.
Jan 24 2024
The task is old but the objective is still valid IMO.
We should talk to @Eevans about this.
Closing as the strategy is to migrate to Iceberg.
Jan 18 2024
Blocked on https://phabricator.wikimedia.org/T355352
Jan 10 2024
@brennen has updated my rights on GitLab, giving me ownership rights on the project. Problem solved.
Jan 9 2024
The stewards subgroup is there to support Wikimedia stewards, and therefore not the correct place for our project. We decided to put it in ci-tools, even if it's probably a bit of a stretch :)
Dec 12 2023
So ya let's go with VCL!
+1
Dec 11 2023
Now output files dropped to 1k! 🎉
Dec 7 2023
Hi @brennen - I've been told you could be the one to ask this question to: I'd like to create a new gitlab project for our global JVM POM file, reused globally at the foundation (therefore not under a team's name). I have identified the ci-tools subgroup and the stewards subgroup, and wondered if you thought the latter would be good? Thanks
Dec 5 2023
You guys rock <3
I'm also eager to check whether we run into parquet-decompression issues, as I think we could. Thanks a lot for running those experiments @xcollazo :)
If we start having data about which webrequest hits are prefetch or not, we definitely would be able to investigate! I'm in favor of moving fast and passing this header through as a new webrequest field. No change would be needed in Gobblin, only in the wmf_raw.webrequest and wmf.webrequest schemas, as well as the refine_webrequest hql to forward the field.
Thanks for the ping @Jdforrester-WMF. Data Engineering has not been using statsd as far as I know. We have helped the Performance Team with some of its usage (if I recall correctly, it was work done with @Krinkle and Gilles), but we have not been maintaining or using statsd.
Let's talk and see how statsd is used nowadays, as the doc is old and talks about Graphite: https://wikitech.wikimedia.org/wiki/Graphite#Data_sources, https://wikitech.wikimedia.org/wiki/Statsd
Dec 4 2023
As discussed on Slack, I would suggest using dataframe.repartition(X) for the features datasets as data is relatively small and using coalesce impairs job scalability (and the number of files is far too big in comparison to the data size :).
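To make the suggestion concrete, a small sketch (paths and the target partition count are placeholders):
```python
# Sketch only: paths and the target partition count are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
features = spark.read.parquet("/wmf/data/example/features")

# repartition() adds a shuffle, so upstream stages keep their full parallelism
# while the output still lands in a small, controlled number of files.
features.repartition(16).write.mode("overwrite").parquet("/wmf/data/example/features_small")

# features.coalesce(16) would also produce 16 files, but the narrow dependency
# can drag down the parallelism of the stages that compute `features`.
```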
Just added data-platform-SRE project to the list of projects. Ping @BTullis on this as well.
I think it can be closed, yes - the remaining tasks are almost all about Refine improvements we can now implement thanks to the move to Spark3.
Dec 1 2023
Thanks for the ping @mfossati - 79k files are still quite a lot - would you mind telling me more about the data (size, partition scheme, etc.)?
Nov 30 2023
@gmodena is working on adding data-quality metrics on the webrequest dataset (https://phabricator.wikimedia.org/T349763), and we added this one (duplicate map keys) in the list of things to have in the POC.
We shall have some data one of these days :)
Thanks folks :) This will make HDFS a lot happier <3
Nov 27 2023
It's in my plan to update the docs @mpopov, it just takes longer than I would like (like everything else I do lately).
Relevant slack discussion: https://app.slack.com/client/E012JBDTTHA/CSV483812
Discussed during Data Engineering standup: let's fix with spark.sql.mapKeyDedupPolicy=LAST_WIN, and implement the metric monitoring how often it happens :)
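For reference, the fix in action as a small self-contained example:
```python
# Small self-contained example of the chosen fix.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# The default policy (EXCEPTION) makes duplicate map keys fail the query;
# LAST_WIN keeps the last value for a duplicated key instead.
spark.conf.set("spark.sql.mapKeyDedupPolicy", "LAST_WIN")

spark.sql(
    "SELECT map_from_arrays(array('a', 'b', 'a'), array(1, 2, 3)) AS m"
).show(truncate=False)
# With LAST_WIN the result is {a -> 3, b -> 2}
```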
Nov 24 2023
Thanks a lot @elukey for quickly troubleshooting <3
Nov 21 2023
I have added documentation here: https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/Airflow/Developer_guide#Concurrent_DAG_Runs
I think we can go and change our global config to use 1 as the default parallelisation of dag_runs per dag.
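For reference, the global default would be something like this (exact deployment mechanism aside):
```python
# Sketch of the proposed default, expressed via Airflow's environment-variable
# form of the [core] max_active_runs_per_dag setting; how we actually deploy
# the config may differ.
import os

os.environ["AIRFLOW__CORE__MAX_ACTIVE_RUNS_PER_DAG"] = "1"
```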
Nov 16 2023
ping @mpopov , @xcollazo , @Ottomata and @Milimetric :)
After giving it a bit more thought, it seems that NOT changing the parameter globally to enforce data correctness in folders is the best idea. We would document the concurrency issue in Wikitech and in Airflow, and point to that setting as the solution for backfilling.
I'm interested in feedback if anyone thinks it's not the best idea :)
Thanks a lot @BTullis - The problem you linked is indeed a known issue. We rely on hive-metastore and _SUCCESS files which should prevent the issue on prod jobs. For other jobs, this is indeed a problem.
I see 2 possible solutions here:
- We keep the change you made and see if it causes problems.
- We don't apply the change but provide a solution for people backfilling by telling them to use that parameter in their jobs.
I'm not sure which is best
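For the second option, the per-DAG setting for backfilling would look roughly like this (DAG id and dates are placeholders):
```python
# Rough sketch of the second option: cap a DAG to one active run so backfilled
# intervals write their output folders one at a time. Names and dates are placeholders.
from datetime import datetime
from airflow import DAG

with DAG(
    dag_id="example_backfill_dag",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    max_active_runs=1,  # only one DAG run at a time during backfills
    catchup=True,
) as dag:
    ...  # tasks go here
```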
That's great :)
Thanks a lot @BTullis :)
Oh hey, looks like we're not alone: https://community.cloudera.com/t5/Support-Questions/How-to-change-Spark-temporary-directory-when-writing-data/m-p/237389 There appears to be a possible solution (a minor config change) which may fix this.
Nov 15 2023
This has been corrected - data should be ok now. Sorry for the inconvenience.
Super interesting finding!
Nov 13 2023
The document is here.
Thank you @Hghani for this finding.
All unique-devices Iceberg insertion jobs were incorrect.
The patch above is the correction. Once merged and deployed, I'll clean up the tables.