
JAllemandou (joal)
Data Engineer

User Details

User Since
Feb 11 2015, 6:02 PM (476 w, 14 h)
Availability
Available
IRC Nick
joal
LDAP User
Unknown
MediaWiki User
JAllemandou (WMF) [ Global Accounts ]

Recent Activity

Yesterday

JAllemandou added a comment to T357859: Skip Wikidata when loading XML dumps to the Data Lake.

Thanks a lot @nshahquinn-wmf :)

Wed, Mar 27, 1:47 PM · Patch-For-Review, Movement-Metrics, Data-Engineering (Sprint 9), Data Products, Movement-Insights
JAllemandou added a comment to T356363: [Refine Refactoring] Refactor refinery code for compatibility with Airflow integration.

Would we still want this integrated email functionality within refinery, when it's running under airflow?

Wed, Mar 27, 1:47 PM · Data-Engineering (Sprint 9), Patch-For-Review

Tue, Mar 26

JAllemandou added a comment to T359435: [Airflow] SparkSqlOperator fails when executing via Skein with master=local.

We currently have use cases doing exactly this, and they work. There must have been an issue other than the one described here. I think this ticket is invalid.

Tue, Mar 26, 5:55 PM · Data-Engineering
JAllemandou renamed T360968: [Developer Experience] [SPIKE] Investigate process to automate deployment of folders and artifacts to HDFS from [Developer Experience] [SPIKE] Investigate process to automate deployment of hdfs artifacts to [Developer Experience] [SPIKE] Investigate process to automate deployment of folders and artifacts to HDFS.
Tue, Mar 26, 5:50 PM · Spike, Data-Engineering
JAllemandou added a comment to T357472: Add movement insights group/users to MWH denormalize job alerts.

Done using the Airflow variable mechanism.

Tue, Mar 26, 1:17 PM · Data-Engineering (Sprint 9), Movement-Insights, Data-Platform

Thu, Mar 21

JAllemandou added a comment to T358311: Check home/HDFS leftovers of goransm.

I have run our script to list user content in our various machines, the result is below.
@AndrewTavis_WMDE, I'll let you review; please let us know when you have copied the content you wish to keep, so that we can delete the rest.

Thu, Mar 21, 9:19 AM · Wikidata, Wikidata Analytics (Kanban), Data-Platform-SRE

Wed, Mar 6

JAllemandou renamed T359215: mediawiki_cirrussearch_request data is regularly late from mediawiki_cirrussearch_request refine job is regularly taking too long to run to mediawiki_cirrussearch_request data is regularly late.
Wed, Mar 6, 1:45 PM · Performance Issue, Data-Platform

Thu, Feb 29

JAllemandou added a comment to T347561: [Maintenance] Set up deletion jobs for Structured Data's data pipelines.

I think we're gonna use this ticket: https://phabricator.wikimedia.org/T262201

Thu, Feb 29, 10:22 AM · Data-Engineering (Sprint 9), Data Products, Structured-Data-Backlog
JAllemandou moved T345771: Adapt Sqoop to pagelinks schema change from In Review to Done on the Data-Engineering (Sprint 9) board.
Thu, Feb 29, 10:20 AM · Data-Engineering (Sprint 9), Data Products
JAllemandou changed the point value for T345771: Adapt Sqoop to pagelinks schema change from 8 to 3.
Thu, Feb 29, 8:32 AM · Data-Engineering (Sprint 9), Data Products
JAllemandou claimed T345771: Adapt Sqoop to pagelinks schema change.
Thu, Feb 29, 8:32 AM · Data-Engineering (Sprint 9), Data Products
JAllemandou moved T345771: Adapt Sqoop to pagelinks schema change from Next Up to In Review on the Data-Engineering (Sprint 9) board.
Thu, Feb 29, 8:32 AM · Data-Engineering (Sprint 9), Data Products
JAllemandou added a comment to T342911: Data Quality Issue: Wikitext History Job fail / rerun in Airflow.

Nothing done on my end - possibly one of the 2 jobs failed for real?

Thu, Feb 29, 7:31 AM · Data-Engineering (Q4 2024 April 1st - June 30th), Data Products, Movement-Metrics, Movement-Insights
JAllemandou added a comment to T355588: Modify ClickStreamBuilder pipeline to cope with pagelinks schema changes.

Indeed, the job will not be affected by next month's changes:
https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/blame/main/analytics/dags/clickstream/clickstream_monthly_dag.py#L66
We'll need to keep an eye out for when those change, though :)

Thu, Feb 29, 7:28 AM · Data-Engineering, Data Products

Wed, Feb 28

JAllemandou added a comment to T345771: Adapt Sqoop to pagelinks schema change.

We're gonna build a quick fix so that next month's sqoop run succeeds (null values in dropped fields for some projects).

Wed, Feb 28, 3:55 PM · Data-Engineering (Sprint 9), Data Products
JAllemandou moved T357859: Skip Wikidata when loading XML dumps to the Data Lake from Ready to Deploy to Done on the Data-Engineering (Sprint 9) board.
Wed, Feb 28, 11:30 AM · Patch-For-Review, Movement-Metrics, Data-Engineering (Sprint 9), Data Products, Movement-Insights

Tue, Feb 27

JAllemandou moved T357859: Skip Wikidata when loading XML dumps to the Data Lake from In Review to Ready to Deploy on the Data-Engineering (Sprint 9) board.
Tue, Feb 27, 8:14 PM · Patch-For-Review, Movement-Metrics, Data-Engineering (Sprint 9), Data Products, Movement-Insights
JAllemandou moved T357859: Skip Wikidata when loading XML dumps to the Data Lake from In progress to In Review on the Data-Engineering (Sprint 9) board.
Tue, Feb 27, 5:36 PM · Patch-For-Review, Movement-Metrics, Data-Engineering (Sprint 9), Data Products, Movement-Insights
JAllemandou moved T356363: [Refine Refactoring] Refactor refinery code for compatibility with Airflow integration from In progress to In Review on the Data-Engineering (Sprint 9) board.
Tue, Feb 27, 4:30 PM · Data-Engineering (Sprint 9), Patch-For-Review
JAllemandou moved T357859: Skip Wikidata when loading XML dumps to the Data Lake from Next Up to In progress on the Data-Engineering (Sprint 9) board.
Tue, Feb 27, 4:30 PM · Patch-For-Review, Movement-Metrics, Data-Engineering (Sprint 9), Data Products, Movement-Insights

Feb 22 2024

JAllemandou added a comment to T358196: [Presto] Use JWT authentication instead of Kerberos for cluster-internal communication.

I don't think this change would really affect query performance, but I'm in favor of doing it for the benefit of relieving some pressure from Kerberos.

Feb 22 2024, 5:47 PM · Data-Platform-SRE, Data-Platform
JAllemandou updated subscribers of T358205: Investigate late/delayed Airflow task failure notifications.

Thank you for the thorough investigation @BTullis !
This example gives us more traction on the need to move toward google-groups instead of using mailman.
Let's see how this could be prioritized (ping @Ahoelzl and @Gehel :) )

Feb 22 2024, 1:01 PM · Data-Platform-SRE (2024.02.12 - 2024.03.03), Data-Platform
JAllemandou moved T357419: Turn off ReportUpdater jobs no longer used from In Review to Done on the Data-Engineering (Sprint 9) board.
Feb 22 2024, 12:57 PM · Data-Engineering (Sprint 9)
JAllemandou renamed T358210: Delete reportupdater jobs data/puppet-code from Delete reportupdater jobs data to Delete reportupdater jobs data/puppet-code.
Feb 22 2024, 12:54 PM · Data-Engineering (Sprint 9)
JAllemandou created T358210: Delete reportupdater jobs data/puppet-code.
Feb 22 2024, 12:18 PM · Data-Engineering (Sprint 9)
JAllemandou added a comment to T354557: Dataset Config Store.

Have we looked around to see if there are existing 'dataset' config formats/specs we can already use?

Feb 22 2024, 12:15 PM · Epic, Data-Engineering

Feb 21 2024

JAllemandou moved T357419: Turn off ReportUpdater jobs no longer used from Next Up to In Review on the Data-Engineering (Sprint 9) board.
Feb 21 2024, 6:19 PM · Data-Engineering (Sprint 9)
JAllemandou claimed T357419: Turn off ReportUpdater jobs no longer used.
Feb 21 2024, 6:19 PM · Data-Engineering (Sprint 9)
JAllemandou added a comment to T357859: Skip Wikidata when loading XML dumps to the Data Lake.

Implementation plan:

Feb 21 2024, 6:12 PM · Patch-For-Review, Movement-Metrics, Data-Engineering (Sprint 9), Data Products, Movement-Insights
JAllemandou claimed T357859: Skip Wikidata when loading XML dumps to the Data Lake.
Feb 21 2024, 6:07 PM · Patch-For-Review, Movement-Metrics, Data-Engineering (Sprint 9), Data Products, Movement-Insights

Feb 12 2024

JAllemandou added a comment to T345771: Adapt Sqoop to pagelinks schema change.

Thank you so much @Ladsgroup for the recap.

Feb 12 2024, 3:37 PM · Data-Engineering (Sprint 9), Data Products

Feb 8 2024

JAllemandou updated subscribers of T345771: Adapt Sqoop to pagelinks schema change.

Hi @Ladsgroup,
I have a question for you: have all the projects been migrated to using the new linktarget table for the pagelinks table, even if their columns have not been removed?
I'm asking this for us to adapt our sqoop jobs, as we're starting to experience issues (only testwiki this month).

Feb 8 2024, 6:06 PM · Data-Engineering (Sprint 9), Data Products

Feb 7 2024

JAllemandou added a comment to T345771: Adapt Sqoop to pagelinks schema change.

This has started: the testwiki schema has changed.
I'd also like to talk about https://github.com/wikimedia/analytics-refinery/blob/master/python/refinery/sqoop.py#L622 as the linktarget table is considered private now.

Feb 7 2024, 4:12 PM · Data-Engineering (Sprint 9), Data Products
JAllemandou created T356866: [Data Quality] Update data_quality schemas to be compatible with Iceberg tables.
Feb 7 2024, 2:13 PM · Data-Engineering (Sprint 9)

Feb 5 2024

JAllemandou added a comment to T354692: [Data Quality] Implement basic data quality metrics for MW history.

Indeed! here is the code:
https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-job/src/main/scala/org/wikimedia/analytics/refinery/job/mediawikihistory/MediawikiHistoryChecker.scala

Feb 5 2024, 6:04 PM · Patch-For-Review, Data-Engineering (Sprint 9)

Feb 1 2024

JAllemandou added a comment to T356400: User aqsloader hasn't MODIFY permissions on image_suggestions.* Cassandra tables anymore.

The Data Engineering team has written some code for our cassandra-loading jobs to be able to read a password from a file on HDFS:
https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-spark/src/main/scala/org/wikimedia/analytics/refinery/spark/utils/WmfCassandraAuthConfFactory.scala

Feb 1 2024, 5:16 PM · Patch-For-Review, Discovery-Search (Current work), Structured-Data-Backlog (Current Work), User-Eevans, Cassandra, Data Products
JAllemandou added a comment to T324017: Set up Spark SQL Server.

While that could be useful, the spark-thrift server doesn't support user impersonation. The StackOverflow ticket I have read points to https://github.com/apache/kyuubi. We could investigate this.

Feb 1 2024, 4:20 PM · Data-Platform-SRE
JAllemandou moved T356363: [Refine Refactoring] Refactor refinery code for compatibility with Airflow integration from Next Up to In progress on the Data-Engineering (Sprint 8) board.
Feb 1 2024, 10:31 AM · Data-Engineering (Sprint 9), Patch-For-Review
JAllemandou moved T352669: [Iceberg Migration] Migrate aqs hourly tables to Iceberg from Ready to Deploy to Done on the Data-Engineering (Sprint 8) board.
Feb 1 2024, 10:31 AM · Data-Engineering (Sprint 8)
JAllemandou moved T352670: [Iceberg Migration] Migrate browser_general tables to Iceberg from In Review to Done on the Data-Engineering (Sprint 8) board.
Feb 1 2024, 10:30 AM · Data-Engineering (Sprint 8)
JAllemandou moved T349743: NEW BUG REPORT 12 new wikis missing from the mediawiki_history dataset from In Review to Done on the Data-Engineering (Sprint 8) board.
Feb 1 2024, 10:30 AM · Data-Engineering (Sprint 8)

Jan 30 2024

JAllemandou added a comment to T356112: Generated Data Platform (neé AQS): remove (unused/uneeded) test_spark3_loading keyspace.

Go for it :)

Jan 30 2024, 9:25 AM · Cassandra

Jan 29 2024

JAllemandou added a comment to T347561: [Maintenance] Set up deletion jobs for Structured Data's data pipelines.

Thanks a lot for not forgetting about this ticket @mfossati :)
The Data Engineering team is on the road toward providing you with (hopefully) an easy enough way to configure data deletion for your datasets.
In the meantime, manual deletion every now and then should be enough.
I don't think it's worth investing time on this before the new system comes in (probably a few months).
Is that ok for you?

Jan 29 2024, 2:31 PM · Data-Engineering (Sprint 9), Data Products, Structured-Data-Backlog
JAllemandou added a comment to T355920: DISCUSS: Relocate Generated Data Platform (neé AQS) test/dev tables?.

Does your tooling let you control size and throughput?

Jan 29 2024, 2:28 PM · Cassandra

Jan 26 2024

JAllemandou added a comment to T355920: DISCUSS: Relocate Generated Data Platform (neé AQS) test/dev tables?.

I think this is a good idea :)
The smaller size shouldn't be an issue as we should not test scalability but functions.

Jan 26 2024, 2:10 PM · Cassandra

Jan 24 2024

JAllemandou added a comment to T297944: Set up regular-repairs for AQS cassandra cluster tables.

The task is old but the objective is still valid IMO.
We should talk to @Eevans about this.

Jan 24 2024, 9:26 PM · Data-Engineering
JAllemandou closed T299961: Investigate Superset query templating as a mean to optimize partition pruning as Declined.

Closing as the strategy is to migrate to Iceberg.

Jan 24 2024, 8:46 PM · superset.wikimedia.org, Product-Analytics, Data-Engineering

Jan 23 2024

JAllemandou moved T309097: [Maintenance] We should have a top level maven parent pom based on wikimedia-discovery-discovery-parent-pom, from Blocked/Paused to Done on the Data-Engineering (Sprint 7) board.
Jan 23 2024, 2:54 PM · Data-Engineering (Sprint 7), Patch-For-Review, Java-Scala-Standardization, Discovery-Search, Data Pipelines

Jan 19 2024

JAllemandou triaged T355391: Fix refinery-source.refinery-core.Utilities::getValueForKey as Unbreak Now! priority.
Jan 19 2024, 8:18 AM · Data-Engineering (Sprint 7)
JAllemandou created T355391: Fix refinery-source.refinery-core.Utilities::getValueForKey.
Jan 19 2024, 8:18 AM · Data-Engineering (Sprint 7)

Jan 18 2024

dcausse awarded T355352: Users in archiva-deployer group can't upload artifacts anymore. a Love token.
Jan 18 2024, 6:42 PM · Data-Platform-SRE (2024.01.22 - 2024.02.11), Data-Engineering (Sprint 7)
JAllemandou moved T354696: [Refine System] Define a concept and an approach for refactoring the Refine system from Next Up to In Review on the Data-Engineering (Sprint 7) board.
Jan 18 2024, 6:30 PM · Data-Engineering (Sprint 7)
JAllemandou moved T309097: [Maintenance] We should have a top level maven parent pom based on wikimedia-discovery-discovery-parent-pom, from In Review to Blocked/Paused on the Data-Engineering (Sprint 7) board.
Jan 18 2024, 6:30 PM · Data-Engineering (Sprint 7), Patch-For-Review, Java-Scala-Standardization, Discovery-Search, Data Pipelines
JAllemandou added a comment to T309097: [Maintenance] We should have a top level maven parent pom based on wikimedia-discovery-discovery-parent-pom,.

Blocked on https://phabricator.wikimedia.org/T355352

Jan 18 2024, 6:30 PM · Data-Engineering (Sprint 7), Patch-For-Review, Java-Scala-Standardization, Discovery-Search, Data Pipelines
JAllemandou moved T355352: Users in archiva-deployer group can't upload artifacts anymore. from Next Up to Radar (External Teams) on the Data-Engineering (Sprint 7) board.
Jan 18 2024, 6:30 PM · Data-Platform-SRE (2024.01.22 - 2024.02.11), Data-Engineering (Sprint 7)
JAllemandou updated the task description for T355352: Users in archiva-deployer group can't upload artifacts anymore..
Jan 18 2024, 6:29 PM · Data-Platform-SRE (2024.01.22 - 2024.02.11), Data-Engineering (Sprint 7)
JAllemandou updated the task description for T355352: Users in archiva-deployer group can't upload artifacts anymore..
Jan 18 2024, 6:29 PM · Data-Platform-SRE (2024.01.22 - 2024.02.11), Data-Engineering (Sprint 7)
JAllemandou created T355352: Users in archiva-deployer group can't upload artifacts anymore..
Jan 18 2024, 6:29 PM · Data-Platform-SRE (2024.01.22 - 2024.02.11), Data-Engineering (Sprint 7)

Jan 11 2024

JAllemandou moved T309097: [Maintenance] We should have a top level maven parent pom based on wikimedia-discovery-discovery-parent-pom, from In progress to In Review on the Data-Engineering (Sprint 7) board.
Jan 11 2024, 6:59 PM · Data-Engineering (Sprint 7), Patch-For-Review, Java-Scala-Standardization, Discovery-Search, Data Pipelines
JAllemandou renamed T354803: Give user joal the right to create branches in the `wmf-jvm-parent-pom` and `wmf-maven-tool-configs` gitlab projects from Give user joal the right to create branches in the `wmf-jvm-parent-pom` gitlab project to Give user joal the right to create branches in the `wmf-jvm-parent-pom` and `wmf-maven-tool-configs` gitlab projects.
Jan 11 2024, 1:34 PM · GitLab (Auth & Access), Release-Engineering-Team

Jan 10 2024

JAllemandou closed T354803: Give user joal the right to create branches in the `wmf-jvm-parent-pom` and `wmf-maven-tool-configs` gitlab projects as Resolved.

@brennen has updated my rights on gitlab, giving me ownership rights on the project. Problem solved.

Jan 10 2024, 9:14 PM · GitLab (Auth & Access), Release-Engineering-Team
JAllemandou created T354803: Give user joal the right to create branches in the `wmf-jvm-parent-pom` and `wmf-maven-tool-configs` gitlab projects.
Jan 10 2024, 9:09 PM · GitLab (Auth & Access), Release-Engineering-Team

Jan 9 2024

JAllemandou added a comment to T309097: [Maintenance] We should have a top level maven parent pom based on wikimedia-discovery-discovery-parent-pom,.

The steward subgroup is to support wikimedia stewards, and therefore not the correct place for our project. We decided to put it in ci-tools, even if it's probably a bit of a stretch :)

Jan 9 2024, 1:58 PM · Data-Engineering (Sprint 7), Patch-For-Review, Java-Scala-Standardization, Discovery-Search, Data Pipelines

Dec 12 2023

JAllemandou added a comment to T346463: Identify and label prefetch proxy data in our traffic.

So ya let's go with VCL!

+1

Dec 12 2023, 2:01 PM · Traffic, Movement-Insights, Data-Engineering

Dec 11 2023

JAllemandou added a comment to T350009: Coalesce SEAL output.

Now output files dropped to 1k! 🎉

Dec 11 2023, 6:17 PM · Structured-Data-Backlog (Current Work), Image-Suggestions

Dec 7 2023

JAllemandou updated subscribers of T309097: [Maintenance] We should have a top level maven parent pom based on wikimedia-discovery-discovery-parent-pom,.

Hi @brennen - I've been told you could be the one to ask this question to: I'd like to create a new gitlab project for our global JVM POM file, reused globally at the foundation (therefore not under a team's name). I have identified the ci-tools subgroup and the stewards subgroup, and wondered if you thought the latter would be good? Thanks

Dec 7 2023, 5:42 PM · Data-Engineering (Sprint 7), Patch-For-Review, Java-Scala-Standardization, Discovery-Search, Data Pipelines

Dec 5 2023

JAllemandou added a comment to T352577: [blocker] Airflow unittests failing with TypeError: Pool.create_or_update_pool().

You guys rock <3

Dec 5 2023, 8:28 PM · Data-Platform-SRE, Data-Engineering, Data Products
JAllemandou added a comment to T340863: Mechanism for error logging when doing MERGE INTO.

I'm also eager to check if we run into parquet-decompression issues as I think could happen. Thanks a lot for running those experiments @xcollazo :)

Dec 5 2023, 8:27 PM · Data Products (Data Products Sprint 05), Patch-For-Review, Dumps 2.0
JAllemandou updated subscribers of T346463: Identify and label prefetch proxy data in our traffic.

@JAllemandou how complex are the changes? Is it a quick patch to get in or do we need more discussion?

Dec 5 2023, 5:22 PM · Traffic, Movement-Insights, Data-Engineering
JAllemandou added a comment to T346463: Identify and label prefetch proxy data in our traffic.

If we start having data about which webrequest hits are prefetch or not, we definitely would be able to investigate! I'm in favor of moving fast and passing this header through as a new webrequest field. No change would be needed in Gobblin, only in the wmf_raw.webrequest and wmf.webrequest schemas, as well as in the refine_webrequest hql to forward the field.

Dec 5 2023, 2:24 PM · Traffic, Movement-Insights, Data-Engineering
JAllemandou awarded T350106: Implement a spark job that converts a RDF triples table into a RDF file format a Burninate token.
Dec 5 2023, 8:39 AM · Discovery-Search (Current work), Wikidata-Query-Service, Wikidata
JAllemandou updated subscribers of T326386: Use of "self" in callables is deprecated in php8.2 from liuggio/statsd-php-client package.

Thanks for the ping @Jdforrester-WMF. Data-engineering has not been using statsd as far as I know. We have helped the performance team in some of its usage if I recall correctly, it was work done with @Krinkle and Gilles, but we have not been maintaining or using statsd.
Let's talk and see how statsd is used nowadays, as the doc is old and talks about Graphite: https://wikitech.wikimedia.org/wiki/Graphite#Data_sources, https://wikitech.wikimedia.org/wiki/Statsd

Dec 5 2023, 8:37 AM · MediaWiki-Platform-Team, MediaWiki-libs-Stats, PHP 8.2 support, Upstream, MediaWiki-Vendor

Dec 4 2023

JAllemandou added a comment to T350009: Coalesce SEAL output.

As discussed on Slack, I would suggest using dataframe.repartition(X) for the features datasets as data is relatively small and using coalesce impairs job scalability (and the number of files is far too big in comparison to the data size :).

Dec 4 2023, 4:20 PM · Structured-Data-Backlog (Current Work), Image-Suggestions
JAllemandou updated subscribers of T352577: [blocker] Airflow unittests failing with TypeError: Pool.create_or_update_pool().

Just added data-platform-SRE project to the list of projects. Ping @BTullis on this as well.

Dec 4 2023, 4:02 PM · Data-Platform-SRE, Data-Engineering, Data Products
JAllemandou added a project to T352577: [blocker] Airflow unittests failing with TypeError: Pool.create_or_update_pool(): Data-Platform-SRE.
Dec 4 2023, 4:01 PM · Data-Platform-SRE, Data-Engineering, Data Products
JAllemandou added a comment to T291464: Upgrade analytics-hadoop to Spark 3 + scala 2.12.

I think it can be closed, yes - the remaining tasks are almost all about Refine improvements we can now implement thanks to the move to Spark3.

Dec 4 2023, 10:58 AM · Epic, Data-Engineering
JAllemandou awarded T352534: [airflow] Inserting task notes is not working since upgrade to version 2.7.3 a Pirate Logo token.
Dec 4 2023, 10:55 AM · Data-Platform-SRE (2023.12.01 - 2023.12.31)

Dec 1 2023

JAllemandou added a comment to T350009: Coalesce SEAL output.

Thanks for the ping @mfossati - 79k files is still quite a lot - would you mind telling me more about the data (size, partition scheme, etc.)?

Dec 1 2023, 1:29 PM · Structured-Data-Backlog (Current Work), Image-Suggestions

Nov 30 2023

JAllemandou added a comment to T351909: Duplicate keys in x_analytics header corrupt some wmf_raw.webrequest rows and break refinement of wmf.webrequest.

@gmodena is working on adding data-quality metrics on the webrequest dataset (https://phabricator.wikimedia.org/T349763), and we added this one (duplicate map keys) in the list of things to have in the POC.
We shall have some data one of these days :)

Nov 30 2023, 6:15 PM · Data Products (Data Products Sprint 04), Data-Engineering
JAllemandou added a comment to T347558: [S] Coalesce section alignment image suggestions output.

Thanks folks :) This will make HDFS a lot happier <3

Nov 30 2023, 2:00 PM · Data-Engineering (Sprint 6), Patch-For-Review, Structured-Data-Backlog (Current Work), Section-Level-Image-Suggestions

Nov 29 2023

JAllemandou moved T309097: [Maintenance] We should have a top level maven parent pom based on wikimedia-discovery-discovery-parent-pom, from Next Up to In progress on the Data-Engineering (Sprint 5) board.
Nov 29 2023, 2:08 PM · Data-Engineering (Sprint 7), Patch-For-Review, Java-Scala-Standardization, Discovery-Search, Data Pipelines
JAllemandou moved T349746: [Data Platform] Document proposal for data-product configuration store from In progress to In Review on the Data-Engineering (Sprint 5) board.
Nov 29 2023, 1:37 PM · Data-Engineering (Sprint 5)

Nov 27 2023

JAllemandou added a comment to T351388: Add a spark global config for better file commit strategy.

It's in my plan to update the docs @mpopov, it just takes longer than I would like (like everything else I do lately).

Nov 27 2023, 4:18 PM · Data-Engineering (Sprint 5), Data-Platform-SRE
JAllemandou added a comment to T311743: [Iceberg] Epic: Icebergify event_sanitized database.

Relevant slack discussion: https://app.slack.com/client/E012JBDTTHA/CSV483812

Nov 27 2023, 4:09 PM · Data-Engineering, Data Pipelines, Epic
JAllemandou added a comment to T351909: Duplicate keys in x_analytics header corrupt some wmf_raw.webrequest rows and break refinement of wmf.webrequest.

Discussed during Data Engineering standup: let's fix with spark.sql.mapKeyDedupPolicy=LAST_WIN, and implement the metric monitoring how often it happens :)
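
For illustration, the last-win behaviour and the duplicate-key metric mentioned above can be sketched in plain Python. This is not the Spark implementation and not the team's job code: the function names and the sample header are hypothetical, and the `key=value;key=value` layout of the x_analytics header is an assumption made for the example. In Spark itself, `spark.sql.mapKeyDedupPolicy` accepts `EXCEPTION` (the default) or `LAST_WIN`.

```python
from collections import Counter


def _pairs(header: str):
    """Yield (key, value) tuples from a 'k1=v1;k2=v2' string (assumed layout)."""
    for fragment in header.split(";"):
        key, sep, value = fragment.partition("=")
        key = key.strip()
        if sep and key:  # skip fragments that are not key=value pairs
            yield key, value.strip()


def parse_last_win(header: str) -> dict:
    """Build a map where a repeated key keeps the last value seen."""
    parsed = {}
    for key, value in _pairs(header):
        parsed[key] = value  # a later assignment overwrites an earlier one
    return parsed


def duplicate_key_count(header: str) -> int:
    """Count distinct keys that occur more than once (a candidate metric)."""
    counts = Counter(key for key, _ in _pairs(header))
    return sum(1 for n in counts.values() if n > 1)


sample = "ns=0;page_id=42;ns=6"  # made-up header with a duplicated 'ns' key
print(parse_last_win(sample))
print(duplicate_key_count(sample))
```

Counting duplicates separately from resolving them keeps the fix (last value wins) and the monitoring (how often it happens) independent of each other.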

Nov 27 2023, 3:32 PM · Data Products (Data Products Sprint 04), Data-Engineering

Nov 24 2023

JAllemandou added a comment to T351731: Turnilo: invalid transforms on wmf_netflow dashboard.

Thanks a lot @elukey for quickly troubleshooting <3

Nov 24 2023, 12:04 PM · Data-Engineering

Nov 21 2023

JAllemandou added a comment to T351388: Add a spark global config for better file commit strategy.

I have added documentation here: https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/Airflow/Developer_guide#Concurrent_DAG_Runs
I think we can go and change our global config to use 1 as the default parallelisation of dag_runs per dag.

Nov 21 2023, 1:23 PM · Data-Engineering (Sprint 5), Data-Platform-SRE

Nov 16 2023

JAllemandou updated subscribers of T351388: Add a spark global config for better file commit strategy.

ping @mpopov , @xcollazo , @Ottomata and @Milimetric :)

Nov 16 2023, 5:57 PM · Data-Engineering (Sprint 5), Data-Platform-SRE
JAllemandou added a comment to T351388: Add a spark global config for better file commit strategy.

After giving it some more thought, it seems that NOT changing the parameter globally, to enforce data-correctness in folders, is the best idea. We would document the concurrency issue in wikitech and airflow, and give that setting as the solution for backfilling times.
I'm interested in feedback if anyone thinks it's not the best idea :)

Nov 16 2023, 5:55 PM · Data-Engineering (Sprint 5), Data-Platform-SRE
JAllemandou added a comment to T351388: Add a spark global config for better file commit strategy.

Thanks a lot @BTullis - The problem you linked is indeed a known issue. We rely on hive-metastore and _SUCCESS files which should prevent the issue on prod jobs. For other jobs, this is indeed a problem.
I see 2 possible solutions here:

  • We keep the change you made and see if it causes problems.
  • We don't apply the change but provide a solution for people backfilling by telling them to use that parameter in their jobs.

I'm not sure which is best.

Nov 16 2023, 5:28 PM · Data-Engineering (Sprint 5), Data-Platform-SRE
JAllemandou added a comment to T349523: Update spark warehouse configuration to use the same as Hive.

I used the short form of the URI - e.g. hdfs:///user/hive/warehouse in the spark3-defaults file, which I believe should work on both the staging and production clusters due to our use of fs.defaultFS in /etc/hadoop/conf/core-site.xml

That's great :)
Thanks a lot @BTullis :)

Nov 16 2023, 2:31 PM · Data-Platform-SRE
JAllemandou added a comment to T347076: NEW BUG REPORT Some DAG run attempts fail because File *_temporary/0 does not exist..

Oh hey, looks like we're not alone: https://community.cloudera.com/t5/Support-Questions/How-to-change-Spark-temporary-directory-when-writing-data/m-p/237389 - there appears to be a possible solution (a minor config change) which may fix this.

Nov 16 2023, 11:21 AM · Data-Engineering
JAllemandou created T351388: Add a spark global config for better file commit strategy.
Nov 16 2023, 11:21 AM · Data-Engineering (Sprint 5), Data-Platform-SRE
JAllemandou added a comment to T346646: Move wmf_dumps.wikitext_rc1 to the correct HDFS directory.
Nov 16 2023, 8:02 AM · Data Products (Data Products Sprint 04), Dumps 2.0

Nov 15 2023

JAllemandou moved T350920: Iceberg unique devices table reporting incorrect numbers for 2023-10-01 from In Review to Done on the Data-Engineering (Sprint 5) board.
Nov 15 2023, 5:25 PM · Data-Engineering (Sprint 5), Movement-Insights, Data-Platform
JAllemandou added a comment to T350920: Iceberg unique devices table reporting incorrect numbers for 2023-10-01.

This has been corrected - data should be ok now. Sorry for the inconvenience.

Nov 15 2023, 5:25 PM · Data-Engineering (Sprint 5), Movement-Insights, Data-Platform
JAllemandou added a comment to T347076: NEW BUG REPORT Some DAG run attempts fail because File *_temporary/0 does not exist..

Super interesting finding!

Nov 15 2023, 5:24 PM · Data-Engineering

Nov 13 2023

JAllemandou added a comment to T349746: [Data Platform] Document proposal for data-product configuration store.

The document is here.

Nov 13 2023, 11:08 AM · Data-Engineering (Sprint 5)
JAllemandou moved T350920: Iceberg unique devices table reporting incorrect numbers for 2023-10-01 from Next Up to In Review on the Data-Engineering (Sprint 5) board.
Nov 13 2023, 11:07 AM · Data-Engineering (Sprint 5), Movement-Insights, Data-Platform
JAllemandou added a comment to T350920: Iceberg unique devices table reporting incorrect numbers for 2023-10-01.

Thank you @Hghani for this finding.
All unique-devices iceberg insertion jobs were incorrect.
The patch above is the correction. Once it is merged and deployed, I'll clean up the tables.

Nov 13 2023, 11:06 AM · Data-Engineering (Sprint 5), Movement-Insights, Data-Platform