
Outreachy round 21: experiences and outcomes

Wikimedia’s participation in Outreachy Round 21 focused on projects related to data science and engineering. In this post, the interns share the outcomes and experiences of their projects.

Edited by Sarah R. Rodlund and Srishti Sethi

Outreachy round 21 group photo, Pavithraes, CC BY-SA 4.0

“Outreachy is a diversity initiative that provides paid, remote internships to people subject to systemic bias and impacted by underrepresentation in the technical industry where they are living.” Learn more here: https://www.outreachy.org/

At Wikimedia, we ran Outreachy Round 21 a bit differently and encouraged projects related to data science and engineering. By focusing on a particular theme, we intended to give interns opportunities to interact with and support each other and to have a fulfilling experience. To that end, we made extra outreach efforts to reach mentors and potential candidates with a background in data science. Wikimedia received over 28 applications and accepted 7 interns to work on projects (5 were data-science related) in this round. We interviewed our interns to gather their experiences. We asked what the process was like for them, about their successes and challenges, and what advice they may have for future applicants.

Wiki-Reliability: A Large Scale Dataset for Content Reliability on Wikipedia

Link to project (↪)

KayYen Wong (User:0xkaywong)

The quality and reliability of Wikipedia content are maintained by a community of volunteer editors. Machine learning and information retrieval algorithms could help scale up editors’ manual efforts around Wikipedia content reliability. However, there is a lack of large-scale data to support the development of such research. To fill this gap, we released Wiki-Reliability, the first large-scale machine learning dataset of English Wikipedia articles annotated with a wide set of content reliability issues. 

This involved researching, gathering, and processing Wikipedia-related data about the content reliability of articles. We detected the crowd-generated tags or labels that Wikipedia editors and developers currently use to signal content-integrity problems to other editors, and used them to create machine-readable datasets that could allow ML systems to detect problematic content automatically. During the project, we also benchmarked the quality of the datasets by running different ML algorithms, which can act as baselines for future researchers.
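As a rough illustration of the labeling idea described above, the sketch below scans an article's wikitext for maintenance templates and turns their presence into a binary label. It is only a minimal sketch: the template list, the function names, and the regular expression are assumptions made for illustration, not the pipeline that was used to build Wiki-Reliability.

```python
import re

# Hypothetical, illustrative list of maintenance templates; the templates
# actually used to build Wiki-Reliability may differ.
RELIABILITY_TEMPLATES = {"pov", "unreferenced", "original research", "citation needed"}

# Captures the name at the start of a template, e.g. "{{POV|date=May 2020}}" -> "POV"
TEMPLATE_NAME = re.compile(r"\{\{\s*([^|{}]+?)\s*(?:\||\}\})")


def find_reliability_templates(wikitext, templates=RELIABILITY_TEMPLATES):
    """Return the set of known reliability templates present in the wikitext."""
    found = set()
    for match in TEMPLATE_NAME.finditer(wikitext):
        # Template names are matched case-insensitively in this sketch.
        name = match.group(1).strip().lower().replace("_", " ")
        if name in templates:
            found.add(name)
    return found


def label_revision(wikitext, template):
    """Label a revision 1 if it carries the given template, else 0."""
    return int(template in find_reliability_templates(wikitext))


sample = "{{POV|date=May 2020}}\nSome text.{{citation needed}} {{ Unreferenced }}"
print(find_reliability_templates(sample))
```

A real pipeline would also need to handle template redirects and aliases, which this sketch ignores.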

Why did you decide to participate in Outreachy?

Prior to applying for Outreachy, I had a background as a machine learning engineer and had research experience in natural language processing. I’ve always been passionate about applying tech for social good, and had some experience with civic technology, from my internship at a Malaysian Civic Tech NGO. This formed much of the foundation for my interest in getting into open-source, which I believe at its core shares much of the same values of social computing and technology for the greater good.

I was excited when I found out about the Outreachy program, which aims to support diversity (especially for underrepresented communities) in open-source and free software. I’ve always been curious about open-source but was never really sure where to start. I found the idea of contributing towards FOSS projects on GitHub intimidating. The idea of meddling with a larger codebase and having to bear the judgment of impersonal reviewers formed something of a personal (and perhaps illogical) barrier for me.

It’s why the Outreachy program, with the opportunity of more personal guidance from experienced mentors in OSS, appealed to me. What pushed me to finally apply was finding out Wikimedia’s available projects during Round 21. I’ve always aligned with Wikimedia’s values of free knowledge, open-source, and transparency. The opportunity to work towards these values through my Outreachy internship with Wikimedia is something I’m truly grateful for.

What was the process like? 

I found that the application stage for the Outreachy program helped to set aside a lot of my anxieties in getting into open-source. It can be daunting to make that first push (or pull (request)). Fortunately, just the application period itself, which required that applicants make contributions to project repositories, served to provide the momentum for me to put my anxiety behind me and make some contributions. 

I was fortunate to have the opportunity to work under the mentorship and support of my wonderful mentors Miriam Redi and Diego Saez-Trumper throughout my internship. They were always very responsive in providing me with help on the technical front and with invaluable guidance that helped me grow throughout my open-source journey. One minor part of my experience that I feel may have differed from that of other Outreachy interns is that, although I did contribute code and push it to a repository, I never made any pull requests or changes to an existing repository as one might expect during an open-source project. That’s because I was working with Wikimedia Research on a more research-oriented project. This was a bit surprising to me, but it goes to show the variety of ways one can be involved in open-source!

Since starting on the project, I’ve learned a lot about the Wikimedia ecosystem and how it pertains to the goals of the project. I’ve learned a lot not only from the technical side of working with Wikipedia’s tools and APIs but also gained a deep appreciation of the Wikipedia community, their rich set of community-governed policies and guidelines, and how they inform the inner workings of Wikipedia! 

Can you share some of your challenges and/or successes?

A challenge I faced in my project was definitely in changing expectations in my project timeline.

My initial timeline proposal divided the three main phases of data processing, data analysis, and model benchmarking equally across the three months of my internship. Looking back on my proposed timeline which I submitted for my internship application, I realized I had completely underestimated the amount of time it would take to complete the first phase of the project. Going into the project, I was sure that I would be done with the first phase within three weeks, and would be able to get started on the interesting data analysis and modeling part in no time. It only became clear to me (as I progressed along with the project, and after numerous meetings with my mentors) that the first phase was actually the main goal and deliverable of the project, and that the task at hand was more complicated than I gave it credit for. 

I had underestimated the importance of certain aspects of a task and overestimated others. It took learning more about the task itself to actually understand its complexities and where to focus my efforts. Fortunately, even though I was behind on my initial estimate of the project phases, I was still on track to obtain the main deliverables by the end of my internship. That’s because the first phase of my project warranted the additional time I spent on it, while the remaining phases, which I had thought were more important, didn’t require as much of a time commitment.

Such things are difficult to get a feel for until you actually dig your toes further into the project and gain more experience. This is especially difficult because Outreachy interns only have three months to work on their project (and those three months pass by sooner than you’d expect). The main takeaway is that one should be prepared to manage their expectations with regard to changes in timeline and to deal flexibly with any unexpected changes.

Do you have any advice for others who are thinking of participating in Outreachy?

I strongly encourage future applicants to just apply for the program, and most importantly, CONTRIBUTE (!!!) as even the contributions period is beneficial in itself for getting familiarised with open-source. I learned a lot just from the contributions period itself, both about the project and about how to submit pull requests. Even if you don’t get in, the contributions period provides some much-needed momentum to get started as a contributor!

Besides that, I also encourage participants to be active during the contributions period, if possible. I was active in the task forums and would try my best to help answer any questions from other participants, which I felt helped me get noticed!

Is there anything else you would like to share with us?

Our Wiki-Reliability research was recently accepted as a resource track paper in the SIGIR’2021 research conference!

Schema of the Wiki-reliability metadata datasets screenshot, KayYen Wong, CC BY-SA 4.0

Build a tool for inferring what countries are associated with a given Wikipedia article

Link to project (↪)

Jesse Amamgbu

The project answers the question about the relevance of a country in a Wikipedia article. It is aimed at helping editors identify Wikipedia articles relevant to a certain country/region and this, in turn, would help them improve these articles. 

Why did you decide to participate in Outreachy?

I was looking for opportunities to get started in open-source, especially in data science. I have always wanted to make an impact with data but not a lot of opportunities presented themselves especially in relation to data science.

What was the process like? 

I made sure to apply as early as I could so I would have ample time to review and tweak my application. Once I got past the initial application, I made sure to contact previous alums to learn about their experience and what made them stand out, as I felt so lost initially. As time went on, I had a better grasp of the project and also tried to help others out, and I guess this played a role in my selection as an Outreachy intern working with the Wikimedia Foundation.

The mentors I had were fantastic. Communication was so easy. I am naturally introverted, but the friendly nature of my mentors helped me open up to some issues I faced during the internship. Working on the network inference model was really challenging. There were days I doubted myself but with help from mentors and self-determination, I was able to complete the requirements for the project.

I also got to learn new technologies. I learned Vue.js with the help of my mentors and was able to rebuild the inference tool UI, moving it from HTML/CSS/jQuery to Vue.js.

Can you share some of your challenges and/or successes?

While building the network inference model, I faced high memory consumption and latency issues. This got me frustrated at some point and I started to have impostor syndrome. But I remembered that “no one is an island” and that mentors are part of the Outreachy program to assist us in situations like this. Through advice from my mentor, Isaac, I was able to apply a modified version of the Huffman coding approach to sort out this problem and recorded a reduction of over 200% in memory consumption as well as faster computations.
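For readers unfamiliar with the technique, here is a minimal sketch of classic Huffman coding, assuming that is the approach referred to above. The interview does not say how the approach was modified or how it was wired into the model, so this shows only the textbook idea: frequent symbols get shorter bit strings, which is where the space savings come from. The function name and the sample input are illustrative.

```python
import heapq
from collections import Counter


def huffman_codes(frequencies):
    """Build a prefix code that gives frequent symbols shorter bit strings."""
    # Each heap entry: (total frequency, tie-breaker, {symbol: code so far}).
    # The unique tie-breaker keeps the heap from ever comparing the dicts.
    heap = [(freq, i, {sym: ""}) for i, (sym, freq) in enumerate(frequencies.items())]
    if not heap:
        return {}
    if len(heap) == 1:
        # A single symbol still needs one bit.
        return {sym: "0" for sym in heap[0][2]}
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        # Merge the two least frequent subtrees, prefixing their codes.
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (f1 + f2, counter, merged))
        counter += 1
    return heap[0][2]


codes = huffman_codes(Counter("aaaabbc"))
print(codes)
```

In this sample, the most frequent symbol ends up with a one-bit code while the rarer ones get two bits each.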

Do you have any advice for others who are thinking of participating in Outreachy?

I would advise anyone thinking of participating in Outreachy to come with the mindset to learn. Do not be scared of failure. It is one of the sure ways you can develop yourself. A failure is just an idea that did not work out as planned. The beauty of it is your ability to go back to the drawing board and look deeply into why that idea did not work, and that should put you on your path to success.

Is there anything else you would like to share with us?

Outreachy made me fall in love with open-source. I am eager to contribute as a data scientist with more open-source data. This has really been a wild but exciting experience.

Article classification screenshot, Jesse Amamgbu, CC BY-SA 4.0
The country relevance of “La Pintana,” screenshot, Jesse Amamgbu, CC BY-SA 4.0

Evaluate Microsoft Playwright as a replacement for our browser automation

Link to project (↪)

Harriet Ayugi

The Playwright evaluation is a result of a need to check if WebdriverIO is still a good test automation framework compared to some of the best non-Selenium modern test automation frameworks. MediaWiki is implemented in a large number of repositories. So, in order to ensure good code practices across all these repositories, an extensive amount of testing is performed. One of the tests performed is an end-to-end test. WebdriverIO is the current browser automation framework being used for implementing end-to-end tests. However, with the recent increase in awareness about end-to-end testing, a number of equally competitive non-Selenium solutions have been introduced and one of them is Microsoft Playwright.

Why did you decide to participate in Outreachy?

Outreachy offers a great opportunity and below are a couple of reasons for my participation:

  • The opportunity to work on hands-on projects while learning.
  • Outreachy gives a platform of exposure whereby one is able to meet and connect with many software developers worldwide.

What was the process like?

Generally, the process was really great: 

  • Being able to contribute to various open-source projects during the contribution phase was a time of exploration and learning too.
  • The mentors are so understanding and patient when dealing with interns which gives room for learning and making mistakes while delivering work.
  • The project was a bit challenging at the start because there were a lot of different tools to set up but with the help of my mentors, it was balanced.

Can you share some of your challenges and/or successes?

Successes:

  • Being able to meet the milestones within the three-month period of the internship.
  • Learning about automation testing using Microsoft Playwright and WebdriverIO, and tools like GitHub Actions, among others.
  • Receiving valuable career advice from my mentors and the Outreachy organizers/communities.

Challenges:

  • Setting up the project at the start was a bit of a challenge.

Do you have any advice for others who are thinking of participating in Outreachy?

I do encourage anyone who wants to participate in Outreachy to always try to contribute to projects that may seem hard. There is always room to learn and rise above the challenges.

Is there anything else you would like to share with us?

I would like to thank the Outreachy organizers and communities for the program. It has been a great opportunity to advance my software developer career through making contributions to open-source projects and creating long-lasting connections.

Review and improve Lua documentation on meta and MediaWiki

Link to project (↪)

Ogechi Vivian Okey (User: Gechy)

Lua is a scripting language supported in all Wikimedia Foundation sites (since March 2013), via the Scribunto extension. The Lua project aims to make it possible for MediaWiki end-users to use a proper scripting language that will be more powerful and efficient than ad-hoc ParserFunctions-based logic. Complex templates and ParserFunctions cause a lot of performance problems and bottlenecks (some pages are overloaded with templates and require 40 seconds or more to parse/render).

The Lua documentation that exists on Meta-Wiki and mediawiki.org needs some improvement. I worked on improving the documentation to explain what Lua is, why to use Lua, and how to get started with Lua on MediaWiki.

The project was aimed at providing well-organized and up-to-date documentation for contributors/volunteers and getting them started with Lua on MediaWiki.

Why did you decide to participate in Outreachy?

I have always been enthusiastic about contributing to the open-source community. When I found out about the Outreachy program it was a great opportunity to get started with my journey to contributing to open-source projects.

What was the process like?

I must say success is a journey, not a destination. The process was made up of three interesting phases:

The application phase

It all started with the application phase, which was all about writing some short essays on the application form and submitting them for screening. I was super excited when I got the mail that I was shortlisted to proceed to the contribution phase.

The contribution phase

We had a list of projects and organizations to work with during the contribution phase, which lasted one month. I went through the list of projects and identified two projects I would love to work on. I settled on the Wikimedia Foundation due to my interest in scripting and automation. Improving the documentation for Lua was an opportunity to learn more about the Lua scripting language. During this phase, I got familiar with editing a wiki page and with tools like Gerrit, Phabricator, and Zulip, which were used for collaboration. At the end of the contribution phase, I submitted a proposal for “Review and improve Lua documentation on meta and MediaWiki.”

The acceptance phase

On 24th November 2020, I received the acceptance mail as an intern to work on the project under the mentorship of Pavithra and Doug. They made onboarding and getting started seamless and fun. My mentors provided a lot of resources and follow-up sessions to get me up to speed with the task. As an intern, I got blog prompts to publish at intervals and had regular check-ins with mentors to get feedback on the work done so far. It was an exciting three months with a great learning curve.

Can you share some of your challenges and/or successes?

Being a newbie to technical writing comes with its own challenges and struggles, such as trying to simplify and provide explanations and adequate references for my users. I had a couple of challenges, ranging from getting started with editing on the wiki, to getting an overview of Lua scripting, to drafting the content so as not to put more information than needed on the introductory page. I shared these blockers with my mentors, and a follow-up session was set up to walk me through the process of overcoming the issues at hand.

Do you have any advice for others who are thinking of participating in Outreachy?

For every milestone, there is always a starting point, and it comes with its own challenges and struggles. It’s okay to feel stuck at some point while achieving the set goals and tasks because everyone struggles, including experts. Hey! I am no exception to this rule. It doesn’t matter what other people think or say; struggling is part of life. There is no better time than now. It’s a great opportunity and I would love you to be part of this great community.

Is there anything else you would like to share with us?

It was an amazing journey. I would like to say a big thank you to the team behind Outreachy. Without you, this wouldn’t have been possible. Kudos!


Analyzing community-authored functions to help find important and similar modules for Abstract Wikipedia

Link to project (↪)

Aisha Khatun and Liudmila (Jade) Kalina

Abstract Wikipedia is a project in development with the bold aim of making Wikipedia language independent. With knowledge stored in a language-independent manner, community-authored functions (aka Modules) also need to be centralized. With this in mind, our task was to analyze all modules across all wiki projects and all languages to find:

  1. What modules are important (more used, more edited, etc.)
  2. What modules seem to perform similar work across wikis.

By identifying important modules, we can start the process of centralizing these modules. And by finding similar modules editors can start merging modules across wiki projects that perform similar tasks. 

For this project we built a data pipeline to fetch all the modules (Lua code) plus various information about them, such as how many pages a module is used in and how often it is edited. Then we applied a heuristic-based statistical approach to find important modules from the distribution of these features. We used an unsupervised machine learning clustering algorithm to identify similar modules.

Lastly, we built a web interface to access the results https://abstract-wiki-ds.toolforge.org/. In this interface, we can choose the importance of various fields, and click submit to generate a list of modules that are important to us. And on clicking the individual page link, a list of similar modules is displayed.
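To make the idea of user-weighted importance concrete, here is a minimal sketch that ranks modules by a weighted sum of per-feature percentile ranks. It is only an illustration under stated assumptions: the feature names, module names, weights, and the percentile-rank scoring are hypothetical, and the project's actual heuristics may differ.

```python
from bisect import bisect_right


def percentile_ranks(values):
    """Map each value to the fraction of values that are <= it."""
    ordered = sorted(values)
    n = len(ordered)
    return [bisect_right(ordered, v) / n for v in values]


def rank_modules(modules, weights):
    """Rank modules by a weighted sum of per-feature percentile ranks.

    modules: {name: {feature: value}}; weights: {feature: weight}.
    """
    names = list(modules)
    scores = dict.fromkeys(names, 0.0)
    for feature, weight in weights.items():
        ranks = percentile_ranks([modules[n].get(feature, 0) for n in names])
        for name, rank in zip(names, ranks):
            scores[name] += weight * rank
    return sorted(names, key=lambda n: scores[n], reverse=True)


# Hypothetical modules and feature values, for illustration only.
modules = {
    "Module:A": {"pages_used_in": 5000, "edits": 12},
    "Module:B": {"pages_used_in": 40, "edits": 300},
    "Module:C": {"pages_used_in": 3, "edits": 2},
}
print(rank_modules(modules, {"pages_used_in": 0.7, "edits": 0.3}))
```

Using percentile ranks instead of raw counts keeps one heavily skewed feature, such as page usage, from drowning out the others, and changing the weights reorders the list in the same spirit as the field-importance controls in the interface.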

Why did you decide to participate in Outreachy? 

Outreachy supports us, the underrepresented community, and I was very happy to join a larger, more diverse group of individuals from all around the world. I am always attracted to a sense of comfort at work, and Outreachy’s diversity provided just that. Besides, Outreachy hosts amazing open-source projects, some very related to my field of interest, so this was the best way to get involved with open-source!

What was the process like? 

I had heard about Outreachy earlier but did not dare to apply. I was afraid of open-source and underestimated my skills. This time, I applied and decided to just get a taste of it. I found a couple of data science and ML-related projects, and I couldn’t stop myself from contributing. After my first feedback in the contribution phase, I got more confident. The mentors were really helpful and really took the time to review my work. 

Getting selected was a moment of shock and elation at the same time. Throughout my three-month internship, I regained all my lost confidence and it even grew. My mentor would always praise our hard work, and there was always a very good, supportive environment, which made me push even harder. Outreachy not only helped me break into open-source but also gave me a huge boost in confidence and skills. Thanks to this, I later got a data analyst job with the Wikimedia Foundation as well!

Can you share some of your challenges and/or successes?

Getting started is always a challenge: getting used to the workflow, the community, and how to do simple stuff. It felt awkward at first to ask for help, but later, things got very smooth. It was also challenging to figure out solutions for certain aspects of the project on my own, and I was afraid my work wouldn’t be good enough. But it turned out to be good after all! I talked about it in calls, wrote documentation on it, and it was very well received.

Do you have any advice for others who are thinking of participating in Outreachy?

Apply to Outreachy with an open mind, and do not judge your own skills. Just start contributing; you will be amazed at the help and support you get. We often give up on opportunities out of a false sense of being under-qualified. Notice how Outreachy does not ask for your resume or conduct any interviews. Your work matters; that’s all!


Image content filtration tool

Link to project (↪)

Harshinee Sriram

In an attempt to counter the vandalism attacks that Wikipedia pages and Wikimedia Commons are routinely subjected to, the content filtration tool determines how safe (or unsafe) a particular user-submitted image is.

Why did you decide to participate in Outreachy?

What makes Outreachy stand out is its commitment to providing a unique opportunity for those who have been historically underrepresented or oppressed. Personally, I applied to Outreachy to gain some real technical experience in the domain I’m interested in—machine learning.

What was the process like? 

The application process was quite straightforward. The fact that there is a filter round after the initial application submission and those selected are encouraged to contribute to the project for a month ensures that applicants truly understand the job duties, requirements, and commitments that would be expected from a full-time intern. 

I was very fortunate to have mentors who were understanding and weren’t dismissive of the ideas that I’d like to explore. During my internship, I had to take a leave for about two weeks, and I also had to shift from India to Canada, which caused some delays in the deliverables, but my mentors were supportive throughout the way.

Contributing to the project was wonderful because I was able to understand how a project lifecycle works in practice. This involved addressing community concerns, discussing available deployment pathways with other senior employees, reading about current research in model compression and attempting to implement them, writing code that adheres to a set of guidelines and practices, etc.

Can you share some of your challenges and/or successes?

As with any machine learning model, the soul is the data. And the problem with gathering data is that unstructured, raw data is cheaper (it takes less effort to obtain) than efficiently categorized and stored data. Some of our initial (and current) challenges revolve around the limitations of the dataset the model is trained on. Also, the previously determined deployment plan didn’t succeed because of the inherently high computational expense of deep learning models.

However, the successes are plentiful. Not only were we able to develop a very lightweight and robust model, but our model also performed better than Yahoo’s Open NSFW system on a variety of test images, while being considerably faster and smaller in size.

Do you have any advice for others who are thinking of participating in Outreachy?

Document all your contributions during the contribution stage! Did you attempt to test a new algorithm on some data you collected? Write about your findings and what you’ll do next, and submit this as a contribution! It’s always better to be a more engaged intern who learns from their mistakes and lets the mentors know that they’re working as actively as possible. This is because only with frequent, periodic engagements will you know how to approach the problems when the time arrives. 

Is there anything else you would like to share with us? 

Nothing much. Just that my experience was very positive and I’m glad to have contributed to the Wikimedia Foundation.

Thank you, everyone!

We want to thank our organization administrators: Gopa Vasanth and Pavithra Eswaramoorthy, for their tremendous support to interns throughout the internship and our mentors: Adam Baso, Isaac Johnson, Daniyal Abbasi, Chaitanya Mittal, Diego Saez-Trumper, Željko Filipin, Vidhi Mody, Soham Parekh, Pavithra Eswaramoorthy and Doug Taylor for their immense guidance to our interns for working on projects. 

Applications for the next round of Outreachy will open in August for the December 2021 cohort. You can sign up for their mailing list or check back in August on the website for detailed information: https://www.outreachy.org/.

About this post

Featured image credit: Dietmar Rabich / Wikimedia Commons / “Dülmen, Merfeld, Sonnenuntergang — 2012 — 3264” / CC BY-SA 4.0
