Page MenuHomePhabricator

Performance review of IP Info extension
Closed, ResolvedPublic

Description

Description

Basic description

A new extension for providing information about an IP address without (1) the need for the user to use an external service themselves and (2) exposing the IP address itself to the user. (Actually hiding the IP addresses is beyond the scope of this project.)

This provides the user with information they could have retrieved from knowing the IP address. This information could be displayed in various ways (popup on hover/click, special page, etc.).

Data

Based on our investigation in T259726, the data our users are looking for is not accessible from freely licensed datasets alone. Ideally IPInfo would combine several datasets, but this depends on agreeing licences with different providers, and will only be considered in future iterations.

The first iteration of IPInfo will use only MaxMind's GeoIP2 databases, which we already have licences for, and which are already available on our servers: T263263#6534392. A PHP package providing an API for these databases is undergoing a security readiness review: T262963.

For third parties who do not have access to the proprietary datasets, IPInfo will use MaxMind's free GeoLite2 databases.

API

IPInfo provides two API endpoints, taking an edit id or a log id. If the edit or log was performed by an anonymous user (or the log target is an anonymous user), the API returns data about the relevant IP address(es).

Client-side UI

Currently IPInfo adds a button next to IP addresses on Special:Log and history pages. The data are retrieved on clicking this button, and displayed in a popup. This design may change.

Deployment

We expect the feature to be deployed to all wikis, and be available on certain pages that show IP addresses. It will initially be available only to checkusers.

Preventing abuse

Sending an edit/log ID rather than the IP address will allow the IP address to remain hidden (once they become hidden in the future), and is intended to prevent IPInfo from being used as an API for getting information about arbitrary IP addresses.

Only users with the 'ipinfo' right can see the information. At first this right will only be given to checkusers. In the future when IP addresses are no longer visible, more users may need to access this information, e.g. for patrolling to fight vandalism. This will depend on user research and testing.

Users will sign an agreement when first using this tool, and a record will be kept of which users have access. Access may be taken away from a user, and a record will be kept of this too: T264150.

Preview environment

(Insert one or more links to where the feature can be tested, e.g. on Beta Cluster.)

Hosting the changes on Beta Cluster is a requirement prior to performance review. Please ensure that the feature can be used directly on the link(s) provided, without any data entry such as having to create an article. Any sample content needed should already be present.

If the changes cannot be hosted on Beta Cluster, explain why and provide links to an alternate public environment instead where the Performance Team can also SSH into. Links to code only is insufficient for a performance review.

IPInfo is available to logged-in users on our test environment: https://thegoodplace.wmcloud.org/index.php?title=Special:Log

Clicking buttons next to the IP addresses will result in a popup displaying either data or an error message.

Which code to review

(Provide links to all proposed changes and/or repositories. It should also describe changes which have not yet been merged or deployed but are planned prior to deployment. E.g. production Puppet, wmf config, or in-flight features expected to complete prior to launch date, etc.).

Gerrit: https://gerrit.wikimedia.org/r/admin/repos/mediawiki%2Fextensions%2FIPInfo|mediawiki/extensions/IPInfo
Github: https://github.com/wikimedia/mediawiki-extensions-IPInfo

Performance assessment

Please initiate the performance assessment by answering the below:

  • What work has been done to ensure the best possible performance of the feature?

Data for each IP address is requested on demand, rather than holding up page load.

Data is only requested once for each log entry or edit on the page.

  • What are likely to be the weak areas (e.g. bottlenecks) of the code in terms of performance?

We may be limited by the performance of GeoIP2 library.

In future versions, if more datasets and services are used, a likely bottleneck will be the performance and availability of the 3rd party webservice.

  • Are there potential optimisations that haven't been performed yet?

Depending on the license and the product requirements, we may be able to cache revisions/logs made by anon users and make requests to our API without credentials (which will give all users the information from the varnish cache).

A new popup widget is appended for each button. The data for each log entry or edit is cached in the widget, but data for the same IP address may be requested more than once if it occurs in a different log entry or edit. We could re-use the same popup widget, and we could store a map of the data for each IP address rather than perform several requests for the same IP address (but different log/edit IDs). The design is subject to change, and we do not yet know how the tool will be used.

  • Please list which performance measurements are in place for the feature and/or what you've measured ad-hoc so far. If you are unsure what to measure, ask the Performance Team for advice: performance-team@wikimedia.org.

We plan on implementing logging via an EventLogging schema, including how long the requests take and how often the tool is used.

Related Objects

StatusSubtypeAssignedTask
ResolvedSTran
ResolvedSTran
OpenNone
OpenNone
Resolved Niharika
ResolvedTchanders
Resolved Niharika
Resolved AGueyte
ResolvedSTran
ResolvedSTran
Resolved AGueyte
Resolved Niharika
ResolvedSTran
Resolved Niharika
ResolvedTchanders
InvalidNone
InvalidNone
InvalidNone
ResolvedSTran
ResolvedSTran
ResolvedSpikephuedx
ResolvedSTran
Resolved TThoabala
Resolvedphuedx
DeclinedTchanders
ResolvedSTran
ResolvedSTran
ResolvedTchanders
ResolvedTchanders
Resolvedsbassett
ResolvedDec 15 2020Tchanders
ResolvedTchanders
ResolvedTchanders
InvalidNone
ResolvedSep 22 2020Tchanders
ResolvedSep 22 2020Tchanders
ResolvedTchanders
Resolveddbarratt
ResolvedTchanders
Resolveddbarratt
ResolvedTchanders
Resolvedsbassett
Resolved Niharika
InvalidNone
StalledNone
ResolvedSecurityUrbanecm
ResolvedUrbanecm

Event Timeline

Niharika added a project: IP Info.
Niharika renamed this task from Performance review of IP Info to Performance review of IP Info [WIP].Aug 19 2020, 4:54 PM
Niharika renamed this task from Performance review of IP Info [WIP] to Performance review of IP Info [WIP] [4H].Aug 19 2020, 5:07 PM
dbarratt updated the task description. (Show Details)

Thanks for laying out this information @dbarratt.

We should bear in mind that this review will happen next quarter, once we've implemented something preliminary. We should also bear in mind that the approach we take will be constrained by the licenses; if they don't allow us to store data, we'll have to use option 3 - which is why we're taking that approach to begin with.

(Note that, although we're storing the MaxMind data, @aezell found that our license won't allow us to give our users information about other IP addresses from that dataset. Currently we only send details about their own IP address.)


With that in mind I'd suggest a few changes:

There are at least three ways to accomplish this: [...]

We should make it clear that this might be out of our hands, and highlight that we're particularly interested in discussing option 3, since this seems the most risky performance-wise, but also the most likely license-wise.

Are there any objects to any of the methods above?
If so, are there ways we can mitigate these objections?
Can you think of any other ways we might be able to solve this problem?
Which solution would your team prefer?

These questions might be a bit too broad by the time we have the review... We'll most likely have implemented a basic (though not finalised) version of option 3 by then. We may be in a position to discuss alternative approaches, but it'll depend on the licenses.

@Tchanders I added some clarity on which option we would be perusing and removed the questions. How does that look?

Tchanders renamed this task from Performance review of IP Info [WIP] [4H] to Performance review of IP Info extension.Sep 9 2020, 5:45 PM
Tchanders updated the task description. (Show Details)

Tagging Performance-Team. We're at the start of the project so we'll add more details as we can, but wanted to request this sooner rather than later. Thanks.

Gilles added subscribers: dbarratt, Gilles.

Are you aware of this history: https://phabricator.wikimedia.org/T100902#2566841 ?

I know that we are paying for the commercial version of MaxMind for some of our servers, but I don't know under which license terms.

The performance of querying such a dataset depends on which one it is. For example querying MaxMind directly is pretty fast and potentially faster than a homemade solution, because it stores this data efficiently and has dedicated tooling for querying it. Ensuring that you have proper caching at the ATS/Varnish layer might be a sufficient caching layer in front of the API that queries the IP database. The Traffic team might have a different opinion about that, it's worth consulting them. If frontend cache is considered insufficient, given that the same IPs will be requested over and over again, a local or DC-local intermediary cache (eg. memcached) seems like a good solution to put in front of the IP database.

Since making an API that can return information about any IP address is probably out of the question, given that we shut down a service that did just that 4 years ago due to abuse. I presume that this data will be presented on demand in a lot of contexts, like with hovercards. In that case, it's counter-productive in terms of performance to look up the information about all the IPs present at the time the HTML is generated, if only a small percentage of users will hover those IPs to see that information. Which means that you'll need an API to let the client-side fetch that information on demand. In that case, the simplest course of action is probably to add an encrypted signature/key to the HTML, specific to the lookup of that particular IP. The API will consume that key, verifying that it came from our own HTML (based on a secret key that 3rd-parties can't have) and that the request was specifically for that IP. In that scenario, however, we have to be mindful of the time consumed to generated the encrypted signatures. If it is greater than the time it takes to get the IP information, then we might as well include it in the HTML to begin with and not have any API call at all for hovercards.

Since making an API that can return information about any IP address is probably out of the question, given that we shut down a service that did just that 4 years ago due to abuse.

The plan is to create an endpoint(s) that accept a log id or revision id (See T260603). Effectively users would only be able to lookup IP addresses that have made logged actions or revisions.

We may create an endpoint that would allow for any IP address, but this would be limited to use by privileged users (probably only checkusers), and would only exist until T261555 is completed.

Thanks for the information and suggestions @Gilles.

We had read about that previous service - thanks for bringing it up. We're planning to accept a log/revision ID as @dbarratt says. We'd also be adding an 'ipinfo' permission, that is only available to certain user groups (starting with checkusers). In this situation, do you think we'd still want to verify that the revision/log ID is present on the page?

The only reason it might accept an IP address would be for checkusers who are seeing the IP address of a logged-in user on Special:CheckUser or Special:Investigate - that IP address wouldn't be associated with a revision/log ID. However, the only reason to provide this tool on these pages would be for the checkusers' convenience, and not for the sake of IP masking. Unless this tool could provide better information than the information that checkusers can get by copying and pasting the IP address into their preferred tools, that would arguably be pointless. @Niharika - do you think we could revisit whether this should be available for checkusers investigating logged-in users' IP addresses? If not then it wouldn't need to accept IP addresses.

@aezell is currently discussing the MaxMind license terms, but under the present terms we would not be able to make use of our MaxMind dataset for this tool. I think one of the most significant questions here will be whether we are able to agree a license with MaxMind or any other service that allows us to do any type of storage/caching of their data. If not, then the third party would need to be queried on demand. Since we're having to start building this product before these questions are resolved, it seems sensible to me to assume the worst-case scenario for now, that we can't store any data. I suspect we'd be open to changing the product specifications a bit if it was felt that this was beyond what was acceptable performance-wise - e.g. not using hover (@Niharika or @Prtksxna might want to comment further).

@Niharika I suppose there's a tension here in that, from a product perspective we might prefer to use the service(s) with the highest quality data, but from a performance perspective we might prefer to use the service(s) with the most lenient license.

I'm told that MaxMind has a new competitor, https://db-ip.com/ which makes their dataset available in MaxMind's mmdb format: https://db-ip.com/db/ which means that the related tooling and libraries work with it. It's probably worth looking into their T&C as well. The free version of that database is CC BY licensed: https://db-ip.com/db/ip-to-location That could be a great starting point.

Chances are that this PHP library https://github.com/maxmind/GeoIP2-php pointed at the db-ip free database would perform well. Your new endpoint/API will need to be instrumented anyway, if the PHP library doesn't perform well it will be obvious.

I'm told that MaxMind has a new competitor, https://db-ip.com/ which makes their dataset available in MaxMind's mmdb format: https://db-ip.com/db/ which means that the related tooling and libraries work with it. It's probably worth looking into their T&C as well. The free version of that database is CC BY licensed: https://db-ip.com/db/ip-to-location That could be a great starting point.

They have a number of competitors (See T248525). This isn't one I was aware of, but I can add it to my spreadsheet.

Chances are that this PHP library https://github.com/maxmind/GeoIP2-php pointed at the db-ip free database would perform well. Your new endpoint/API will need to be instrumented anyway, if the PHP library doesn't perform well it will be obvious.

It looks like we may be able to use our existing MaxMind license. I've created T262963 to have their library reviewed by the Security-Team in order to have it (and the dataset we currently pay for) deployed to production. :)

@aezell is currently discussing the MaxMind license terms, but under the present terms we would not be able to make use of our MaxMind dataset for this tool.

Update from @aezell is apparently that we can use it, so that's what we'll do - simplifies the problem a lot.

According to @Niharika we may want to look into using multiple services in the future, but not for the first iteration.

Unrelated to the performance review, but there seems to be a current limitation in ipv6 address comparison and 0-padding that I saw on the test wiki:

Screenshot 2020-11-03 at 13.11.35.png (397×349 px, 46 KB)

I see that the API responses don't contain any caching information:

Screenshot 2020-11-03 at 13.14.39.png (404×525 px, 55 KB)

While the JS seems to account for not re-requesting past API requests for a given IP in the current pageview, this means that after a page refresh or visiting another page with the same IP, you would be missing out on leveraging the browser cache. I think it would make sense to cache these API responses at least in the user agent, and it could be done for a long period, since the IP geo database isn't updated that frequently.

You should instrument the time it takes between a click on the (i) button and when the information is actually displayed. This can easily be done with mw.track. By measuring performance as experienced by real users this way, you will be able to see if there is any degradation once this feature gets more a lot more use once in production. You also get to track usage for free by doing this.

I see that the ResourceLoader module is correctly loaded only when dealing with the Special:Log page and a user who has the right privileges, that's great.

I would recommend against the placeholder animation with diagonal grey lines currently displayed before the results of the API request come in, since the API responses are usually very fast:

Screenshot 2020-11-03 at 13.33.55.png (155×433 px, 31 KB)

It's best to avoid drawing attention to short waits. Animated progress indicators become useful when the user is waiting for a long time (eg. waiting for the result of an expensive advanced search query). If you're worried that some users might wait 5-10 seconds for those API responses sometimes, then it's best to wait a couple of seconds before displaying the animation. This would ensure that most users (who will have a response under 2 seconds, for example) don't see the animation, and those who have very slow internet connections would see it, as they would be waiting a while for a result.

Looking at init.js, some of the variable initialisations could be done more lazily: https://github.com/wikimedia/mediawiki-extensions-IPInfo/blob/master/modules/ext.ipInfo/init.js#L4-L10 You're immediately looking for revIdAncestor and logIdAncestor in the DOM, as well as creating the button, when there are conditions - like not being a valid IP address - that will leave that unused. It's best to postpone doing those DOM lookups and object creation until you're 100% sure that they're going to be needed.

which will give all users the information from the varnish cache

Is the plan for this feature to be ultimately accessible to everyone, including logged-out users?

Is the plan for this feature to be ultimately accessible to everyone, including logged-out users?

No. This feature would be available to some logged-in users. We plan to make it available to checkusers only first and then potentially admins.

@Gilles Thanks for the thorough review - and not just of performance!

Unrelated to the performance review, but there seems to be a current limitation in ipv6 address comparison and 0-padding that I saw on the test wiki:

We're working on this - filed as T266881 and T267015

I think it would make sense to cache these API responses at least in the user agent, and it could be done for a long period, since the IP geo database isn't updated that frequently.

Filed as T267230

You should instrument the time it takes between a click on the (i) button and when the information is actually displayed.

WIP task - T267235. This is WIP until the design is finalised (the button+popup might be changed).

I would recommend against the placeholder animation with diagonal grey lines currently displayed before the results of the API request come in, since the API responses are usually very fast

Filed as T267237

Looking at init.js, some of the variable initialisations could be done more lazily

Filed as T267240

which will give all users the information from the varnish cache

Is the plan for this feature to be ultimately accessible to everyone, including logged-out users?

I believe David was just thinking theoretically here.

Gilles changed the task status from Open to Stalled.Nov 5 2020, 1:08 PM

Awesome, thanks for filing all these tasks. That's all for now, since you mention that the design might change, please ping me when that's the case, so I can look at the changes. If there's any other major change, please move the status of this task back to "Open" and point me to what needs to be looked at.

Hi @Niharika: I see some movement in other tasks and such, is there a new expected timeline when Performance needs to do their review?

Hi @Niharika: I see some movement in other tasks and such, is there a new expected timeline when Performance needs to do their review?

Oh yes! We are working on the MVP and hope to have the tool ready for soft launch in the next couple months. I am not exactly sure when we will be ready for a performance review so I'll ping @phuedx who knows better.

Tchanders changed the task status from Stalled to Open.Feb 1 2022, 1:59 PM
Tchanders added a subscriber: JayCano.

Hi @Niharika: I see some movement in other tasks and such, is there a new expected timeline when Performance needs to do their review?

Oh yes! We are working on the MVP and hope to have the tool ready for soft launch in the next couple months. I am not exactly sure when we will be ready for a performance review so I'll ping @phuedx who knows better.

@greg Are you the right person for us to contact about a re-review?

We have a proposed plan to deploy IPInfo to testwiki in a month and to the rest of our production sites as a BetaFeature by the end of the quarter. However, this may be blocked on a performance review, as more features have been added since T260821#6605932. What would you advise we do?

Also copying in @JayCano, who is our (not-so) new engineering manager.

Niharika claimed this task.

We talked about this within the team and it seems like we have not made any major changes that may impact performance since the last review. With that in mind, we have decided to go ahead with the deploys without requesting another review. Thank you.