Tuesday, September 6, 2016

libinput and the Lenovo T450 and T460 series touchpads

I'm using T450 and T460 as reference but this affects all laptops from the Lenovo *50 and *60 series. The Lenovo T450 and T460 have the same touchpad hardware, but unfortunately it suffers from what is probably a firmware issue. On really slow movements, the pointer has a halting motion. That effect disappears when the finger moves faster.

The observable effect is that of a pointer stalling, then jumping by 20 or so pixels. We have had a quirk for this in libinput since March 2016 (see commit a608d9) and detect this at runtime for selected models. In particular, what we do is look for a sequence of events that only update the pressure values but not the x/y position of the finger. This is a good indication that the bug triggers. While it's possible to trigger pressure changes alone, triggering several in a row without a change in the x/y coordinates is extremely unlikely. Remember that these touchpads have a resolution of ~40 units per mm - you cannot hold your finger that still while changing pressure [1]. Once we see those pressure changes only we reset the motion history we keep for each touch. The next event with an x/y coordinate will thus not calculate the delta to the previous position and not trigger a move. The event after that is handled normally again. This avoids the extreme jumps but there isn't anything we can do about the stalling - we never get the event from the kernel. [2]

Anyway. This bug popped up again elsewhere so this time I figured I'll analyse the data more closely. Specifically, I wrote a script that collected all x/y coordinates of a touchpad recording [3] and produced a black and white image of all device coordinates sent. This produces a graphic that's interesting but not overly useful:

Roughly 37000 touchpad events. You'll have to zoom in to see the actual pixels.
I modified the script to assume a white background and colour any x/y coordinate that was never hit black. So an x coordinate of 50 would now produce a vertical 1 pixel line at 50, a y coordinate of 70 a horizontal line at 70, etc. Any pixel that remains white is a coordinate that is hit at some point, anything black was unreachable. This produced more interesting results. Below is the graphic of a short, slow movement right to left.

A single short slow finger movement
You can clearly see the missing x coordinates. More specifically, there are some events, then a large gap, then events again. That gap is the stalling cursor where we didn't get any x coordinates. My first assumption was that it may be a sensor issue and that some areas on the touchpad just don't trigger. So what I did was move my finger around the whole touchpad to try to capture as many x and y coordinates as possible.

Let's have look at the recording from a T440 first because it doesn't suffer from this issue:

Sporadic black lines indicating unused coordinates but the center is purely white, indicating every device unit was hit at some point
Ok, looks roughly ok. The black areas are irregular, on the edges and likely caused by me just not covering those areas correctly. In the center it's white almost everywhere, that's where the most events were generated. And now let's compare this to a T450:

A visible grid of unreachable device units
The difference is quite noticeable, especially if you consider that the T440 recording had under 15000 events, the T450 recording had almost 37000. The T450 has a patterned grid of unreachable positions. But why? We currently use the PS/2 protocol to talk to the device but we should be using RMI4 over SMBus instead (which is what Windows has done for a while and luckily the RMI4 patches are on track for kernel 4.9). Once we talk to the device in its native protocol we see a resolution of ~20 units/mm and it looks like the T440 output:

With RMI4, the grid disappears
Ok, so the problem is not missing coordinates in the sensor and besides, at the resolution the touchpad has a single 'pixel' not triggering shouldn't be much of a problem anyway.

Maybe the issue had to do with horizontal movements or something? The next approach was for me to move my finger slowly from one side to the left. That's actually hard to do consistently when you're not a robot, so the results are bound to be slightly different. On the T440:

The x coordinates are sporadic with many missing ones, but the y coordinates are all covered
You can clearly see where the finger moved left to right. The big black gaps on the x coordinates mostly reflect me moving too fast but you can see how the distance narrows, indicating slower movements. Most importantly: vertically, the strip is uniformly white, meaning that within that range I hit every y coordinate at least once. And the recording from the T450:

Only one gap in the y range, sporadic gaps in the x range
Well, still looks mostly the same, so what is happening here? Ok, last test: This time an extremely slow motion left to right. It took me 87 seconds to cover the touchpad. In theory this should render the whole strip white if all x coordinates are hit. But look at this:

An extremely slow finger movement
Ok, now we see the problem. This motion was slow enough that almost every x coordinate should have been hit at least once. But there are large gaps and most notably: larger gaps than in the recording above that was a faster finger movement. So what we have here is not an actual hardware sensor issue but that the firmware is working against us here, filtering things out. Unfortunately, that's also the worst result because while hardware issues can usually be worked around, firmware issues are a lot more subtle and less predictable. We've also verified that newer firmware versions don't fix this and trying out some tweaks in the firmware didn't change anything either.

Windows is affected by this too and so is the synaptics driver. But it's not really noticeable on either and all reports so far were against libinput, with some even claiming that it doesn't manifest with synaptics. But each time we investigated in more detail it turns out that the issue is still there (synaptics uses the same kernel data after all) but because of different acceleration methods users just don't trigger it. So my current plan is to change the pointer acceleration to match something closer to what synaptics does on these devices. That's hard because synaptics is mostly black magic (e.g. synaptics' pointer acceleration depends on screen resolution) and hard to reproduce. Either way, until that is sorted at least this post serves as a link to point people to.

Many thanks to Andrew Duggan from Synaptics and Benjamin Tissoires for helping out with the analysis and testing of all this.

[1] Because pressing down on a touchpad flattens your finger and thus changes the shape slightly. While you can hold a finger still, you cannot control that shape
[2] Yes, predictive movement would be possible but it's very hard to get this right
[3] These are events as provided by the kernel and unaffected by anything in the userspace stack


Christian von Kietzell said...

Did you contact Lenovo about this? And if so, how did they respond (if at all)?

kparal said...

Great investigation, thanks.