Kernel News

Inheriting Filesystem Capabilities

Christoph Lameter posted a patch to make filesystem capabilities inheritable the way the SUID bit is. When you set the SUID bit on an executable you own and another user runs it, it runs with your permissions rather than that user’s. Any files it creates are owned by you, and any programs it invokes likewise run as you instead of as that user.
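As a rough illustration of the SUID bit described above, the following Python sketch tests whether a file mode carries the set-user-ID bit. The helper names `is_setuid` and `file_is_setuid` and the example modes are assumptions made for illustration; only `os.stat` and the `stat` module constants come from the standard library.

```python
import os
import stat

def is_setuid(mode):
    """Return True if the permission bits in `mode` include set-user-ID."""
    return bool(mode & stat.S_ISUID)

def file_is_setuid(path):
    """Check the set-user-ID bit of an existing file (hypothetical helper)."""
    return is_setuid(os.stat(path).st_mode)

# 0o4755 is rwsr-xr-x: the leading 4 is the set-user-ID bit.
print(is_setuid(0o4755))  # True
print(is_setuid(0o0755))  # False
```

The bit is all-or-nothing: it says only that the program runs as its owner and carries no information about which of the owner’s privileges the program actually needs.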

Capabilities don’t have that kind of inheritability. So, if you write a program and give it certain capabilities, such as raw network access, any programs it invokes will not have that capability. Thus, the program cannot rely on any other tools to help do that part of its work. Christoph said, “This is behavior that is counterintuitive to the expected behavior of processes in Unix.”

Making capabilities inheritable, Christoph said, was preferable to simply running executables with the SUID bit set. The SUID bit is a very blunt tool, giving the executable *all* the permissions of its owner, whereas capabilities are more surgical, allowing you to constrain those permissions to just the set that is needed.
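To make the “surgical” nature of capabilities concrete, here is a minimal Python sketch that decodes the capability bitmasks Linux reports in /proc/<pid>/status. The function names, the sample text, and the three-entry name table are assumptions chosen for illustration; the bit numbers follow the kernel’s linux/capability.h header.

```python
# Bit positions as defined in the kernel's <linux/capability.h>.
# Only a few entries are listed here; the real table is longer.
CAP_NAMES = {12: "CAP_NET_ADMIN", 13: "CAP_NET_RAW", 21: "CAP_SYS_ADMIN"}

def parse_cap_sets(status_text):
    """Extract the Cap* hexadecimal masks from /proc/<pid>/status text."""
    sets = {}
    for line in status_text.splitlines():
        key, _, value = line.partition(":")
        if key.startswith("Cap"):
            sets[key] = int(value.strip(), 16)
    return sets

def decode(mask):
    """Name the capabilities in `mask` that this sketch knows about."""
    return [name for bit, name in sorted(CAP_NAMES.items()) if mask & (1 << bit)]

# Hypothetical status excerpt for a process holding only CAP_NET_RAW.
sample = (
    "Name:\tping\n"
    "CapInh:\t0000000000000000\n"
    "CapPrm:\t0000000000002000\n"
    "CapEff:\t0000000000002000\n"
)
sets = parse_cap_sets(sample)
print(decode(sets["CapPrm"]))  # ['CAP_NET_RAW']
```

On a Linux system, passing the contents of /proc/self/status to `parse_cap_sets` should show the masks for the running interpreter.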

Christoph pointed out that this had been a problem for quite a while and that no better alternative seemed to be available. He remarked that “some involved in security development under Linux have even stated that they want to rip out the whole thing and replace it.” He explained:

This patch does not change the default behavior but it allows to set up a list of capabilities in the proc filesystem that will enable regular unix inheritance only for the selected group of capabilities.

With that it is then possible to do something trivial like setting CAP_NET_RAW on an executable that can then allow that capability to be inherited by others.

Christoph also added, “I usually do not dabble in security and I am not sure if this is done correctly. If someone has a better solution then please tell me but so far we have not seen anything else that actually works.”

Serge Hallyn felt there were some dangers here. POSIX capabilities were tied to the privileges of both the user and the file itself, whereas Christoph’s code seemed to tie capabilities to just the file. Serge suggested adding a new capability, listing the capabilities available to be inheritable by that particular user. The user could then choose which capabilities would be inheritable from a given executable. This way both user privileges and file privileges would be respected.

However, Serge also said, “Not saying this is a good idea necessarily, but worth thinking about.”
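Serge’s observation that capabilities depend on both the process and the file can be sketched with the transformation the kernel applies at exec time, as documented in the capabilities(7) man page. The Python below is a simplified model, not kernel code: it ignores the special handling of root and any later kernel additions, and the function and variable names are assumptions for illustration.

```python
CAP_NET_RAW = 1 << 13          # bit 13 in <linux/capability.h>
ALL_CAPS = (1 << 64) - 1       # stand-in for an unrestricted bounding set

def caps_after_exec(p_inheritable, p_bounding,
                    f_inheritable, f_permitted, f_effective):
    """Model the documented exec rule for a non-root process.

    p_* are the process sets before exec; f_* come from the file's
    capability attribute (f_effective is a single flag, not a mask).
    """
    permitted = (p_inheritable & f_inheritable) | (f_permitted & p_bounding)
    effective = permitted if f_effective else 0
    return {"permitted": permitted,
            "effective": effective,
            "inheritable": p_inheritable}

# A parent marks CAP_NET_RAW inheritable, then runs a helper whose file
# carries no capabilities at all: the helper ends up with nothing.
plain = caps_after_exec(CAP_NET_RAW, ALL_CAPS, 0, 0, False)
print(plain["permitted"] == 0)  # True

# Only if the helper's file also lists CAP_NET_RAW as inheritable does
# the capability survive the exec.
tagged = caps_after_exec(CAP_NET_RAW, ALL_CAPS, CAP_NET_RAW, 0, True)
print(tagged["effective"] == CAP_NET_RAW)  # True
```

The first case is the behavior Christoph called counterintuitive: the process’s inheritable set is masked by the file’s inheritable set, so a helper without file capabilities receives none.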

Casey Schaufler remarked that the POSIX draft relevant to this whole question was only a draft document that had ultimately been withdrawn. So, there could be no question of true POSIX conformance or lack thereof on this issue. Casey also said:

The POSIX capability scheme is the simplest mechanism we could come up with that allows existing setuid programs to work unmodified and still make it possible to constrain specific capabilities. Is it complicated? Yes. Why is it complicated? Because you need the option of using the file capabilities to raise and lower the privilege of a program. Had we the option of requiring the programs to do that themselves, the whole thing would have been easier. You also need the option of having a capability aware program manipulate it’s own capabilities.

All the UNIX systems that implemented capabilities did so using one variate or another of the POSIX scheme. One, Trusted IRIX, successfully eliminated root privilege.

In terms of Christoph’s comment that some security folks wanted to rip out capabilities entirely and replace them with something else, Casey remarked, “I’m game to participate in such an effort. The POSIX scheme is workable, but given that it’s 20 years old and hasn’t developed real traction it’s hard to call it successful.”

To address POSIX capabilities’ lack of traction over 20 years, Serge said, “I personally think it’s two things: 1. lack of toolchain and fs support. The fact that we cannot to this day enable ping using capabilities by default because of cpio, tar and non-xattr filesystems is disheartening. 2. It’s hard for users and applications to know what caps they need. Yes the API is a bear to use, but we can hide that behind fancier libraries. But using capabilities requires too much in-depth knowledge of precisely what caps you might need for whatever operations library may now do when you asked for something.”

In response to Serge’s first point, Mimi Zohar said, “We’re working on resolving the CPIO issue. tar currently supports xattrs. At this point, how many non-xattr file systems are there really?” Austin Hemmelgarn replied, “FAT* and UFS immediately come to mind, and I know of people who use UFS for their root filesystem.”

In response to Serge’s second point, Casey said, “If the audit system reported the capabilities relevant to the decision you’d have what you need. If you failed because you didn’t have CAP_CHMOD or you succeeded because you had CAP_SYS_ADMIN it should show up in the audit record. Other systems have used this approach.”

Andy Lutomirski, however, didn’t agree with Serge’s point about needing filesystem support. He said, “if I hold a capability and I want to pass that capability to an exec’d helper, I shouldn’t need the fs’s help to do this.” To which Christoph said, “amen!”

At a certain point, Christoph reined in the discussion somewhat, reiterating that the problem had lingered for too long and needed a real live patch. He emphasized that in his patch, “the file being executed can inherit the parent caps without having to set caps in the new executable.” He wasn’t going for anything fancier than that, given that nothing fancier was actually on the table.

Serge said that he actually still preferred his earlier suggestion of introducing a new capability that listed inheritable capabilities. Christoph asked for a real live patch, and, at that point, folks delved into a technical discussion addressing various fine points and implementation details, with objections and affirmations focusing more on the code than on the high-level direction.

At this point, it seems that one form or another of Christoph’s desired inheritability feature will probably eventually go into the kernel. There is still some controversy surrounding it, however. For one thing, as Casey pointed out, POSIX capabilities were never truly standardized. For another, there is an existing base of software that still needs to run properly, on top of whatever solution comes along. And, finally, there are security concerns that trump all other concerns but that also tend to be somewhat convoluted. All of these things are a recipe for compromise that makes it hard to predict what the final result will look like.

Reporting on Inactive CPUs

Some ideas that seem good end up going nowhere, at least for a while.

Yalin Wang posted a patch to cause /proc/stat to list all CPUs on a running system, regardless of whether a given CPU was online or offline. The reason, he said, was that some CPUs went online or offline dynamically and might need to be tracked. And, if a library wanted to know how many CPUs were on the system, it should get the real number, rather than just the number of CPUs currently online.

David Rientjes liked the idea, but he didn’t think it was necessary to add the information to /proc/stat. The /sys/devices/cpu file reports the number of CPUs in the system. It made more sense to update that file to list all CPUs instead of just online CPUs. Andrew Morton also pointed out that /proc/cpuinfo should be updated to list all CPUs as well.
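The distinction between all CPUs and online CPUs can be illustrated with the CPU-list format that Linux exposes through sysfs (for example, "0-3,5"). The Python sketch below parses that format; the sysfs directory /sys/devices/system/cpu and its `possible` and `online` files are what this example assumes, and the function names are hypothetical.

```python
def parse_cpu_list(text):
    """Expand a kernel CPU list such as '0-3,5' into sorted CPU numbers."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue  # an empty list means no CPUs
        first, _, last = part.partition("-")
        cpus.update(range(int(first), int(last or first) + 1))
    return sorted(cpus)

def read_cpu_list(name):
    """Read a list such as 'possible' or 'online' from sysfs (Linux only)."""
    with open(f"/sys/devices/system/cpu/{name}") as f:
        return parse_cpu_list(f.read())

print(parse_cpu_list("0-3,5\n"))  # [0, 1, 2, 3, 5]
```

On a Linux machine, comparing `len(read_cpu_list("possible"))` with `len(read_cpu_list("online"))` shows the gap a library would care about when CPUs come and go dynamically.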

At this point, however, Yalin hit a stumbling block. In the Android kernel, some code depended on these files to determine the number of online CPUs rather than their total number. If he changed the behavior of those files, he’d break compatibility.

So, that was the end of that.

Adding Timekeeping Tests to the Kernel

John Stultz posted some timekeeping test patches. He’d hosted them on GitHub for a few years, but now he had the time to make them kernel-ready, so he wanted to get some feedback on what he should change.

The tests did things like setting the system time to a value that might lead to problems, such as the last moment of the last day of 1999. Some of John’s tests could have a destructive effect along those lines, and some would produce quiet warnings if the system behaved in an improper but non-threatening way.
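As a small illustration of the kind of edge value mentioned above, this Python sketch computes the Unix timestamp for the last second of 1999 and shows the rollover one second later. It only calculates values and does not touch the system clock; the helper name `utc_timestamp` is an assumption, and nothing here is taken from John’s actual test code.

```python
import calendar
import time

def utc_timestamp(year, month, day, hour, minute, second):
    """Seconds since the Unix epoch for a UTC calendar time."""
    return calendar.timegm((year, month, day, hour, minute, second, 0, 0, 0))

LAST_SECOND_OF_1999 = utc_timestamp(1999, 12, 31, 23, 59, 59)
print(LAST_SECOND_OF_1999)  # 946684799

# One second later the year, month, day, and hour all change at once.
rollover = time.gmtime(LAST_SECOND_OF_1999 + 1)
print(time.strftime("%Y-%m-%d %H:%M:%S", rollover))  # 2000-01-01 00:00:00
```

A test that actually set the clock to such a value would need elevated privileges and would disturb the running system, which is why tests of that kind count as destructive.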

Richard Cochran liked the patches, and Shuah Khan did as well. She suggested having non-destructive tests run by default and having the user specify destructive tests as desired. She also suggested that John “use kselftest.h reporting mechanism for new tests. posix_timers.c is updated to use it and it would make sense use it for new tests as well.”

John said he’d give that a try, although he also added, “one thing I’ve tried to do with my test suite is minimize any sort of test-infrastructure dependencies, so as much as possible, single test files can be plucked out, built and run by themselves.” But, he said he’d ditch that plan if no one was into it.

Shuah saw the value in his approach and said that adapting to kselftest was not mandatory. She suggested he give it a shot, but “if it does become hard, I am not going to make it a requirement to use it.”

It seems clear that John’s patches will get into the kernel soon, perhaps with some changes. They don’t seem to have any controversy attached.

Freezing File Writes

Namjae Jeon posted some patches to implement file freezing – the idea being that a file write could be blocked temporarily, even if a process already held an open file descriptor. He said, “File write freeze functionality, when used in conjunction with inode’s immutable flag, can be used for creating truly stable file snapshots, wherein write freeze will prevent any modification to the file from already open file descriptors, and immutable flag will prevent any new modification to the file. One of the intended uses for stable file snapshots would be in the defragmentation applications which defrags single file.”

Dmitry Monakhov offered some technical suggestions for Namjae’s patch but overall thought the work was interesting. Jan Kara also had some technical suggestions and pointed out that some of Namjae’s test code contained race conditions. Namjae responded to all suggestions with code and pseudocode that might improve the behavior.

Jan also pointed out some problems with the overall design. Jan said:

Doing fs-wide freezing from userspace makes sense as Dmitry pointed out. We can then just fail FS_IOC_FWFREEZE with error when the whole fs isn’t frozen. I’m just somewhat worried whether the fs-wide freezing won’t be too fragile. E.g. consider a situation when you are running a defrag program which is freezing and unfreezing the filesystem and then some background work kicks which will want to snapshot the file system so it will freeze & unfreeze the fs as well. Now depending on how exactly defrag and snapshot race one of the FIFREEZE ioctls will return EBUSY and the process (hopefully gracefully) fails.

This isn’t a new situation – if you ran two snapshots at once, you’d see the same failure. But the more fs-wide freezing gets used in different places the stranger and less expected failure you’ll see … .
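The failure Jan describes can be sketched as a toy model in Python. This is not kernel code and performs no ioctl calls: the `ToyFilesystem` class is a hypothetical stand-in that tracks only a single frozen flag, and the choice of `EINVAL` for thawing an unfrozen filesystem is an assumption of the model.

```python
import errno

class ToyFilesystem:
    """Toy model of fs-wide freezing: one flag, no nesting."""

    def __init__(self):
        self.frozen = False

    def freeze(self):
        # A second freeze while already frozen fails with EBUSY.
        if self.frozen:
            raise OSError(errno.EBUSY, "filesystem already frozen")
        self.frozen = True

    def thaw(self):
        if not self.frozen:
            raise OSError(errno.EINVAL, "filesystem not frozen")
        self.frozen = False

fs = ToyFilesystem()
fs.freeze()                      # the defrag program freezes first
try:
    fs.freeze()                  # the snapshot job arrives too early
    snapshot_error = None
except OSError as err:
    snapshot_error = err.errno
fs.thaw()                        # defrag finishes and unfreezes

print(snapshot_error == errno.EBUSY)  # True
```

Which of the two callers loses depends purely on timing, so every additional user of fs-wide freezing adds another place where such a failure has to be handled gracefully.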

Dave Chinner also had some technical suggestions and agreed that some of the code was “terrible racy.” Namjae continued to offer new patches and ideas for how to adapt all the feedback.

It seems as though there are two issues getting in the way of Namjae’s patches. The first would be the race conditions, and the second would be the fragility of the code itself – the likelihood that new race conditions could be inadvertently inserted. Overall, the feature itself – freezing writes – seems like something that no one opposes. So, you could expect some kind of implementation in the not-too-distant future.
