Page MenuHomekolab.org

[WIP] Files API
ClosedPublic

Authored by machniak on Mar 15 2022, 12:11 PM.

Details

Reviewers
vanmeeuwen
Group Reviewers
Restricted Project
Commits
rKb102fdd0f8d4: [WIP] Files API
rKd77c5d69f16a: [WIP] Files API
Test Plan

./phpunit

Diff Detail

Repository
rK kolab
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

machniak requested review of this revision.Mar 15 2022, 12:11 PM
machniak created this revision.

According to our discussion in jabber the idea was to have the following split:

  • Library: Controls where files are stored (so for kolabnow we'd have at least initially a single library for all users). This is not exposed to the user. (An alternative name for this could be "Repository")
  • Share: Controls who can see files (and probably maintian other acl's as well). Initially each user would have one share giving him access to his files.

A file must belong to exactly one Library and can belong to 1-n Shares (so we don't orphan files).

You are not currently following this model so I think we should discuss if there is a reason why, or if you just did something to have a starting point.

machniak updated this revision to Diff 9931.Mar 21 2022, 2:06 PM
  • Fix sharing
machniak updated this revision to Diff 9934.Mar 21 2022, 2:08 PM
  • Add missing FilePermission model
Harbormaster completed remote builds in B38719: Diff 9934.

@mollekopf I just added the sharing feature. What do you think about this now?

I'm not sure about the permissions model.

In principle this of course works and should be efficient for checking permissions, but UX wise I think we'll want something different, and while we could build it on top it might make things more complicate in other ways.

Instead of individual file permissions I would propose the following (which I think is in line with what we discussed before on jabber):

  • "File"s can be part of 1-n "Share"s
  • A user can have an "AccessRight" (read, write, admin) to a Share

This provides a level of indirection so files can be grouped and shared that seems more useful to me. It's still possible to set acl's for a single user for a single file, but you can also manage a group of files.
For performance reasons we could of course still have a table/view/index that allows for a direct lookup of the permissions a user has for a file, but I'm not sure it would be necessary.

src/app/File.php
26 ↗(On Diff #9886)

I don't think we should store folders as files, if we decide to have folders at all.

For the final file storage (not the intermediate files used for upload), we'll want a way to associate files with a "Library", which can have different storage backends (local filesystem, s3, ...). Not a blocker for a first step, just so we agree on the direction.

Ok, so the current schema does not cover folders and sharing of folders. I have a simple solution for that:

--- a/src/database/migrations/2022_03_02_100000_create_filesystem_tables.php
+++ b/src/database/migrations/2022_03_02_100000_create_filesystem_tables.php
@@ -30,6 +30,7 @@ function (Blueprint $table) {
             function (Blueprint $table) {
                 $table->string('id', 36)->primary();
                 $table->string('library_id', 36);
+                $table->string('parent_id', 36)->nullable()->index(); // parent folder identifier
                 $table->string('name', 512);
                 $table->bigInteger('size')->unsigned()->default(0);
                 $table->string('mimetype');

I.e. we add parent_id column to the files table (well, it maybe should be called nodes or entries, with or w/o filesystem_ prefix?) and we store folders with files in the same table (using e.g. application/x-folder for mimetype). This way listing folders+files in a specified point of the tree is a single select query. And file_permissions (or just permissions?) table would cover sharing folders and files.

machniak updated this revision to Diff 9952.Apr 8 2022, 1:57 PM
  • File upload/download via unauth API location
  • File update API
  • Files API schema/implementation refactor (slim-down)
machniak updated this revision to Diff 9958.Apr 8 2022, 3:30 PM
  • Storage error handling
machniak updated this revision to Diff 9964.Apr 11 2022, 3:53 PM
  • Store files in chunks (don't concatenate), some refactorings
machniak updated this revision to Diff 9970.Apr 12 2022, 10:15 AM
  • Make max chunk size depend on SWOOLE_PACKAGE_MAX_LENGTH option

Some high-level observations from a first glance:

  • The permission->id is used as the unique identifier for a share-link, which means the same user will always get the same link, but each user will receive a different link.
  • Instead of reassembling files individual chunks are now stored, which means we can save ourselves the hassle of reassembling the files and just serve the same chunks for download (I suppose? Haven't checked)

Please note I'm obviously biased in my assessment because I'm still aiming for the "Share" solution, so take my comments with a grain of salt.

Regarding the first point:
I would have preferred to not have "Permissions" at all (not in this form anyways). For invidivual links I think it would be more flexible to have "ShareLink" table, which for the time being simply ties a user to read-only access of a file. I think those "ShareLinks" could grow into something more powerful, but for the time being that's all we need, if we're only doing read-only sharing.
Since I would be doing permissions via Shares in the long run, to me the Permissions table only makes things more complicated to migrate, without providing any benefit for what we want initially.

Regarding the second point:
While I quite like the solution in general I'm wondering if we're not going to make our live more complex once we look at e.g. storing files in S3 or alike. Perhaps it would be simpler to just say we always end up with a file that we can then push somewhere (local filesystem, s3, ...), but on the other hand, we can of course reassemble the chunks before pushing to s3. So I'm not sure which way to go, but would have just reassembled the file (seems simpler), unless there is a good reason to not do so (e.g. an actual performance/memory problem).

These requested changes are mostly related to the database schema, I trust I do not have to highlight related changes.

src/database/migrations/2022_03_02_100000_create_filesystem_tables.php
16

fs_items, so we may introduce a column fs_type (bitflip);

Example (very speculative):

1 - regular file
2 - regular directory
4 - directory with photo album
8 - directory with backups
16 - directory with documents
32 - archived directory
64 - shared (basically allowing us to draw attention to it?)
128 - AES encrypted
256 - ?? etc.

meaning that fs_type == 36 contains some really old photos, and fs_type == 100 means you have it shared probably with some family members.

20

should be in fs_attributes, with an observer that updates this column (because it is what we have to list touching as few tables as possible)

21

not needed here, fs_attributes

22

not needed here, fs_attributes

25

this table can also use softDeletes()

27

additional index on user_id

34

moves to fs_attributes as key-value almost like a settings trait

52

if the terminology in upload and download is "chunking" (i don't know that it is), then let's name this fs_chunks

54

doesn't need an id column

55

additional columns chunk_id (string, 36) and sequence_id for ordering

65

remove beta- prefix

71

remove \Beta from namespace

vanmeeuwen requested changes to this revision.Apr 13 2022, 4:35 PM
This revision now requires changes to proceed.Apr 13 2022, 4:35 PM
mollekopf added inline comments.Apr 13 2022, 4:57 PM
src/database/migrations/2022_03_02_100000_create_filesystem_tables.php
34

In that case I would just store the share link in the attribute since we don't really need the permission at this point.

55

the chunk_id seems superfluous as file_id + sequence_id would already be unique.

  • If we move file name, size, mimetype to a separate table whenever we want to list/sort/search we have to make join(s) which will make it slow, and I'd say listing/searching by these three attributes is essential. A simplest listing needs to return at least the file size, name and mimetype, this makes this basic query much complicated/slower.
  • If we wanted to re-use the existing SettingsTrait the table name would have to be fs_file_settings, or you insist on "attributes"?
  • If we move permissions to settings/attributes then a "search by share id" query will be very slow, as it will be a non-indexed query on a big table. Therefore I'd keep the fs_permissions table as is.
  • App\Handlers\Beta\Files is there for a reason, it extends App\Handlers\Beta\Base class, but I guess I can move it to App\Handlers\Files. Note: that we still have Distlists, Resources, SharedFolders under \Beta, so it would not be consistent.
  • I understand fs_chunks needs a sequence column, but what's wrong with the incremented id? Ok, I guess with a generated id we can make it more performant, because we don't need to make an INSERT first. I.e. we'd generate chunk-id, save the chunk-file, then insert. Do you want to make the chunk id unique/primary key?
  • @mollekopf, the file_id + sequence is not unique enough, there might be more than one deleted record with the same file_id + sequence combination. Especially considering file updates and partial uploads. SO, I see a reason to have an extra id for the chunk filename.
  • If we move file name, size, mimetype to a separate table whenever we want to list/sort/search we have to make join(s) which will make it slow, and I'd say listing/searching by these three attributes is essential. A simplest listing needs to return at least the file size, name and mimetype, this makes this basic query much complicated/slower.

The assumption is that most if not all "lists" of "files" do not contain a column for the attributes such as size, mtime nor mimetype, let alone allow for "sorting by".

Furthermore, a "list of files" even with metadata does not require us to use a single roundtrip -- it can be a) a list of files and b) get some metadata -- avoiding joins.

That being said, where additional metadata such as size and/or mtime are to be included in the listing, we should probably consider the fs_items column values to be a derivative -- there could be many aspects that could justify updating mtime, such as changing permissions.

  • If we wanted to re-use the existing SettingsTrait the table name would have to be fs_file_settings, or you insist on "attributes"?

fs_item_settings I suppose? 'attributes' would be more natural, can it simply inherit/extend SettingsTrait or something creative to avoid duplication of code?

  • If we move permissions to settings/attributes then a "search by share id" query will be very slow, as it will be a non-indexed query on a big table. Therefore I'd keep the fs_permissions table as is.

I don't understand how/why a search by share id would be very slow. For wallet transactions for example, we already search a table that is millions of rows per calendar year. This is all about the right indexes and cardinality.

  • App\Handlers\Beta\Files is there for a reason, it extends App\Handlers\Beta\Base class, but I guess I can move it to App\Handlers\Files. Note: that we still have Distlists, Resources, SharedFolders under \Beta, so it would not be consistent.

Retroactively "fixing" those entitlements could be a thing, but probably does not need to happen in this implementation.

  • I understand fs_chunks needs a sequence column, but what's wrong with the incremented id? Ok, I guess with a generated id we can make it more performant, because we don't need to make an INSERT first. I.e. we'd generate chunk-id, save the chunk-file, then insert. Do you want to make the chunk id unique/primary key?

An auto-increment ID can not be influenced; it always goes up.

Further, in a MariaDB Galera environment, there's an additional increment for each server in the cluster. You can see this in the ticket numbers on this here Phabricator for example -- the auto-increment steps is 3.

  • @mollekopf, the file_id + sequence is not unique enough, there might be more than one deleted record with the same file_id + sequence combination. Especially considering file updates and partial uploads. SO, I see a reason to have an extra id for the chunk filename.

What is unique is file_id, seq_num and deleted_at.

  • If we move file name, size, mimetype to a separate table whenever we want to list/sort/search we have to make join(s) which will make it slow, and I'd say listing/searching by these three attributes is essential. A simplest listing needs to return at least the file size, name and mimetype, this makes this basic query much complicated/slower.

My understanding of the argument is this:

  • We pick a deliberately generic database layout, so we don't have to modify it as we build new feature. This is a deliberate trade-off that get's us flexibility, but we have to deal with some extra complexity and a performance hit for some queries.
  • As long as we can reduce the initial data-set to a sufficiently small set, the join is not going to be a performance problem. E.g. as long as we only have to join the extra metadata for as many files as fit on a screen, it shouldn't be a problem.
  • If we wanted to re-use the existing SettingsTrait the table name would have to be fs_file_settings, or you insist on "attributes"?
  • If we move permissions to settings/attributes then a "search by share id" query will be very slow, as it will be a non-indexed query on a big table. Therefore I'd keep the fs_permissions table as is.

If you are referring to the previously discussed concept of a "Share", here's how I think it would be implemented in the proposed db structure:

  • We have an item of type share.
  • We possibly have a separate table for item to item relations, or we use the attribute table.

A typical query will first list the shares that a user has, which should be a cheap query returning a couple of results (but possibly requiring a join to fetch some extra metadata).
Listing the files in the share would be querying the relation or attribute table for a list of files, and then listing the files again shouldn't be a problem.

Can you outline what problems you see?

  • App\Handlers\Beta\Files is there for a reason, it extends App\Handlers\Beta\Base class, but I guess I can move it to App\Handlers\Files. Note: that we still have Distlists, Resources, SharedFolders under \Beta, so it would not be consistent.

I agree we should opt for a consistent solution.
I generally like vanmeeuwens suggestion of not prefixing the sku name and handlers with Beta, but let's perhaps do that in a separate patch, and then for everything (if we all agree).

  • I understand fs_chunks needs a sequence column, but what's wrong with the incremented id? Ok, I guess with a generated id we can make it more performant, because we don't need to make an INSERT first. I.e. we'd generate chunk-id, save the chunk-file, then insert. Do you want to make the chunk id unique/primary key?
  • @mollekopf, the file_id + sequence is not unique enough, there might be more than one deleted record with the same file_id + sequence combination. Especially considering file updates and partial uploads. SO, I see a reason to have an extra id for the chunk filename.

It seems to me it should be unique enough if the chunks always end up identical, but that is perhaps too much of an assumption.

  • If we move file name, size, mimetype to a separate table whenever we want to list/sort/search we have to make join(s) which will make it slow, and I'd say listing/searching by these three attributes is essential. A simplest listing needs to return at least the file size, name and mimetype, this makes this basic query much complicated/slower.

The assumption is that most if not all "lists" of "files" do not contain a column for the attributes such as size, mtime nor mimetype, let alone allow for "sorting by".

Most? All file managers/pickers I know display file name, size and mtime (mimetype is often used to display a different icon).

Furthermore, a "list of files" even with metadata does not require us to use a single roundtrip -- it can be a) a list of files and b) get some metadata -- avoiding joins.

Not if you sort by or search by these fields. And sorting by size/name/mtime is essential.

That being said, where additional metadata such as size and/or mtime are to be included in the listing, we should probably consider the fs_items column values to be a derivative -- there could be many aspects that could justify updating mtime, such as changing permissions.

updating mtime is not very relevant in this context, imo.

  • If we wanted to re-use the existing SettingsTrait the table name would have to be fs_file_settings, or you insist on "attributes"?

fs_item_settings I suppose? 'attributes' would be more natural, can it simply inherit/extend SettingsTrait or something creative to avoid duplication of code?

Extending would not make sense, because getSettings() would become getAttributes(), etc.

  • If we move permissions to settings/attributes then a "search by share id" query will be very slow, as it will be a non-indexed query on a big table. Therefore I'd keep the fs_permissions table as is.

I don't understand how/why a search by share id would be very slow. For wallet transactions for example, we already search a table that is millions of rows per calendar year. This is all about the right indexes and cardinality.

Knowing how user_settings table looks like, how would you store permissions to make them easily/fast searchable by the permission identifier, please?

  • I understand fs_chunks needs a sequence column, but what's wrong with the incremented id? Ok, I guess with a generated id we can make it more performant, because we don't need to make an INSERT first. I.e. we'd generate chunk-id, save the chunk-file, then insert. Do you want to make the chunk id unique/primary key?

An auto-increment ID can not be influenced; it always goes up.

Further, in a MariaDB Galera environment, there's an additional increment for each server in the cluster. You can see this in the ticket numbers on this here Phabricator for example -- the auto-increment steps is 3.

But these aren't problems for the use-case we're working on here, imo. The uniqueness is the only thing we need. Do you want to make the chunk id unique/primary key if it's a generated uuid? I don't think it needs to be unique, file_id + chunk_id should be, but I wouldn't even check its existence before an INSERT.

  • @mollekopf, the file_id + sequence is not unique enough, there might be more than one deleted record with the same file_id + sequence combination. Especially considering file updates and partial uploads. So, I see a reason to have an extra id for the chunk filename.

What is unique is file_id, seq_num and deleted_at.

Not really. The soft-deleted chunks are to be deleted from the filesystem/db later, if for exampel the job fails, and you update the file multiple times you might end up with multple soft-deleted chunks of the file with same seq_num. Btw, I'd also store chunk size in this table (would be useful for partial uploads).

  • If we move file name, size, mimetype to a separate table whenever we want to list/sort/search we have to make join(s) which will make it slow, and I'd say listing/searching by these three attributes is essential. A simplest listing needs to return at least the file size, name and mimetype, this makes this basic query much complicated/slower.

The assumption is that most if not all "lists" of "files" do not contain a column for the attributes such as size, mtime nor mimetype, let alone allow for "sorting by".

Most? All file managers/pickers I know display file name, size and mtime (mimetype is often used to display a different icon).

Furthermore, a "list of files" even with metadata does not require us to use a single roundtrip -- it can be a) a list of files and b) get some metadata -- avoiding joins.

Not if you sort by or search by these fields. And sorting by size/name/mtime is essential.

If we decide to go with a generic database layout, we inherently make some queries less efficient. In the cases I have thought about I don't think it matters.

If you fetch a list of 50 files a join joining size/mtime/.... won't drastically matter.
Sorting by size can be essential, but if you're just sorting 50 items, you can just do it in memory, and we're not going sort all files of all users by size.
Same goes for name/mtime/....

You are of course correct that some specific queries are slower, but the assumption is that is just won't matter for what we would typically do, because we operate on relatively small datasets per user/share/folder/,
and we buy ourselves flexibility in building features without having to execute db migrations, can add features on beta.kolabnow.com on the same db as kolabnow.com etc.

I think there eventually will be usecases where we'll run into problems performance wise, e.g. if we start looking into client synchronizations of large sets of files. But at that point we're likely performance wise better of with building prupose built indexes (as in a db table that contains what we need),
for that specifically.

  • If we move file name, size, mimetype to a separate table whenever we want to list/sort/search we have to make join(s) which will make it slow, and I'd say listing/searching by these three attributes is essential. A simplest listing needs to return at least the file size, name and mimetype, this makes this basic query much complicated/slower.

The assumption is that most if not all "lists" of "files" do not contain a column for the attributes such as size, mtime nor mimetype, let alone allow for "sorting by".

Most? All file managers/pickers I know display file name, size and mtime (mimetype is often used to display a different icon).

But we're not after a traditional file manager.

A user is presumed to use Kolab Now Files with a purpose other than to look at sizes and modification timestamps.

There may be use-cases in which size and mtime are particularly useful, but those use-cases are not "all use-cases", and not unlikely have "additional requirements" or expectations as to what mtime for example represents.

"Recently uploaded photos" could come to mind, in order to tag/moderate/create album -- but afterwards, it's largely pointless.

"Recently progressed documents in workflows" could come to mind, but has little to do with the file mtime.

"Files to synchronize" comes to mind, but requires a program to use mtime, not the UI.

Furthermore, a "list of files" even with metadata does not require us to use a single roundtrip -- it can be a) a list of files and b) get some metadata -- avoiding joins.

Not if you sort by or search by these fields. And sorting by size/name/mtime is essential.

No, it's not.

For a Nautilus filesystem browser, yes. For a traditional file cloud, sure. But those things already exist and we need not pursue "yet another" implementation of those things.

That being said, where additional metadata such as size and/or mtime are to be included in the listing, we should probably consider the fs_items column values to be a derivative -- there could be many aspects that could justify updating mtime, such as changing permissions.

updating mtime is not very relevant in this context, imo.

  • If we wanted to re-use the existing SettingsTrait the table name would have to be fs_file_settings, or you insist on "attributes"?

fs_item_settings I suppose? 'attributes' would be more natural, can it simply inherit/extend SettingsTrait or something creative to avoid duplication of code?

Extending would not make sense, because getSettings() would become getAttributes(), etc.

Hmm, yes, that's right. Even though we could parent::getSettings() for an additional function getAttributes(), I suppose it's still not clean.

Should we entertain a different trait then? Would there (possibly) be enough distinction to separate them without considering it blunt code duplication?

Is there a purpose to, for example, not honour much of the HTTP protocol headers (content length, type, etc.?) and instead introduce something like a collectAttributes() which, for all intents and purposes, examines the file with whatever additional logic may be required? encoded geo-location in pictures, etc.?

  • If we move permissions to settings/attributes then a "search by share id" query will be very slow, as it will be a non-indexed query on a big table. Therefore I'd keep the fs_permissions table as is.

I don't understand how/why a search by share id would be very slow. For wallet transactions for example, we already search a table that is millions of rows per calendar year. This is all about the right indexes and cardinality.

Knowing how user_settings table looks like, how would you store permissions to make them easily/fast searchable by the permission identifier, please?

If jack shares a file with john, john will not have to search for files shared with him -- instead he gets a link for the occasion, which is already linked to both "him" and "the item id";

The lookup "key" therefore becomes the share-link-id -- prefixed some recognizable way in order to be able to get just these attributes;

item_id, key = 'share_<share-link-id1>', value = 'john@kolab.org'
item_id, key = 'share_<share-link-id2>', value = 'jane@kolab.org'
item_id, key = 'share_<share-link-id3>', value = 'jack@kolab.org'

If we prepare for a future, it could be json:

item_id, key = 'share_<share-link-id1>', value = '{"john@kolab.org":"r"}' -- for a regular file
item_id, key = 'share_<share-link-id2>', value = '{"john@kolab.org":"rw"}' -- for a collaboration directory
item_id, key = 'share_<share-link-id3>', value = '{"john@kolab.org":"w"}' -- for an "upload only" location?

We could also entertain JSON_SEARCH(), but that's probably not the most efficient;

item_id, key = 'permissions', value = '{"<share-link-id1>":"john@kolab.org"}' -- only applicable for read-only then ...
item_id, key = 'permissions', value = '{"john@kolab.org":{"id":"<share-link-id1>","access":"r"}}' -- i would not know how to query these ...
  • I understand fs_chunks needs a sequence column, but what's wrong with the incremented id? Ok, I guess with a generated id we can make it more performant, because we don't need to make an INSERT first. I.e. we'd generate chunk-id, save the chunk-file, then insert. Do you want to make the chunk id unique/primary key?

An auto-increment ID can not be influenced; it always goes up.

Further, in a MariaDB Galera environment, there's an additional increment for each server in the cluster. You can see this in the ticket numbers on this here Phabricator for example -- the auto-increment steps is 3.

But these aren't problems for the use-case we're working on here, imo. The uniqueness is the only thing we need. Do you want to make the chunk id unique/primary key if it's a generated uuid? I don't think it needs to be unique, file_id + chunk_id should be, but I wouldn't even check its existence before an INSERT.

Using auto-increment as the sequence id makes replacing a chunk difficult if the original needs to be preserved (for a time) -- if we're not using an auto-increment field that way, then the use of and reliance upon it would be blocking a potential future iteration.

  • @mollekopf, the file_id + sequence is not unique enough, there might be more than one deleted record with the same file_id + sequence combination. Especially considering file updates and partial uploads. So, I see a reason to have an extra id for the chunk filename.

What is unique is file_id, seq_num and deleted_at.

Not really. The soft-deleted chunks are to be deleted from the filesystem/db later, if for exampel the job fails, and you update the file multiple times you might end up with multple soft-deleted chunks of the file with same seq_num. Btw, I'd also store chunk size in this table (would be useful for partial uploads).

If you update the file multiple times, you end up with multiple soft-deleted chunks of the same item_id and seq_num, but with different deleted_at -- the trio can be unique constrained.

item_id, key = 'share_<share-link-id1>', value = '{"john@kolab.org":"r"}' -- for a regular file

So, how do you search fast by the key if you don't know item_id (which is the index here)? Should we just add an index on the key column alone?

machniak updated this revision to Diff 9976.Apr 15 2022, 1:15 PM
  • Files API schema refactoring
vanmeeuwen updated this revision to Diff 9982.Apr 15 2022, 2:34 PM

Rebase on beta-release

vanmeeuwen accepted this revision.Apr 15 2022, 2:39 PM
This revision is now accepted and ready to land.Apr 15 2022, 2:39 PM
Closed by commit rKd77c5d69f16a: [WIP] Files API (authored by Jeroen van Meeuwen (Apheleia IT) <vanmeeuwen@apheleia-it.ch>, committed by mollekopf). · Explain WhyThu, Apr 21, 1:33 PM
This revision was automatically updated to reflect the committed changes.