Discussion:
[Gramps-devel] DB-API code review
Nick Hall
2016-10-01 23:07:56 UTC
Permalink
Devs,

The code generally looks good, but not quite finished.

Firstly, I didn't look at the experimental API code in any detail. This
is part of GEPS 33 and currently unused. It needs to be moved into a
branch.

Next, I have a list of issues:

The autobackup method imports from cli, which is not allowed. The gen
module should be self-contained.

The public API has some new methods which need to be defined in the base
class. e.g. get_summary and get_schema_version

There are some public methods which need to be private or protected.
e.g. update_backlinks and update_secondary_values

BSDDB tables look like dictionaries. A Map class provides this
functionality in the DB-API code.

Only the import code accesses the database in this way. It would be
quite easy to rewrite the code to remove the need for the Map class. It
may be better to make this change now.

The get_*_from_gramps_id, has_*_handle, and get_raw_*_data methods use
the maps. This is not a good idea if we plan to get rid of them.

Why don't the get_raw_*_data methods just call the corresponding
_get_raw_*_data method? Do we plan to deprecate the raw methods?

BSDDB provides cursors to quickly iterate over data in a table. A Cursor
and TreeCursor class provides this functionality in the DB-API code.

The TreeCursor doesn't seem to have been implemented correctly. Cursors
are only used in the list/tree views. Do we want to replace them?

There are new has_handle_for_* methods which perform the same
functionality as has_*_handle. They should be removed. We don't need
duplicates.

There are new has_gramps_id_for_* methods. They should be renamed
has_*_gramps_id for consistency.

Do we really need the new get_*_gramps_ids methods?

Some groups of methods share repeated code. This needs tidying up.

My main concern is with the secondary fields. These are extra columns
that are added for indexing.

Some columns have long names such as
primary_name__surname_list__0__surname. This is rather messy.

There are some duplicates (in addition to duplicating information in the
serialised data). The person table has columns gender and gender_type;
surname and primary_name__surname_list__0__surname for example.

Columns that contain data from other tables are not updated. For
example the father_handle__primary_name__first_name column is not
updated when the name of the father changes.

Secondary columns are added automatically from a schema in the primary
objects. We need to index some fields such as gramps_id, but should we
index all top level fields with a simple type? What is the point of
indexing tag color or media checksum for example? Adding extra indexes
will slow down imports.

Indexing isn't as simple as creating an index for every column.
Sometimes it is better to index on multiple columns for example
(latitude, longitude) or (surname, first_name). In Gramps, the "sort
as" field also has to be taken into account for names.

The secondary field/column design needs to be reviewed. A simpler
implementation may be preferable.

I think that I have said enough for now. I await any comments.

Regards,


Nick.
John Ralls
2016-10-02 09:27:43 UTC
Permalink
Post by Nick Hall
Devs,
The code generally looks good, but not quite finished.
Firstly, I didn't look at the experimental API code in any detail. This
is part of GEPS 33 and currently unused. It needs to be moved into a
branch.
The autobackup method imports from cli, which is not allowed. The gen
module should be self-contained.
The public API has some new methods which need to be defined in the base
class. e.g. get_summary and get_schema_version
There are some public methods which need to be private or protected.
e.g. update_backlinks and update_secondary_values
BSDDB tables look like dictionaries. A Map class provides this
functionality in the DB-API code.
Only the import code accesses the database in this way. It would be
quite easy to rewrite the code to remove the need for the Map class. It
may be better to make this change now.
The get_*_from_gramps_id, has_*_handle, and get_raw_*_data methods use
the maps. This is not a good idea if we plan to get rid of them.
Why don't the get_raw_*_data methods just call the corresponding
_get_raw_*_data method? Do we plan to deprecate the raw methods?
BSDDB provides cursors to quickly iterate over data in a table. A Cursor
and TreeCursor class provides this functionality in the DB-API code.
The TreeCursor doesn't seem to have been implemented correctly. Cursors
are only used in the list/tree views. Do we want to replace them?
There are new has_handle_for_* methods which perform the same
functionality as has_*_handle. They should be removed. We don't need
duplicates.
There are new has_gramps_id_for_* methods. They should be renamed
has_*_gramps_id for consistency.
Do we really need the new get_*_gramps_ids methods?
Some groups of methods share repeated code. This needs tidying up.
My main concern is with the secondary fields. These are extra columns
that are added for indexing.
Some columns have long names such as
primary_name__surname_list__0__surname. This is rather messy.
There are some duplicates (in addition to duplicating information in the
serialised data). The person table has columns gender and gender_type;
surname and primary_name__surname_list__0__surname for example.
Columns that contain data from other tables are not updated. For
example the father_handle__primary_name__first_name column is not
updated when the name of the father changes.
Secondary columns are added automatically from a schema in the primary
objects. We need to index some fields such as gramps_id, but should we
index all top level fields with a simple type? What is the point of
indexing tag color or media checksum for example? Adding extra indexes
will slow down imports.
Indexing isn't as simple as creating an index for every column.
Sometimes it is better to index on multiple columns for example
(latitude, longitude) or (surname, first_name). In Gramps, the "sort
as" field also has to be taken into account for names.
The secondary field/column design needs to be reviewed. A simpler
implementation may be preferable.
I think that I have said enough for now. I await any comments.
Nick,

Thanks for the review. The implementation issues (duplication, copy-pasted code, improper imports, etc.) should be turned into bugs on the 5.0.0 roadmap for tracking and coordination of effort.

The database design issues concerning indexing and normalization [1] I think should be discussed a bit here first to figure out whether, when, and how to improve the design. Once we have that worked out, document the desired design in the wiki and write some bug reports detailing the changes needed to get to that design. Anything that would make migration difficult should go on the 5.0.0 roadmap; other items we can discuss and decide whether they can wait for 5.1.0.

Are these "secondary columns for indexing" really for indexing? IOW, is CREATE_INDEX or the BDB equivalent [2] run on the result, or are the columns simply duplicated out of the pickle to expose them for searching or to save unpickling the blob? IIRC we had a long discussion about the need for the latter last winter.

Cursors aren't quite the same in DB-API as in BDB: The latter is a very low-level API that can be used to implement a SQL database; two examples are BDB itself since version 5 and MySQL since version 3.23. In BSD a cursor is basically a seek(3) operation into the underlying key file. In SQL cursors generally work on result sets (which of course could be SELECT * to get a whole table or view). See PEP 249[3] and Wikipedia[4] for more information.
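
For reference, iterating a DB-API (PEP 249) cursor in Python's sqlite3
module looks like this (a sketch with an illustrative table, not Gramps
code):

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE person (handle TEXT PRIMARY KEY, blob_data BLOB)")
cur = con.cursor()
cur.execute("SELECT handle, blob_data FROM person")
# The cursor walks the result set of this one query; it is not a seek
# pointer into the underlying key file the way a BDB cursor is.
for handle, blob_data in cur:
    pass  # process each row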

A general comment to get the discussion started: Our current design is based on a key (handle) and a pickled object, stored as a blob. This makes some sense for the low-level BDB API because it has two fields, the key used to find a record and the record itself. Python provides the pickle/unpickle API to easily serialize and deserialize objects, so why not? SQL works differently, and most of its power comes from having every field defined and visible to the database engine. The database engine can't see inside the pickle so can't make any use of that data. Converting to a SQL backend calls for a substantial redesign of the way that Gramps interacts with the backend. That doesn't have to be done for 5.0, in fact it makes sense to provide the backend first and then incrementally rework the rest to make use of the new capabilities. We'll have to accept that there may be a performance cost for large datasets in the meantime.
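
To make the pickle point concrete, here is a sketch of the key/blob
layout (illustrative names, with a tuple standing in for a serialised
Person; this is not the actual Gramps schema):

import pickle
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE person (handle TEXT PRIMARY KEY, blob_data BLOB)")
person = ("I0001", "Smith", "John")  # stand-in for a serialised Person
con.execute("INSERT INTO person VALUES (?, ?)",
            ("I0001", pickle.dumps(person)))
# The engine sees only an opaque blob, so finding every Smith means
# unpickling each row in Python rather than writing a WHERE clause.
smiths = []
for (blob,) in con.execute("SELECT blob_data FROM person"):
    obj = pickle.loads(blob)
    if obj[1] == "Smith":
        smiths.append(obj)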

Regards,
John Ralls

[1] https://en.wikipedia.org/wiki/Database_normalization
[2] https://docs.oracle.com/cd/E17275_01/html/programmer_reference/am_second.html
[3] https://www.python.org/dev/peps/pep-0249/#cursor-objects
[4] https://en.wikipedia.org/wiki/Cursor_(databases)
Nick Hall
2016-10-02 14:22:20 UTC
Permalink
Post by John Ralls
The database design issues concerning indexing and normalization [1] I think should be discussed a bit here first to figure out whether, when, and how to improve the design. Once we have that worked out, document the desired design in the wiki and write some bug reports detailing the changes needed to get to that design. Anything that would make migration difficult should go on the 5.0.0 roadmap; other items we can discuss and decide whether they can wait for 5.1.0.
Are these "secondary columns for indexing" really for indexing? IOW, is CREATE_INDEX or the BDB equivalent [2] run on the result, or are the columns simply duplicated out of the pickle to expose them for searching or to save unpickling the blob? IIRC we had a long discussion about the need for the latter last winter.
Not all secondary columns are indexed.

In BSDDB, we only index the gramps_id fields and name field in the tag
table.

In DB-API, the gramps_id columns are indexed, but the name column is
not. In addition the surname and given_name columns are indexed.

The reference table is also indexed in both BSDDB and DB-API.

The person table has the following duplicated columns:

given_name
surname
gender_type
primary_name__first_name
primary_name__surname_list__0__surname
gender

All four name columns are indexed.

The family table has the following columns which are not updated when
the linked people are edited:

father_handle__primary_name__surname_list__0__surname
father_handle__primary_name__first_name
mother_handle__primary_name__surname_list__0__surname
mother_handle__primary_name__first_name

Are we happy with these long column names? Would it be better to use
aliases?

We don't seem to be using the secondary columns. Are they used by
gramps-connect?


Nick.
Nick Hall
2016-10-02 14:23:02 UTC
Permalink
Post by John Ralls
Cursors aren't quite the same in DB-API as in BDB: The latter is a very low-level API that can be used to implement a SQL database; two examples are BDB itself since version 5 and MySQL since version 3.23. In BSD a cursor is basically a seek(3) operation into the underlying key file. In SQL cursors generally work on result sets (which of course could be SELECT * to get a whole table or view). See PEP 249[3] and Wikipedia[4] for more information.
I actually meant to say a Gramps cursor, by which I mean an iterator
which iterates over raw data in a table.

https://gramps-project.org/wiki/index.php?title=Using_database_API#Iterating_through_the_database


I think that only the list/tree view code uses this type of cursor.

Do we still want this in 5.0?


Nick.
John Ralls
2016-10-02 15:32:52 UTC
Permalink
Post by Nick Hall
Post by John Ralls
Cursors aren't quite the same in DB-API as in BDB: The latter is a very low-level API that can be used to implement a SQL database; two examples are BDB itself since version 5 and MySQL since version 3.23. In BSD a cursor is basically a seek(3) operation into the underlying key file. In SQL cursors generally work on result sets (which of course could be SELECT * to get a whole table or view). See PEP 249[3] and Wikipedia[4] for more information.
I actually meant to say a Gramps cursor, by which I mean an iterator
which iterates over raw data in a table.
https://gramps-project.org/wiki/index.php?title=Using_database_API#Iterating_through_the_database
I think that only the list/tree view code uses this type of cursor.
Do we still want this in 5.0?
Nick,

That's just a thinly-disguised BDB cursor. From the cited wiki page: "The database can provide a cursor that will iterate through the database without having to load all handles into memory."

BDB can do that because it's a low-level library. SQL can't. With SQL you need to query the database and everything you ask for gets loaded into memory and you can deal with it as you please. The database API usually represents the results as some sort of iterable, depending upon the language containers and facilities. Doubly-linked lists are common representations.

An important take-away is that the data that a BDB cursor returns is live. You can manipulate it directly and then call write() on it and the new record will replace the old one (in place if it will fit, at the end of the file if not, BDB has internal code to manage that part). With a SQL result set any changes must be converted into an UPDATE query and run. If the only users of cursors are the list- and tree-views then that's not a consideration.

Note that the data displayed in the listviews (I'm going to just use that as a catch-all for "list- and tree-views"; easier to write and to read) has to be in memory anyway. Ideally we'd query for just the fields we want to display and load them directly into the listview's model.
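
Something along these lines is what I have in mind. A minimal sketch,
assuming the displayed fields exist as real columns (the table, column
and data values are illustrative):

import sqlite3

import gi
gi.require_version("Gtk", "3.0")
from gi.repository import Gtk

con = sqlite3.connect(":memory:")  # stands in for a real tree database
con.execute("CREATE TABLE person (handle TEXT PRIMARY KEY, "
            "surname TEXT, given_name TEXT)")
con.execute("INSERT INTO person VALUES ('I0001', 'Smith', 'John')")

store = Gtk.ListStore(str, str, str)  # handle, surname, given name
# Query only the fields the view displays and load them straight into
# the model, instead of fetching handles and unpickling whole objects.
for row in con.execute("SELECT handle, surname, given_name "
                       "FROM person ORDER BY surname"):
    store.append(list(row))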

Regards,
John Ralls
John Ralls
2016-10-02 15:46:32 UTC
Permalink
Post by John Ralls
Post by Nick Hall
Post by John Ralls
Cursors aren't quite the same in DB-API as in BDB: The latter is a very low-level API that can be used to implement a SQL database; two examples are BDB itself since version 5 and MySQL since version 3.23. In BSD a cursor is basically a seek(3) operation into the underlying key file. In SQL cursors generally work on result sets (which of course could be SELECT * to get a whole table or view). See PEP 249[3] and Wikipedia[4] for more information.
I actually meant to say a Gramps cursor, by which I mean an iterator
which iterates over raw data in a table.
https://gramps-project.org/wiki/index.php?title=Using_database_API#Iterating_through_the_database
I think that only the list/tree view code uses this type of cursor.
Do we still want this in 5.0?
Nick,
That's just a thinly-disguised BDB cursor. From the cited wiki page: "The database can provide a cursor that will iterate through the database without having to load all handles into memory."
BDB can do that because it's a low-level library. SQL can't. With SQL you need to query the database and everything you ask for gets loaded into memory and you can deal with it as you please. The database API usually represents the results as some sort of iterable, depending upon the language containers and facilities. Doubly-linked lists are common representations.
An important take-away is that the data that a BDB cursor returns is live. You can manipulate it directly and then call write() on it and the new record will replace the old one (in place if it will fit, at the end of the file if not, BDB has internal code to manage that part). With a SQL result set any changes must be converted into an UPDATE query and run. If the only users of cursors are the list- and tree-views then that's not a consideration.
I might add that some SQL implementations *can fake* the live data aspect with "updatable cursors", and that server-based databases may keep the result set in the server's memory instead of the client if the client declared a cursor on the query.
That would make a difference only if the server is on a different computer from the client. I don't think either is particularly germane to our usage.

Regards,
John Ralls
Nick Hall
2016-10-02 18:52:51 UTC
Permalink
Post by John Ralls
Note that the data displayed in the listviews (I'm going to just use that as a catch-all for "list- and tree-views"; easier to write and to read) has to be in memory anyway. Ideally we'd query for just the fields we want to display and load them directly into the listview's model.
The listviews only store the handle and sortkey. The column display
values are retrieved when required. There is a LRU cache to minimise
database access. I think it is 1000 rows by default.

I'll use the person view to illustrate my point about the secondary
columns. Suppose we wanted to populate a Gtk.ListStore from a SQL
query. What columns would we need?

A surname and given_name column isn't enough to construct any chosen
name format.

For the birth and death date and place we need to join to the event
table, but we don't have birth_handle and death_handle columns. Instead
we have birth_ref_index and death_ref_index.

Even if we could join to the event table, it has no date column. It does
have a column for the place handle.

The place table doesn't have a name column. A place can have multiple
names. It only has the legacy title column which many users may not use.

The tables don't have enough columns to provide even some basic
functionality. So why not just include the gramps_id and order_by
columns, both of which are indexed?

Next look at the family table. It does have father_handle and
mother_handle columns, so we could join it to the person table to get the
parents' names. Why does it have father and mother surname and first
name fields? They are not even updated when a parent's name is edited.


Nick.
John Ralls
2016-10-02 20:21:31 UTC
Permalink
Post by John Ralls
Note that the data displayed in the listviews (I'm going to just use that as a catch-all for "list- and tree-views"; easier to write and to read) has to be in memory anyway. Ideally we'd query for just the fields we want to display and load them directly into the listview's model.
The listviews only store the handle and sortkey. The column display values are retrieved when required. There is a LRU cache to minimise database access. I think it is 1000 rows by default.
I'll use the person view to illustrate my point about the secondary columns. Suppose we wanted to populate a Gtk.ListStore from a SQL query. What columns would we need?
A surname and given_name column isn't enough to construct any chosen name format.
For the birth and death date and place we need to join to the event table, but we don't have birth_handle and death_handle columns. Instead we have birth_ref_index and death_ref_index.
Even if we could join to the event table, it has no date column. It does have a column for the place handle.
The place table doesn't have a name column. A place can have multiple names. It only has the legacy title column which many users may not use.
The tables don't have enough columns to provide even some basic functionality. So why not just include the gramps_id and order_by columns, both of which are indexed?
Next look at the family table. It does have father_handle and mother_handle columns, so we could join it to the person table to get the parents' names. Why does it have father and mother surname and first name fields? They are not even updated when a parent's name is edited.
Nick,

I don't quite understand the point you're trying to make. The tables don't have the needed columns because we have all of the useful data buried in pickles where the database engine can't see it. Getting rid of the pickles is a prerequisite to my ideal case and to making good use of a SQL database. If we're going to keep pickling then we should simply dump DB-API altogether. It's stupid to have all of that overhead when all we want is a key-value store.

Mind, I'm advocating for keeping SQL and dropping pickling: I'm confident that a correctly designed and used SQL backend will make the rest of Gramps easier to maintain and easier to improve.

Regards,
John Ralls
Nick Hall
2016-10-02 21:26:22 UTC
Permalink
Post by John Ralls
I don't quite understand the point you're trying to make. The tables don't have the needed columns because we have all of the useful data buried in pickles where the database engine can't see it. Getting rid of the pickles is a prerequisite to my ideal case and to making good use of a SQL database. If we're going to keep pickling then we should simply dump DB-API altogether. It's stupid to have all of that overhead when all we want is a key-value store.
Mind, I'm advocating for keeping SQL and dropping pickling: I'm confident that a correctly designed and used SQL backend will make the rest of Gramps easier to maintain and easier to improve.
I have no problem with getting rid of pickles.

What I am trying to understand is where we are heading. Are we aiming
for a normalised relational model, or perhaps a json object store, or a
hybrid design, or something else?

If 5.0 is still going to contain pickles, why do we need secondary
columns that are not indexed? They don't seem to be particularly useful
at the moment.


Nick.
Nick Hall
2016-10-03 18:25:30 UTC
Permalink
We have to decide where to head. I'd prefer a normalized relational model. Well, actually I'd prefer a document model like NoSQL (MongoDB and friends), but I think that's too hard to implement in a desktop application. What do *you* want? What do other devs want?
I'll start another thread to discuss this.
But if we're going to ship 5.0 anytime soon we've got to work on polishing up what we have now and that's a slightly hybridized key-value store with some columns duplicated for indexing and maybe performance.
I agree.
It looks to me like the (father|mother)_handle__primary_name__.* fields are used for sorting family handles in get_family_handles() starting at gramps/plugins/db/dbapi/dbapi.py line 516, but I don't see where those columns are created and that's the only place grep finds "primary_name__" (and that code is called only if sort_handles is true, the only call for which is the Family Group Report). Grep finds no instance of (father|mother)_handle__primary_name__.* anywhere. Where did you find them?
The secondary columns are added dynamically using 'ALTER TABLE ADD
COLUMN' in the rebuild_secondary_fields method. This calls
get_secondary_fields in gen/lib/tableobj.py, which in turn calls
get_schema and get_extra_secondary_fields in gen/lib/person.py and
gen/lib/family.py. The long field names are actually defined using a
dotted notation which is why a grep didn't find them.
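
Roughly, the dotted paths map onto the column names like this (a
sketch of the naming convention only, not the actual helper code):

def field_to_column(field):
    # e.g. "primary_name.surname_list.0.surname"
    #  ->  "primary_name__surname_list__0__surname"
    return field.replace(".", "__")

print(field_to_column("primary_name.surname_list.0.surname"))
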
I have created a wiki page with the output from ".schema --indent"
using sqlite3.
https://gramps-project.org/wiki/index.php?title=SQL_Schema
The query used to sort the family handles actually joins to the
person table and uses primary_name__first_name and
primary_name__surname_list__0__surname, which are duplicates of
given_name and surname.
What exactly do you mean by "secondary columns"?
Any column that contains the contents of a field.
The person table is indexed on surname, given_name, order_by, gramps_id, and (down at line 319) handle. The last one is a no-op, handle is declared as the PK in the CREATE TABLE so it's already the fastest way to access the table. I
Your post appears to be cut short.
I can see why the primary object tables have columns for handle,
blob_data, gramps_id and in some cases order_by. The person table
needs given_name and surname columns and the family table needs
father_handle and mother_handle columns. Do we need any more for 5.0?
Nick,
Did you mean for this to be off list?
No, not cut short, just a stray keystroke.
So those "extra secondary fields" in person.py and family.py are also
index_fields whose indices are created by create_secondary_indexes()
at the end of rebuild_secondary_fields, so to the indices listed above
we can add these. I expect that they're all there for performance
improvement. Note that all of the secondary fields are rebuilt at
load-time. If they're purely for performance then Doug probably
figured that was sufficient since they'd seldom be changed in normal
usage.
That leaves gender_type, which seems to be unused. Other than that,
the answer to "do we need any more" would seem to be "no, but users
with large databases may see a performance hit without some of the
others".
Regards,
John Ralls
John,

Your last reply was not sent to the list. I have copied this back onto
the list again.

Yes. The extra secondary fields defined at the moment are also indexed.
They are default search fields in gramps-connect.

These fields don't appear to be rebuilt at load-time. I just loaded
the example database and changed the father's surname in F0017. The
surname column in the person table was updated, but the column in the
family table was not. Then I re-started Gramps. The surname in the
family table was still incorrect.

The father and mother name columns in the family table should not be
necessary for performance. A table join should be efficient. This is
what relational databases are particularly good at.
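
For example, a join along these lines returns the father's name
without duplicating it in the family table. A sketch with illustrative
data; a real query would also have to cope with missing parents and
the "sort as" setting:

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE person (handle TEXT PRIMARY KEY,
                         surname TEXT, given_name TEXT);
    CREATE TABLE family (handle TEXT PRIMARY KEY,
                         father_handle TEXT, mother_handle TEXT);
    INSERT INTO person VALUES ('I0001', 'Smith', 'John');
    INSERT INTO family VALUES ('F0001', 'I0001', NULL);
""")
# The father's name is read from the person table at query time,
# so it can never go stale the way a copied column can.
for family_handle, surname, given_name in con.execute(
        "SELECT family.handle, person.surname, person.given_name "
        "FROM family JOIN person ON person.handle = family.father_handle"):
    print(family_handle, surname, given_name)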

I think we can also remove the gender_type, surname and given_name
fields since they are duplicates.

Nick.
John Ralls
2016-10-03 20:37:08 UTC
Permalink
We have to decide where to head. I'd prefer a normalized relational model. Well, actually I'd prefer a document model like NoSQL (MongoDB and friends), but I think that's too hard to implement in a desktop application. What do *you* want? What do other devs want?
I'll start another thread to discuss this.
But if we're going to ship 5.0 anytime soon we've got to work on polishing up what we have now and that's a slightly hybridized key-value store with some columns duplicated for indexing and maybe performance.
I agree.
It looks to me like the (father|mother)_handle__primary_name__.* fields are used for sorting family handles in get_family_handles() starting at gramps/plugins/db/dbapi/dbapi.py line 516, but I don't see where those columns are created and that's the only place grep finds "primary_name__" (and that code is called only if sort_handles is true, the only call for which is the Family Group Report). Grep finds no instance of (father|mother)_handle__primary_name__.* anywhere. Where did you find them?
The secondary columns are added dynamically using 'ALTER TABLE ADD COLUMN' in the rebuild_secondary_fields method. This calls get_secondary_fields in gen/lib/tableobj.py, which in turn calls get_schema and get_extra_secondary fields in gen/lib/person.py and gen/lib/family.py. The long field names are actually defined using a dotted notation which is why a grep didn't find them.
I have created a wiki page with the output from ".schema --indent" using sqlite3.
https://gramps-project.org/wiki/index.php?title=SQL_Schema <https://gramps-project.org/wiki/index.php?title=SQL_Schema>
The query used to sort the family handles actually joins to the person table and uses primary_name__first_name and primary_name__surname_list__0__surname, which are duplicates of given_name and surname.
What exactly do you mean by "secondary columns"?
Any column that contains the contents of a field.
The person table is indexed on surname, given_name, order_by, gramps_id, and (down at line 319) handle. The last one is a no-op, handle is declared as the PK in the CREATE TABLE so it's already the fastest way to access the table. I
Your post appears to be cut short.
I can see why the primary object tables have columns for handle, blob_data, gramps_id and in some cases order_by. The person table needs given_name and surname columns and the family table needs father_handle and mother_handle columns. Do we need any more for 5.0?
Nick,
Did you mean for this to be off list?
No, not cut short, just a stray keystroke.
So those "extra secondary fields" in person.py and family.py are also index_fields whose indices are created by create_secondary_indexes() at the end of rebuild_secondary_fields, so to the indices listed above we can add these. I expect that they're all there for performance improvement. Note that all of the secondary fields are rebuilt at load-time. If they're purely for performance then Doug probably figured that was sufficient since they'd seldom be changed in normal usage.
That leaves gender_type, which seems to be unused. Other than that, the answer to "do we need any more" would seem to be "no, but users with large databases may see a performance hit without some of the others".
Regards,
John Ralls
John,
Nick,
Your last reply was not sent to the list. I have copied this back onto the list again.
That's because the letter that I replied to wasn't sent to the list, hence the question at the top. No matter.
Yes. The extra secondary fields defined at the moment are also indexed. They are default search fields in gramps-connect.
These fields don't appear to be rebuilt at load-time. I just loaded the example database and changed the father's surname in F0017. The surname column in the person table was updated, but the column in the family table was not. Then I re-started Gramps. The surname in the family table was still incorrect.
That's likely a bug. It seems from the code in dbapi.py that they're intended to be rebuilt.
The father and mother name columns in the family table should not be necessary for performance. A table join should be efficient. This is what relational databases are particularly good at.
Well, joins can have performance implications but probably negligible at the scale we're working with. That might be true of indices as well. OTOH I think that Doug wouldn't have done the work if he hadn't observed a problem.
I think we can also remove the gender_type, surname and given_name fields since they are duplicates.
ISTR that surname and given_name were indexed to make populating the person listview and treeview faster. Those don't scale well in Gtk3; Enno has complained about performance since 4.0 and in fact continues to use 3.4 because of poor performance with his database. I've no doubt that there are others with the same problem.

Regards,
John Ralls
Nick Hall
2016-10-03 21:00:09 UTC
Permalink
Post by John Ralls
Your last reply was not sent to the list. I have copied this back onto the list again.
That's because the letter that I replied to wasn't sent to the list,
hence the question at the top. No matter.
Sorry, that was my fault. I sent the first message to you using the
wrong email address and it was rejected by the list. Then I sent it to
the list again using the correct address.

Nick.
Nick Hall
2016-10-03 21:06:18 UTC
Permalink
Post by John Ralls
Post by Nick Hall
The father and mother name columns in the family table should not be
necessary for performance. A table join should be efficient. This
is what relational databases are particularly good at.
Well, joins can have performance implications but probably negligible
at the scale we're working with. That might be true of indices as
well. OTOH I think that Doug wouldn't have done the work if he hadn't
observed a problem.
The code for the sorted family handles that you mentioned earlier uses a
join rather than these columns.

Nick.
John Ralls
2016-10-03 21:19:23 UTC
Permalink
Post by John Ralls
The father and mother name columns in the family table should not be necessary for performance. A table join should be efficient. This is what relational databases are particularly good at.
Well, joins can have performance implications but probably negligible at the scale we're working with. That might be true of indices as well. OTOH I think that Doug wouldn't have done the work if he hadn't observed a problem.
The code for the sorted family handles that you mentioned earlier uses a join rather than these columns.
Yes, it does. Plus I couldn't find any evidence that those indexed fields are actually used anywhere... so why did Doug do all of that work?

Regards,
John Ralls
Nick Hall
2016-10-04 17:56:50 UTC
Permalink
Post by John Ralls
Post by John Ralls
The father and mother name columns in the family table should not be necessary for performance. A table join should be efficient. This is what relational databases are particularly good at.
Well, joins can have performance implications but probably negligible at the scale we're working with. That might be true of indices as well. OTOH I think that Doug wouldn't have done the work if he hadn't observed a problem.
The code for the sorted family handles that you mentioned earlier uses a join rather than these columns.
Yes, it does. Plus I couldn't find any evidence that those indexed fields are actually used anywhere... so why did Doug do all of that work?
Good question.

He has created a mechanism for extracting data from an object using a
django-inspired notation. Together with a schema, this can be used to
create secondary columns. I can see why this could be useful for
indexing and accessing data outside of the blob.

However, I think we are soon going to meet limitations with this
approach. How many secondary columns will we create? How would we
represent a list of attributes?

There are other ways of exposing the object fields without duplicating data.

1. We could store the objects as json instead of blobs.

2. We could create a column for each top-level field. Some of these
fields, such as primary name or event_ref_list, would contain blobs.
This approach would allow us to use the relational features of the
database such as foreign key constraints. In the future we could create
extra tables for attributes or event references for example.

We could also consider a hybrid of these two approaches.

This is why it is important to know where we are heading. We seem to be
going down one route, but we may prefer another.

We probably don't want to include surname and first_name columns in the
family table at this stage.


Nick.
Tim Lyons
2016-10-04 22:25:40 UTC
Permalink
Post by Nick Hall
There are other ways of exposing the object fields without duplicating data.
1. We could store the objects as json instead of blobs.
2. We could create a column for each top-level field. Some of these
fields, such as primary name or event_ref_list, would contain blobs.
This approach would allow us to use the relational features of the
database such as foreign key constraints. In the future we could create
extra tables for attributes or event references for example.
We could also consider a hybrid of these two approaches.
This is why it is important to know where we are heading. We seem to be
going down one route, but we may prefer another.
I'm not sure what route you think we are going down?

Would you (plural) like to suggest some possible choices of route? I suppose
the baseline would be just to emulate the BSDDB database in DBAPI (i.e. just
have a key and blob database); am I right in thinking that this is how DBAPI
is used in the current master branch?

John suggests this would entail extra overhead for a higher level
abstraction and gain nothing from it. I wondered whether we might gain
reliability from a more easily understood (and hence verifiable) level of
abstraction, but John responded that "if reliability was the main goal I'd
think that understanding why user's DbEnvs get screwed up and fixing it
would be more profitable".


So, that's just for starters, what other routes do you have in mind?

Tim.



Nick Hall
2016-10-04 23:04:43 UTC
Permalink
Post by Tim Lyons
I'm not sure what route you think we are going down?
At the moment we are storing pickled objects as blobs. In addition, we
are duplicating some of the fields in the blobs as extra columns.
Post by Tim Lyons
Would you (plural) like to suggest some possible choices of route? I suppose
the baseline would be just to emulate the BSDDB database in DBAPI (i.e. just
have a key and blob database); am I right in thinking that this is how DBAPI
is used in the current master branch?
The options that I have suggested are:

1. Store the objects as json rather than blobs. The Sqlite json
functionality would allow us to index fields within the json and extract
fields from it. No extra columns would be necessary (a short sketch
follows this list). See:

https://www.sqlite.org/json1.html

2. Move towards a normalised model. We could start by expanding out the
top level of each object. Extra tables could be added to expose more of
the objects in columns. This would allow us to exploit the relational
features of the database.

As the number of joins increases we would probably have to implement
delayed data retrieval at some point.

3. A hybrid of the above two suggestions.
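
To illustrate option 1, a minimal sqlite3 sketch. It assumes a SQLite
build with the json1 extension and expression indexes; the object
layout is a simplified stand-in for our serialised data:

import json
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE person (handle TEXT PRIMARY KEY, json_data TEXT)")
person = {"gramps_id": "I0001",
          "primary_name": {"first_name": "John",
                           "surname_list": [{"surname": "Smith"}]}}
con.execute("INSERT INTO person VALUES (?, ?)",
            ("H0001", json.dumps(person)))
# An index on a json expression replaces the duplicated surname column.
con.execute(
    "CREATE INDEX person_surname_idx ON person (json_extract(json_data, "
    "'$.primary_name.surname_list[0].surname'))")
rows = con.execute(
    "SELECT handle FROM person WHERE json_extract(json_data, "
    "'$.primary_name.surname_list[0].surname') = ?", ("Smith",))
print(rows.fetchall())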


Nick.
Tim Lyons
2016-10-05 11:22:20 UTC
Permalink
Post by Nick Hall
Post by Tim Lyons
I'm not sure what route you think we are going down?
At the moment we are storing pickled objects as blobs. In addition, we
are duplicating some of the fields in the blobs as extra columns.
Are the extra columns used in Gramps master?

Are the extra columns used in gramps-connect (or does that just use other
extra created columns)?



Nick Hall
2016-10-05 14:33:34 UTC
Permalink
Post by Tim Lyons
Post by Nick Hall
At the moment we are storing pickled objects as blobs. In addition, we
are duplicating some of the fields in the blobs as extra columns.
Are the extra columns used in Gramps master?
Some of them. We need the handle, gramps_id and order_by columns. The
family handle sorting also uses the father_handle, mother_handle,
primary_name__first_name and primary_name__surname_list__0__surname columns.
Post by Tim Lyons
Are the extra columns used in gramps-connect (or does that just use other
extra created columns)?
Probably, but I haven't examined the gramps-connect code in any detail.


Nick.
Tim Lyons
2016-10-05 22:19:28 UTC
Permalink
Post by Nick Hall
Post by Tim Lyons
Post by Nick Hall
At the moment we are storing pickled objects as blobs. In addition, we
are duplicating some of the fields in the blobs as extra columns.
Are the extra columns used in Gramps master?
Some of them. We need the handle, gramps_id and order_by columns. The
family handle sorting also uses the father_handle, mother_handle,
primary_name__first_name and primary_name__surname_list__0__surname columns.
The values that are stored in the order_by columns (and others?) seem to be
constructed from a formula that depends on locale. (1) What happens if I
open my gramps database with a different locale from last time? (2) What
happens if I run a report using a different locale from the main locale of
the gramps application?

(3) Does the existing code allow for this? (4) If you have to construct new
columns whenever the locale changes, does it make it much less worthwhile
storing the columns in the first place?

(I know that the answers to (1, 2) and (4) in principle are "the code would
have to allow for changes to the locale" and "it is a tradeoff between cost
of set-up and number of times the column is used", but I hope for more than
in principle!). What is the answer to (3)?

Tim.



Nick Hall
2016-10-06 16:29:53 UTC
Permalink
Post by Tim Lyons
The values that are stored in the order_by columns (and others?) seem to be
constructed from a formula that depends on locale. (1) What happens if I
open my gramps database with a different locale from last time? (2) What
happens if I run a report using a different locale from the main locale of
the gramps application?
(3) Does the existing code allow for this? (4) If you have to construct new
columns whenever the locale changes, does it make it much less worthwhile
storing the columns in the first place?
(I know that the answers to (1, 2) and (4) in principle are "the code would
have to allow for changes to the locale" and "it is a tradeoff between cost
of set-up and number of times the column is used", but I hope for more than
in principle!). What is the answer to (3)?
A better approach would be to define a collation. In Sqlite, the
command would be:

connection.create_collation("glocale", glocale.strcoll)

Then we could perform a select like:

SELECT handle FROM source ORDER BY title COLLATE glocale;

There is no need to maintain a separate order_by column in most cases.
The person table would need a surname list column.

In Postgres, a collation is created using SQL:

CREATE COLLATION glocale (LOCALE = 'en_GB.UTF8');

I don't know the MySql equivalent.
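
Putting the Sqlite pieces together, a runnable sketch. Python's
locale.strcoll stands in for glocale.strcoll here, and the source table
is illustrative:

import locale
import sqlite3

locale.setlocale(locale.LC_COLLATE, "")  # use the user's locale

con = sqlite3.connect(":memory:")
con.create_collation("glocale", locale.strcoll)
con.execute("CREATE TABLE source (handle TEXT PRIMARY KEY, title TEXT)")
con.executemany("INSERT INTO source VALUES (?, ?)",
                [("S0001", "Église records"), ("S0002", "Estate papers")])
# The sort order follows the collation at query time, so nothing
# needs to be rebuilt when the locale changes.
for handle, title in con.execute(
        "SELECT handle, title FROM source ORDER BY title COLLATE glocale"):
    print(handle, title)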


Nick.
Tim Lyons
2016-10-06 17:18:18 UTC
Permalink
Post by Nick Hall
Post by Tim Lyons
The values that are stored in the order_by columns (and others?) seem to be
constructed from a formula that depends on locale. (1) What happens if I
open my gramps database with a different locale from last time? (2) What
happens if I run a report using a different locale from the main locale of
the gramps application?
(3) Does the existing code allow for this? (4) If you have to construct new
columns whenever the locale changes, does it make it much less worthwhile
storing the columns in the first place?
(I know that the answers to (1, 2) and (4) in principle are "the code would
have to allow for changes to the locale" and "it is a tradeoff between cost
of set-up and number of times the column is used", but I hope for more than
in principle!). What is the answer to (3)?
A better approach would be to define a collation. In Sqlite, the
connection.create_collation("glocale", glocale.strcoll)
SELECT handle FROM source ORDER BY title COLLATE glocale;
There is no need to maintain a separate order_by column in most cases.
The person table would need a surname list column.
CREATE COLLATION glocale (LOCALE = 'en_GB.UTF8');
I don't know the MySql equivalent.
So, are you saying the existing code needs fixing for this? (Better approach
could mean it is OK-ish as it is but there is a more elegant approach, or it
could mean it is not OK and some other approach is needed of which this is a
better one). If it's not OK, would you like to add it to your list in the
original message in this thread? Should I raise a bug for it, or will you?

BTW, I don't think most of the existing tables are defined to include the
appropriate field that is used for the order_by, e.g. page for citations,
tag.name for tag etc, so these would all need to be added (in the place of
the order_by field).

Tim.



Nick Hall
2016-10-06 18:52:34 UTC
Permalink
Post by Tim Lyons
The values that are stored in the order_by columns (and others?) seem to be
constructed from a formula that depends on locale. (1) What happens if I
open my gramps database with a different locale from last time? (2) What
happens if I run a report using a different locale from the main locale of
the gramps application?
(3) Does the existing code allow for this? (4) If you have to construct new
columns whenever the locale changes, does it make it much less worthwhile
storing the columns in the first place?
(I know that the answers to (1, 2) and (4) in principle are "the code would
have to allow for changes to the locale" and "it is a tradeoff between cost
of set-up and number of times the column is used", but I hope for more than
in principle!). What is the answer to (3)?
A better approach would be to define a collation. In Sqlite, the
connection.create_collation("glocale", glocale.strcoll)
SELECT handle FROM source ORDER BY title COLLATE glocale;
There is no need to maintain a separate order_by column in most cases.
The person table would need a surname list column.
CREATE COLLATION glocale (LOCALE = 'en_GB.UTF8');
I don't know the MySql equivalent.
So, are you saying the existing code needs fixing for this? (Better approach
could mean it is OK-ish as it is but there is a more elegant approach, or it
could mean it is not OK and some other approach is needed of which this is a
better one). If it's not OK, would you like to add it to your list in the
original message in this thread? Should I raise a bug for it, or will you?
Please create a bug report for this. I don't expect many users will
change locale, but it is a bug.
Post by Tim Lyons
BTW, I don't think most of the existing tables are defined to include the
appropriate field that is used for the order_by, e.g. page for citations,
tag.name for tag etc, so these would all need to be added (in the place of
the order_by field).
Columns already exist for source title, citation page and tag name.
They are created dynamically using "ALTER TABLE ADD COLUMN". See:

https://gramps-project.org/wiki/index.php?title=SQL_Schema

The person table will need a surname list column.

I don't understand why places are sorted by the integer place type. It
seems to be an attempt to order by hierarchy level in the tree cursor,
but this won't work. The integer values of place types do not
correspond to a level in the hierarchy.


Nick.
paul womack
2016-10-07 08:02:57 UTC
Permalink
Post by Nick Hall
In Sqlite, the
connection.create_collation("glocale", glocale.strcoll)
SELECT handle FROM source ORDER BY title COLLATE glocale;
If the data is in SQL columns... :

I understand that single-table "filters" can be implemented
simply and VERY efficiently in SQL

Multi-table joined queries can also be implemented,
and are fairly simple (no worse than what it takes
in Python at the moment, TBH).

These queries would most likely be far quicker
than at present, since modern SQL query optimisation
is nigh on magic.

But the relationship based queries (speaking
generally, the graph-theoretic sort of queries)
are hard-to-impossible in SQL, and regular expressions,
and pleasing stuff like

http://search.cpan.org/~brianl/Lingua-EN-Nickname-1.16/Nickname.pm

are not going to happen in SQL, and they're rather key
to genealogical software.

It is easy to cherry pick examples so that Python searching
is clearly the "only sensible" answer, or so that SQL is the
"only sensible" answer.

If there are libraries that store Python hashes back to SQL
in a columnar way, I would regard that as the most desirable
implementation - SQL searching could then be added piecemeal,
as a performance option.

BugBear
Tom Hughes
2016-10-07 08:40:05 UTC
Permalink
Post by paul womack
But the relationship based queries (speaking
generally, the graph-theoretic sort of queries)
are hard-to-impossible in SQL, and regular expressions,
and pleasing stuff like
Graph type queries are perfectly possible in any database that supports
SQL99 recursive common table expressions. For example here's one to find
all birth events taking place in place 2853 and its descendants:

WITH RECURSIVE descendant_places AS (
SELECT place_id
FROM place_parents
WHERE parent_id = 2853
UNION
SELECT place_parents.place_id
FROM place_parents, descendant_places
WHERE place_parents.parent_id = descendant_places.place_id
)
SELECT events.*
FROM events
INNER JOIN descendant_places ON events.place_id = descendant_places.place_id
WHERE events.type = 'Birth';

That takes about 9ms to find 25 matching records in a postgresql
database extracted from my gramps database that contains about 7000
places and 17000 events.

Place 2853 is Gloucestershire and there are 40 other places descended
from it through a number of levels.

Pushing up one more level to England with nearly 6000 descendant places
and it takes 16ms to find 3482 events.

SQLite supports CTEs since 3.8.3 according to wikipedia, postgres since
8.4 and MariaDB since 10.2 although Oracle MySQL apparently doesn't.

Tom
--
Tom Hughes (***@compton.nu)
http://compton.nu/
paul womack
2016-10-07 08:52:19 UTC
Permalink
(bottom posted...)
Post by Tom Hughes
Post by paul womack
But the relationship based queries (speaking
generally, the graph-theoretic sort of queries)
are hard-to-impossible in SQL, and regular expressions,
and pleasing stuff like
WITH RECURSIVE descendant_places AS (
SELECT place_id
FROM place_parents
WHERE parent_id = 2853
UNION
SELECT place_parents.place_id
FROM place_parents, descendant_places
WHERE place_parents.parent_id = descendant_places.place_id
)
SELECT events.*
FROM events
INNER JOIN descendant_places ON events.place_id = descendant_places.place_id
WHERE events.type = 'Birth';
That takes about 9ms to find 25 matching records in a postgresql database extracted from my gramps database that contains about 7000 places and 17000 events.
Place 2853 is Gloucestershire and there are 40 other places descended from it through a number of levels.
Pushing up one more level to England with nearly 6000 descendant places and it takes 16ms to find 3482 events.
SQLite supports CTEs since 3.8.3 according to wikipedia, postgres since 8.4 and MariaDB since 10.2 although Oracle MySQL apparently doesn't.
That's most interesting, and something I didn't know. I presume that your statement
"Graph type queries are perfectly possible" should read "SOME Graph type queries are perfectly possible"...

In particular, in a tree drawing program I developed, I have arguments
for a "root person" and an argument for "maximum related distance from the root person"
which I use to draw small, localised trees from a larger DB.

To implement this I use

https://en.wikipedia.org/wiki/Dijkstra%27s_algorithm

which I suspect would be hard to express in SQL.

There is also the question of how, in Gramps, we give the user access to these sorts of queries.

I have worked professionally on a product where SQL queries were held in an
object-oriented form, and could (indeed) be combined in arbitrary and
flexible ways (we called them "query fragments"), but the queries were much more
limited than we're discussing at the moment.

BugBear
Tom Hughes
2016-10-07 09:04:45 UTC
Permalink
Post by paul womack
That's most interesting, and something I didn't know. I presume that your statement
"Graph type queries are perfectly possible" should read "SOME Graph type
queries are perfectly possible"...
In particular, in a tree drawing program I developed, I have arguments
for a "root person" and an argument for "maximum related distance from the root person"
which I use to draw small, localised trees from a larger DB.
There are some examples in the Postgres documentation that show how to
track the depth and record with each result row how deep in the search
it was and I think you should be able to terminate at a given depth as
well. See here:

https://www.postgresql.org/docs/9.6/static/queries-with.html

I think that something like this would work to limit the search to a
depth of three:

WITH RECURSIVE "descendant_places" AS (
SELECT place_id, 1 AS depth
FROM place_parents
WHERE parent_id = 2853
UNION
SELECT place_parents.place_id, descendant_places.depth + 1 AS depth
FROM place_parents, descendant_places
WHERE place_parents.parent_id = descendant_places.place_id
AND descendant_places.depth < 4
)

Note that I think this will produce duplicate records if you reach the
same place by more than one path with different lengths.

The examples in the Postgres doco even show how to record the paths and
detect cycles although that relies on Postgres arrays which aren't
standard SQL.

Tom
--
Tom Hughes (***@compton.nu)
http://compton.nu/
Gerald Britton
2016-10-07 11:23:35 UTC
Permalink
FWIW, the LIKE operator does simple wildcard searches in SQL. Also, SQL
Server supports recursive CTEs and has an optional component, MDS, that has
advanced matching algorithms. Some RDBMSs support regular expressions out
of the box.

Note that Oracle, Microsoft and IBM all have free versions of their
flagship products.
Post by Tom Hughes
Post by paul womack
That's most interesting, and something I didn't know. I presume that your statement
"Graph type queries are perfectly possible" should read "SOME Graph type
queries are perfectly possible"...
In particular, in a tree drawing program I developed, I have arguments
for a "root person" and an argument for "maximum related distance from the root person"
which I use to draw small, localised trees from a larger DB.
There are some examples in the Postgres documentation that show how to
track the depth and record with each result row how deep in the search
it was and I think you should be able to terminate at a given depth as
https://www.postgresql.org/docs/9.6/static/queries-with.html
I think that something like this would work to limit the search to a
WITH RECURSIVE "descendant_places" AS (
SELECT place_id, 1 AS depth
FROM place_parents
WHERE parent_id = 2853
UNION
SELECT place_parents.place_id, descendant_places.depth + 1 AS depth
FROM place_parents, descendant_places
WHERE place_parents.parent_id = descendant_places.place_id
AND descendant_places.depth < 4
)
Note that I think this will produce duplicate records if you reach the
same place by more than one path with different lengths.
The examples in the Postgres doco even show how to record the paths and
detect cycles although that relies on Postgres arrays which aren't
standard SQL.
Tom
--
http://compton.nu/
Tim Lyons
2016-10-05 11:48:52 UTC
Permalink
Post by Nick Hall
1. Store the objects as json rather than blobs. The Sqlite json
functionality would allow us to index fields within the json and extract
https://www.sqlite.org/json1.html
2. Move towards a normalised model. We could start by expanding out the
top level of each object. Extra tables could be added to expose more of
the objects in columns. This would allow us to exploit the relational
features of the database.
As the number of joins increases we would probably have to implement
delayed data retrieval at some point.
3. A hybrid of the above two suggestions.
For any of these, how would the new features be used in Gramps 5.0? IOW,
what are the benefits of the change?


In the case of (1):

(a) it seems that it would not provide an easy way to use the data in the
json, see
https://www.sqlite.org/json1.html#examples_using_json_each_and_json_tree_
which seems a very complicated query mechanism, not the simple SELECT WHERE
that people are expecting as a benefit of moving to SQL.

(b) Would storing the data as json support joins of the data in the json?

In the case of (2), would a partly normalised model be a benefit (compared
with a fully normalised one)? How would a partly normalised model be used in
the actual Gramps code?


In any case, I don't think we should make changes unless it can be
demonstrated by usage that the changes actually work. At the moment "Plus I
couldn't find any evidence that those indexed fields are actually used
anywhere... so why did Doug do all of that work?" [1] Are the extra columns
used at all? As Nick says in
http://gramps.1791082.n4.nabble.com/DB-API-code-review-tp4676981p4676995.html,
it seems that the actual columns provided are not particularly useful for
listviews.




I'm not arguing for or against anything (except to be sure that what we do
is actually useful in whatever we release); I am just trying to get
information.

Regards,
Tim.




[1] Nick says that "The code for the sorted family handles that you
mentioned earlier uses a join rather than these columns.", so what is the
join on? Does it use some of the extra columns?




John Ralls
2016-10-05 14:18:25 UTC
Permalink
Post by Tim Lyons
Nick says that "The code for the sorted family handles that you
mentioned earlier uses a join rather than these columns.", so what is the
join on? does it use some of the extra columns?
https://github.com/gramps-project/gramps/blob/master/gramps/plugins/db/dbapi/dbapi.py#L523

Regards,
John Ralls
Nick Hall
2016-10-05 16:03:28 UTC
Permalink
Post by Tim Lyons
For any of these, how would the new features be used in Gramps 5.0? IOW,
what are the benefits of the change?
I'm not proposing any such change for 5.0. I just want to know where we
are heading.
Post by Tim Lyons
(a) it seems that it would not provide an easy way to use the data in the
json, see
https://www.sqlite.org/json1.html#examples_using_json_each_and_json_tree_
which seems a very complicated query mechanism, not the simple SELECT WHERE
that people are expecting as a benefit of moving to SQL.
Yes. The queries are more complex, but it would be easier to
implement than a fully normalised model.
Post by Tim Lyons
(b) Would storing the data as json support joins of the data in the json?
I'm not quite sure what you mean. Table joins would be possible.
Post by Tim Lyons
In the case of (2), would part normalised be a benefit (compared with full
normalised)? How would part normalised be used in the actual Gramps code?
A normalised model would allow us to apply foreign key constraints.
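
For illustration, a minimal sketch of such a constraint using the sqlite3
module; the table and column names are hypothetical, not the actual Gramps
schema:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")   # SQLite only enforces FKs when enabled
conn.executescript("""
    CREATE TABLE person (handle TEXT PRIMARY KEY, surname TEXT);
    CREATE TABLE family (
        handle        TEXT PRIMARY KEY,
        father_handle TEXT REFERENCES person(handle),
        mother_handle TEXT REFERENCES person(handle)
    );
""")
conn.execute("INSERT INTO person VALUES ('P1', 'Smith')")
# Referencing a person that does not exist would now raise IntegrityError:
# conn.execute("INSERT INTO family VALUES ('F1', 'P1', 'P2')")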

It has been proposed that in the future we use indexes to increase
performance. An example might be helpful here. Suppose we want to find
all people who are butchers in the database, and the occupations are
stored as attributes.

At the moment we apply a filter which matches a rule against every person
in the database. How would we index the data?

Sqlite doesn't understand the data in a blob, so with our current design
we can't create an index. If we used json to store the objects, then we
could create an index on a json expression. The alternative would be to
create an attribute table containing key and value columns that could be
indexed.
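
To make the two alternatives concrete, a rough sketch using the sqlite3
module (assuming a reasonably recent SQLite with json1 and expression
indexes; the table and column names are hypothetical):

import sqlite3

conn = sqlite3.connect(":memory:")

# Alternative 1: keep the object as JSON and index a json expression.
conn.executescript("""
    CREATE TABLE person (handle TEXT PRIMARY KEY, json_data TEXT);
    CREATE INDEX idx_person_occupation
        ON person (json_extract(json_data, '$.occupation'));
""")
butchers = conn.execute(
    "SELECT handle FROM person"
    " WHERE json_extract(json_data, '$.occupation') = 'Butcher'").fetchall()

# Alternative 2: a separate attribute table with indexed key/value columns.
conn.executescript("""
    CREATE TABLE attribute (person_handle TEXT, key TEXT, value TEXT);
    CREATE INDEX idx_attribute_kv ON attribute (key, value);
""")
butchers = conn.execute(
    "SELECT person_handle FROM attribute"
    " WHERE key = 'Occupation' AND value = 'Butcher'").fetchall()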

If we didn't want to duplicate data we could store each top-level field
in a separate column. We could then remove the column containing the
attribute list as a blob. The problem is that the more the model is
normalised, the slower it is to read a complete object.

Is our data actually relational in nature? I'm sure some of our users
would like to see a fully relational model, but how many will be reading
the database directly?

Of course we could always add an index to our existing BSDDB database.
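
For completeness, a rough sketch of adding a secondary index with the
pybsddb (bsddb3) associate API; the file names and the unpickled layout are
hypothetical, not the actual Gramps tables:

import pickle
from bsddb3 import db

env = db.DBEnv()
env.open('/tmp/grampsdb', db.DB_CREATE | db.DB_INIT_MPOOL)

primary = db.DB(env)
primary.open('person.db', dbtype=db.DB_HASH, flags=db.DB_CREATE)

surname_idx = db.DB(env)
surname_idx.set_flags(db.DB_DUP)           # many people share a surname
surname_idx.open('surname.idx', dbtype=db.DB_BTREE, flags=db.DB_CREATE)

def surname_key(key, data):
    # Return the secondary key for each primary record (hypothetical layout).
    person = pickle.loads(data)
    return person['surname'].encode('utf-8')

# DB_CREATE asks associate() to build the index from the existing records.
primary.associate(surname_idx, surname_key, db.DB_CREATE)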

What are we trying to achieve?


Nick.
John Ralls
2016-10-05 20:55:21 UTC
Permalink
Post by Nick Hall
What are we trying to achieve?
An infinitely fast, incorruptibly reliable, universally portable storage solution.

That's not attainable in the real world so we have to be good engineers and figure out the tradeoffs to make the best compromise possible.

We *don't* have to do it in one release cycle, but we do want to set a goal and make progress toward it each cycle.

It's too late for any of that for 5.0 if it's going to be released before the end of this year, so perhaps we should just leave the backend design where it is for now and get the rest in shape for a release. After that we can spend a few months working out whether fully relational or relational-with-JSON works better. If we're going to stick to a major-release-every-spring schedule maybe we do that in branches and evaluate the results then merge the branch we like into master after 5.1.

The fact that it's damn hard to get major work done in a year is one reason why GnuCash has a 4-year major release cycle. It's also why downstream developers have been bitching incessantly about the instability of Gtk3 with its 6-month feature release cycle and why they've decided to switch to a 2-3 year more-or-less stable cycle.

Regards,
John Ralls
Nick Hall
2016-10-06 16:57:17 UTC
Permalink
Post by John Ralls
Post by Nick Hall
What are we trying to achieve?
An infinitely fast, incorruptibly reliable, universally portable storage solution.
That's not attainable in the real world so we have to be good engineers and figure out the tradeoffs to make the best compromise possible.
We*don't* have to do it in one release cycle, but we do want to set a goal and make progress toward it each cycle.
Exactly. I was hoping that we could agree on a broad direction for the
project, but it's not essential.
Post by John Ralls
It's too late for any of that for 5.0 if it's going to be released before the end of this year, so perhaps we should just leave the backend design where it is for now and get the rest in shape for a release. After that we can spend a few months working out whether fully relational or relational-with-JSON works better. If we're going to stick to a major-release-every-spring schedule maybe we do that in branches and evaluate the results then merge the branch we like into master after 5.1.
I think we only need to make a few minor changes.

1. Remove the duplicate gender_type, surname and given_name columns.
2. Remove the family name columns. There is a bug updating them and the
information is already available using a join.
3. Use a collation rather than the order_by columns (see the sketch below).
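
A minimal sketch of what the collation approach could look like with the
sqlite3 module, using a locale-aware comparison (the table and column names
are only illustrative):

import locale
import sqlite3

locale.setlocale(locale.LC_COLLATE, '')    # use the user's locale for sorting

def locale_compare(a, b):
    res = locale.strcoll(a, b)
    return (res > 0) - (res < 0)           # sqlite3 expects -1, 0 or 1

conn = sqlite3.connect(":memory:")
conn.create_collation("LOCALIZED", locale_compare)
conn.execute("CREATE TABLE person (handle TEXT, surname TEXT)")
rows = conn.execute(
    "SELECT handle, surname FROM person"
    " ORDER BY surname COLLATE LOCALIZED").fetchall()

An ORDER BY with the registered collation would then replace the
pre-computed order_by column.
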
Post by John Ralls
The fact that it's damn hard to get major work done in a year is one reason why GnuCash has a 4-year major release cycle. It's also why downstream developers have been bitching incessantly about the instability of Gtk3 with its 6-month feature release cycle and why they've decided to switch to a 2-3 year more-or-less stable cycle.
We could consider moving to a 2 year major release cycle.


Nick.
Paul Franklin
2016-10-06 19:40:22 UTC
Permalink
Post by Nick Hall
I think we only need to make a few minor changes.
1. Remove the duplicate gender_type, surname and given_name columns.
2. Remove the family name columns. There is a bug updating them and the
information is already available using a join.
3. Use a collation rather than the order_by columns.
I am not understanding every word of every post, but that's
all right, so let me just mention that when I read lines like
that I hope that nothing will (accidentally?) happen to stop
families being able to be sorted alphabetically.

The family filter (and rule, thanks Nick) which uses that was
done earlier this year, for a feature request on the tracker. I
enabled the BSDDB layer to sort family names by the given
name as well, and Doug added that to the DB-API layer (thanks,
Doug).

I remember an earlier post in this thread mentioned that (and
that it was only used in the Family Group report), so I am just
mentioning it again to remind you that the functionality will
need to be preserved.

But of course perhaps I was just needlessly worried.

Thanks.
Nick Hall
2016-10-06 20:07:14 UTC
Permalink
Post by Paul Franklin
I am not understanding every word of every post, but that's
all right, so let me just mention that when I read lines like
that I hope that nothing will (accidentally?) happen to stop
families being able to be sorted alphabetically.
The sorted family handles code doesn't use the name fields in the family
table, it uses the name fields in the person table via a join.
Relational databases are very good at this.
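
Schematically, the join is something like the sketch below (purely
illustrative column names; the real query is in the dbapi.py code linked
earlier in the thread):

import sqlite3

conn = sqlite3.connect("gramps.db")        # hypothetical database file
handles = [row[0] for row in conn.execute(
    "SELECT family.handle FROM family"
    " LEFT JOIN person ON family.father_handle = person.handle"
    " ORDER BY person.surname, person.given_name")]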

Nick.
Paul Franklin
2016-10-06 23:56:26 UTC
Permalink
Post by Nick Hall
Post by Paul Franklin
I am not understanding every word of every post, but that's
all right, so let me just mention that when I read lines like
that I hope that nothing will (accidentally?) happen to stop
families being able to be sorted alphabetically.
The sorted family handles code doesn't use the name fields in the family
table, it uses the name fields in the person table via a join.
Relational databases are very good at this.
Thank you.
Tim Lyons
2016-10-05 22:45:07 UTC
Permalink
Post by Nick Hall
Post by Tim Lyons
(b) Would storing the data as json support joins of the data in the json?
I'm not quite sure what you mean. Table joins would be possible.
Sorry I wasn't clear. What I meant was, could you join one table using data
in the json with another table using a column in the second table? Could you
join one table using data in the json with another table using data in the
json of the second table? (In each case without exposing the data in the
json as a separate column).

You also say "If we used json to store the objects, then we could create an
index on a json expression.". I can see that you could use a json expression
in a WHERE statement to do a SELECT, but the sqlite json extension doesn't
seem to explicitly say anything about creating indexes. I suppose you
could use a json expression in some index-create SQL command, but I don't
see how that would be much different than just using normal blob access to
create the index (OK, you could argue that the SQL is a bit more abstract
than Python, but Python is pretty abstract, and you have to be familiar with
the Python programming paradigms anyway to code in Gramps, so you are just
adding the burden of another paradigm).

Tim.



John Ralls
2016-10-06 07:40:53 UTC
Permalink
Post by Tim Lyons
You also say "If we used json to store the objects, then we could create an
index on a json expression.". I can see that you could use a json expression
in a WHERE statement to do a SELECT, but the sqlite json extension doesn't
seem to explicitly say say anything about creating indexes. I suppose you
could use a json expression in some index-create SQL command, but I don't
see how that would be much different than just using normal blob access to
create the index (OK, you could argue that the SQL is a bit more abstract
than Python, but Python is pretty abstract, and you have to be familiar with
the Python programming paradigms anyway to code in Gramps, so you are just
adding the burden of another paradigm).
The expressions have to be something that the DB engine understands and they're not going to understand Python.

The downside of JSON is that being an extension it's not portable. Every DB that supports it does so differently so we have to write an abstraction layer with implementations for each DB that we want to use (our 3 sophisticated users will each want a different DB ;-)) to handle the differences.

Regards,
John Ralls
Tim Lyons
2016-10-06 16:41:49 UTC
Permalink
Post by John Ralls
On Oct 6, 2016, at 12:45 AM, Tim Lyons wrote:
You also say "If we used json to store the objects, then we could create an
index on a json expression.". I can see that you could use a json expression
in a WHERE statement to do a SELECT, but the sqlite json extension doesn't
seem to explicitly say say anything about creating indexes. I suppose you
could use a json expression in some index-create SQL command, but I don't
see how that would be much different than just using normal blob access to
create the index (OK, you could argue that the SQL is a bit more abstract
than Python, but Python is pretty abstract, and you have to be familiar with
the Python programming paradigms anyway to code in Gramps, so you are just
adding the burden of another paradigm).
The expressions have to be something that the DB engine understands and
they're not going to understand Python.
Thanks, I understand that, but it was not what I meant.

Sorry not to have been more clear, but what I meant was (and apologies if
the SQL is rubbish - it is just pseudo code...)

Second question, how do you create an index on a json expression?
=================================================================
Taking the example in the json1 document: Suppose the table "CREATE TABLE
user(name,phone)" stores zero or more phone numbers as a JSON array object
in the user.phone field. To find all users who have any phone number with a
704 area code:
SELECT DISTINCT user.name
FROM user, json_each(user.phone)
WHERE json_each.value LIKE '704-%';

You say "If we used json to store the objects, then we could create an index
on a json expression.". Now, the WHERE statement above does not create an
index. It might use an index if there were one...

To create an index, you would say something like:

CREATE INDEX user_phone ON user(json_each(user.phone))

Is that possible? and would the above WHERE statement be likely to take
advantage of it?

=================================================================
Surely it's just as easy to create an index using python:

for user in get_users():
    for phone in user.phones:
        pseudo_code_SQL(CREATE_INDEX user_phone ON user(phone))
=================================================================
First question, can you do a join on a json field?

Suppose we have a further table "CREATE TABLE rack(phone,details)" which
stores the details of the wiring rack for each phone number.

could we do a join of a json expression with a database column (to get a
table of names and rack details) like

SELECT name,details
FROM user JOIN rack
ON json_each(user.phone) = rack.phone
=================================================================
Then similarly could we do a join of a json expression with another json
expression: JOIN...ON json_each(user.phone) = json_each(foo.bar)?
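
As far as the json1 extension goes, an index is possible on a scalar json
expression via json_extract, and json_each is a table-valued function that
can be joined like a table; a rough sketch following the user/phone example
(whether the optimiser would actually use such an index for these queries is
another matter):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE user (name TEXT, phone TEXT);   -- phone holds a JSON array
    CREATE TABLE rack (phone TEXT, details TEXT);
    -- An expression index on a scalar json value is possible; indexing every
    -- element of a JSON array with a plain expression index is not.
    CREATE INDEX user_first_phone ON user (json_extract(phone, '$[0]'));
""")

# json_each() can appear in the FROM clause and be joined to another table:
rows = conn.execute("""
    SELECT user.name, rack.details
    FROM user, json_each(user.phone) AS ph, rack
    WHERE rack.phone = ph.value
""").fetchall()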



regards,
Tim.




Tim Lyons
2016-10-06 22:45:26 UTC
Permalink
Post by John Ralls
The downside of JSON is that being an extension it's not portable. Every
DB that supports it does so differently so we have to write an abstraction
layer with implementations for each DB that we want to use (our 3
sophisticated users will each want a different DB ;-)) to handle the
differences.
Well, I think that this point alone makes it abundantly clear that json is
not the broad direction for the project.

Regards,
Tim.



Enno Borgsteede
2016-10-04 20:41:19 UTC
Permalink
John,
Post by John Ralls
ISTR that surname and given_name were indexed to make populating the
person listview and treeview faster. Those don't scale well in Gtk3;
Enno has complained about performance since 4.0 and in fact continues
to use 3.4 because of poor performance with his database. I've no
doubt that there are others with the same problem.
Most of my complaints about speed were cured by the caching code that
Doug made. The reason that I still use 3.4 is another one: The place
database. You can read about that in a recent thread on the user list,
in the "Gramps ate my data ..." thread.

Today, the situation is even worse, because the Gtk2 version included
with Debian testing, Mint 18, and Ubuntu 16.04, kills Gramps 3.4 too.
Expanding the last node in a tree triggers an exception, and Gramps
freezes whenever you mess with surnames near the end of the alphabet.

I would love to move to a Gramps 5.1 or 6 with a better place model,
with proper support for addresses, but for now, 3.4 is better than 4.2
or 5.0, mainly because of the information that I still have in titles.

regards,

Enno
John Ralls
2016-10-04 20:47:23 UTC
Permalink
Post by Enno Borgsteede
John,
Post by John Ralls
ISTR that surname and given_name were indexed to make populating the
person listview and treeview faster. Those don't scale well in Gtk3;
Enno has complained about performance since 4.0 and in fact continues
to use 3.4 because of poor performance with his database. I've no
doubt that there are others with the same problem.
Most of my complaints about speed were cured by the caching code that
Doug made. The reason that I still use 3.4 is another one: The place
database. You can read about that in a recent thread on the user list,
in the "Gramps ate my data ..." thread.
Today, the situation is even worse, because the Gtk2 version included
with Debian testing, Mint 18, and ubuntu 16.04, kills Gramps 3.4 too.
Expanding the last node in a tree triggers an exception, and Gramps
freezes whenever you mess with surnames near the end of the alphabet.
I would love to move to a Gramps 5.1 or 6 with a better place model,
with proper support for addresses, but for now, 3.4 is better than 4.2
or 5.0, mainly because of the information that I still have in titles.
Enno,

Thanks for the clarification.

Regards,
John Ralls
Tim Lyons
2016-10-03 21:52:41 UTC
Permalink
Post by Nick Hall
What I am trying to understand is where we are heading. Are we aiming
for a normalised relational model, or perhaps a json object store, or a
hybrid design, or something else?
If 5.0 is still going to contain pickles, why do we need secondary
columns that are not indexed? They don't seem to be particularly useful
at the moment.
I am strongly of the opinion that we shouldn't have things in the code
and/or database that are not used (and used effectively). If the code
doesn't use them, then we don't know that we have got the right interface
with the right parameters. That applies to indexed fields, or fields that
are exposed in the SQL data model (as well as code methods [1]).

I don't think we should include things like indexed fields just because "it
seems obvious that they will be useful in the future and we want them
available so that people can try using them". If we want to try something
using them, we should create a GEPS and a GEPS branch, and put the new
database structure and code in there and make changes throughout Gramps to
use the new structures.

I would be very concerned about incremental changes in a series of releases
to use a series of different features, especially because each change would
mean database conversion, and that is difficult to implement reliably,
because it is a one-off all-or-nothing operation.

That raises the question of what is the change for?

(1) some people seem to want it because they imagine using SQL queries on
the Gramps database, either from within some new add-on, or outside Gramps.
The argument above applies - let's see it used first in a GEPS (otherwise
it's just a fantasy).

(2) some people seem to think that the SQL backends may be more reliable
than BSDDB. I've got sympathy, because I am very concerned that BSDDB has
not been very reliable for us, but I think that could be because we are not
using it correctly. How will we be sure we are using DBAPI correctly (or is
it the case that there are fewer options in DBAPI and hence we are more
likely to be using it correctly)?

(3) I am particularly concerned that we are implementing transactions
correctly, and that seems particularly hard to test.

(4) To what extent do we want to consider remaining compatible with
gramps-connect? I note that you mention some things are used by
gramps-connect - is that through the normal gramps db interfaces or does it
use special lower level interfaces?


If we are eventually going for a model where all the data is exposed in SQL,
then I think that would be Gramps 6.0, not an evolution of Gramps 5.


Regards,
Tim.



[1] The only exception is that if we have a set of foo_*_bar methods for *
in object_type, then if many (but not all) of those methods are used, I
think it is probably sensible to include the complete set.




Tim Lyons
2016-10-03 22:17:50 UTC
Permalink
Performance.

It seems to be generally accepted that we should aim for a Gramps that will
cope well with Very Large datasets.

I presume that in order to do that we would want Gramps to not rely on
holding all data in memory, but just retrieving parts as needed.

As I understand the discussion about cursors, Gramps/BSDDB is designed so
that you could use a cursor to retrieve keys one at a time and then use them
to retrieve the records as you need them. And DBAPI would not be able to do
that because it works on result sets (which of course could be SELECT * to
get a whole table or view).

So in principle, BSDDB would be better for large datasets.

However, am I right in thinking that none of this applies because:

(1) BSDDB constructs a dictionary of all keys anyway (and the corresponding
Map class in DBAPI does the same), so at least the keys are in memory
anyway.

(2) Any of the list/tree views can be sorted by any column, which would
require access to all the underlying data to get the sort field, so you need
to retrieve everything anyway.

(3) If you solve the sort problem in DBAPI by exposing suitable fields and
doing a join, SQL would retrieve everything anyway because it does a SELECT *.

(4) Maybe none of this is very relevant in a world where memory (and
especially virtual memory) is very plentiful and pretty quick.

(5) Actually most of the slowness probably comes about because of other
factors (e.g. maybe import has lots of sequential searches).

Tim.



Benny Malengier
2016-10-04 07:11:34 UTC
Permalink
Post by Tim Lyons
(2) Any of the list/tree views can be sorted by any column, which would
require access to all the underlying data to get the sort field, so you need
to retrieve everything anyway.
(3) If you solve the sort problem in DBAPI by exposing suitable fields and
doing a join, SQL would retrieve everything anyway because it does a SELECT
*).
(4) Maybe none of this is very relevant in a world where memory (and
especially virtual memory) is very plentiful and pretty quick.
The idea for Very Large Datasets was that new views would be available to
work on those. I wrote such a thing commercially. What you do is that you
start with a filter. People enter the data they want to see, e.g. Lyon*, then
press search. Only then do they see a listview with the results. Columns would
not be sortable. If there are indexes on certain columns, the filter has a
checkbox to set the sort field.

One idea was to have a hard-coded threshold, based on the available RAM, for
the number of people in the database above which the current listviews would
be blocked and no longer available because they would be too slow.

These are only ideas; the point is that for very large datasets the
list/tree views cannot be used, but these filter-first views are not a
problem on a well-constructed database. The current bsddb can do it.
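
A rough sketch of what such a filter-first query could look like against a
SQL backend (hypothetical table and column names; the pattern comes from the
search box):

import sqlite3

conn = sqlite3.connect("gramps.db")        # hypothetical database file
pattern = "Lyon%"                          # 'Lyon*' from the search box as SQL LIKE
rows = conn.execute(
    "SELECT handle, surname, given_name FROM person"
    " WHERE surname LIKE ? ORDER BY surname",
    (pattern,)).fetchall()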

benny
John Ralls
2016-10-04 08:45:26 UTC
Permalink
Post by Tim Lyons
(2) Any of the list/tree views can be sorted by any column, which would
require access to all the underlying data to get the sort field, so you need
to retrieve everything anyway.
(3) If you solve the sort problem in DBAPI by exposing suitable fields and
doing a join, SQL would retrieve everything anyway because it does a SELECT
*).
(4) Maybe none of this is very relevant in a world where memory (and
especially virtual memory) is very plentiful and pretty quick.
The idea for Very Large Datasets was that new views would be available to work on those. I wrote such a thing commercially. What you do is that you start with a filter. People enter the data they want to see, eg Lyon*, then press search. Only then do they see a listview with the results. Columns not sortable. If indexes on certain columns, the filter has checkbox to set the sort field.
One idea was to have a hard coded value based on the internal RAM available to decide from which amount of people in the database the current listviews would be blocked and no longer available due to too slow.
All ideas only, the point is only that for very large datasets the list/treeviews cannot be used, but these filter first views are not a problem on a well constructed database. Current bsddb can do it.
There's another approach to this, compatible with the current GUI's treeview. I'll use SQL to describe the queries because it's easy to understand. There's an underlying assumption that one's database isn't a one-name study on a common surname like Smith.

The person treeview starts with just a list of surnames, so the initial query to load it is just SELECT surname FROM person. When the user clicks the disclosure button on a particular surname (let's use Smith), we run SELECT first_name, birthdate, deathdate FROM person WHERE surname = 'Smith' and use the result to populate the second tier of the treeview. When the user clicks a different surname we free the Smiths and run the second query with the new surname.

A third approach, this time for the listview, is to use TOP/LIMIT/OFFSET (the keyword depends on the db manager) to retrieve only some number of rows at a time. When the user scrolls close to one of the edges we run a new query for half the original number past that edge and free the same number from the other edge. A jump causes a complete free and re-query. There might be a perceptible lag in scrolling but it's likely to be a lot smaller than the awful lag Gtk3 GtkListView has when displaying more than a couple thousand rows. I think Doug has implemented something like this already in both the BDB and DBAPI code because of the GtkListView problems Enno reported.
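
A minimal sketch of that windowed fetch (SQLite syntax; the window size and
column names are arbitrary):

import sqlite3

WINDOW = 500                               # arbitrary page size

def fetch_window(conn, offset):
    """Fetch one window of rows for the listview."""
    return conn.execute(
        "SELECT handle, surname, given_name FROM person"
        " ORDER BY surname, given_name LIMIT ? OFFSET ?",
        (WINDOW, offset)).fetchall()

conn = sqlite3.connect("gramps.db")        # hypothetical database file
first_page = fetch_window(conn, 0)
next_page = fetch_window(conn, WINDOW)     # fetched as the user scrolls down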

Regards,
John Ralls
Nick Hall
2016-10-04 13:02:54 UTC
Permalink
Post by John Ralls
There's another approach to this, compatible with the current GUI's
treeview. I'll use SQL to describe the queries because it's easy to
understand. There's an underlying assumption that one's database isn't
a one-name study on a common surname like Smith.
The person treeview starts with just a list of surnames, so the
initial query to load it is just SELECT surname FROM person. When the
user clicks the disclosure button on a particular surname (let's use
Smith), we run SELECT first_name, birthdate, deathdate FROM person
WHERE surname = 'Smith' and use the result to populate the second tier
of the treeview. When the user clicks a different surname we free the
Smiths and run the second query with the new surname.
You probably don't remember, but I wrote a prototype for something like
this back in April 2015. I used two list views next to each other. The
left hand view contained a list of surnames. The right hand view was
populated according to the selected surname.

I used a similar approach with sources and citations.
Post by John Ralls
A third approach, this time for the listview, is to use
TOP/LIMIT/OFFSET (the keyword depends on the db manager) to retrieve
only some number of rows at a time. When the user scrolls close to one
of the edges we run a new query for half the original number past that
edge and free the same number from the other edge. A jump causes a
complete free and re-query. There might be a perceptible lag in
scrolling but it's likely to be a lot smaller than the awful lag Gtk3
GtkListView has when displaying more that a couple thousand rows. I
think Doug has implemented something like this already in both the BDB
and DBAPI code because of the GtkListView problems Enno reported.
Doug extended the LRU cache to all columns, and increased the default
size to 1000 rows. This made the scrolling smoother, but increased the
load time. As the tree models are also used in the selectors, this can
also be annoying.


Nick.
John Ralls
2016-10-04 08:55:14 UTC
Permalink
Post by Tim Lyons
Performance.
It seems to be generally accepted that we should aim for a Gramps that will
cope well with Very Large datasets.
I presume that in order to do that we would want Gramps to not rely on
holding all data in memory, but just retrieving parts as needed.
As I understand the discussion about cursors, Gramps/BSDDB is designed so
that you could use a cursor to retrieve keys one at a time and then use them
to retrieve the records as you need them. And DBAPI would not be able to do
that because it works on result sets (which of course could be SELECT * to
get a whole table or view).
So in principle, BSDDB would be better for large datasets.
(1) BSDDB constructs a dictionary of all keys anyway (and the corresponding
Map class in DBAPI does the same), so they at least the keys are in memory
anyway.
(2) Any of the list/tree views can be sorted by any column, which would
require access to all the underlying data to get the sort field, so you need
to retrieve everything anyway.
(3) If you solve the sort problem in DBAPI by exposing suitable fields and
doing a join, SQL would retrieve everything anyway because it does a SELECT
*).
(4) Maybe none of this is very relevant in a world where memory (and
especially virtual memory) is very plentiful and pretty quick.
(5) Actually most of the slowness probably comes about because of other
factors (e.g. maybe import has lots of sequential searches).
The BDB cursor reads one pickle into memory at a time, from which we can extract only the fields we need for whatever is the current use, free both the pickle and the extracted object, and repeat for the next one. The resulting dictionary will in most cases be smaller than having the whole objects in memory all at once.
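
As a sketch of that cursor pattern with the bsddb3 API (the unpickled layout
and field position are hypothetical):

import pickle
from bsddb3 import db

person_db = db.DB()
person_db.open('person.db', dbtype=db.DB_HASH, flags=db.DB_RDONLY)

cursor = person_db.cursor()
record = cursor.first()                    # returns (key, data), or None if empty
while record is not None:
    handle, data = record
    obj = pickle.loads(data)               # only one object in memory at a time
    surname = obj[3]                       # hypothetical: keep just the needed field
    record = cursor.next()
cursor.close()
person_db.close()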

A SQL query on a normalized database will have the same result with less work on our part. A SQL query on a pickled store can't do that, it can only retrieve all of the pickles for us to iterate over.

Once the GtkTreeModel or GtkListModel is loaded we have no more need for the pickles or the objects. Sort operations are performed on the GtkTree/ListModel, not on the underlying data.

Regards,
John Ralls
Tim Lyons
2016-10-03 20:37:26 UTC
Permalink
Post by John Ralls
I don't quite understand the point you're trying to make. The tables don't
have the needed columns because we have all of the useful data buried in
pickles where the database engine can't see it. Getting rid of the pickles
is a prerequisite to my ideal case and to making good use of a SQL
database. If we're going to keep pickling then we should simply dump
DB-API altogether. It's stupid to have all of that overhead when all we
want is a key-value store.
Just to understand what you are saying, what overhead:

Are you saying that a SQL database has inherently more overhead (even if it
is just used as a key-value store), or are you saying that there is more
overhead because more fields are indexed (so if only used as a key-value
store it is approximately the same)?



John Ralls
2016-10-03 21:28:27 UTC
Permalink
Post by Tim Lyons
Post by John Ralls
I don't quite understand the point you're trying to make. The tables don't
have the needed columns because we have all of the useful data buried in
pickles where the database engine can't see it. Getting rid of the pickles
is a prerequisite to my ideal case and to making good use of a SQL
database. If we're going to keep pickling then we should simply dump
DB-API altogether. It's stupid to have all of that overhead when all we
want is a key-value store.
Are you saying that a SQL database has inherently more overhead (even if it
is just used as a key-value store), or are you saying that there is more
overhead because more fields are indexed (so if only used as a key-value
store it is approximately the same)?
Tim,

Yes, a SQL database brings a bunch of overhead with it. That's the tradeoff one makes when using a higher-level abstraction library instead of a low-level one. Gramps is way down that road already by using Python instead of C++ (the developers of the latter being intent on being the exception that proves the rule ;-)) or C.

It's often worthwhile--nearly always worthwhile--to accept the overhead to gain the abstraction. Using a SQL backend correctly can make the database code more robust because the niggly bits that we have to deal with now are handled by programmers who have a lot of practice with those niggly bits. It can make the code easier to understand by people who haven't spent months studying the Gramps codebase because SQL is fairly simple and very widely understood. It makes writing new reports and views and adding new data elements easier for the same reason. But it does all of that wonderful stuff for us only if we use it the way it's intended to be used. If we continue to use the database as a simple key-value store, we've accepted the extra overhead and gained absolutely nothing for it.

Regards,
John Ralls
John Ralls
2016-10-04 07:02:04 UTC
Permalink
Thanks for a very clear explanation.
Post by John Ralls
Gramps is way down that road already by using Python instead of C++ (the developers of the latter being intent on being the exception that proves the rule ;-))
Do you mean that C++ is intent on proving that a higher-level abstraction can still have little overhead?
Yes.
Post by John Ralls
or C.
It's often worthwhile--nearly always worthwhile--to accept the overhead to gain the abstraction.
As someone pointed out to me just the other day, if it isn't, just wait a year, till Moore's law makes the hardware catch up with the inefficient software.
That's less true than it was a few years ago, and it presupposes that our users are all running the latest-and-greatest. Along that line, are you still using that PPC mac? I'm contemplating dropping PPC support for 5.0 and moving the minimum OSX version to 10.9.
Post by John Ralls
Using a SQL backend correctly can make the database code more robust because the niggly bits that we have to deal with now are handled by programmers who have a lot of practice with those niggly bits.
I hope that applies to the ACID aspects of the DBAPI database, but I still want convincing by understanding and examination of the codebase as much as by testing (which I think may be unfeasible).
Indeed, I don't know of a good way to test for ACID compliance. The sqlite3 module documentation has some good information on SQLite3 transactional behavior beginning at https://docs.python.org/3/library/sqlite3.html#controlling-transactions that you might find useful as preparation for a code review.
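
For reference, a small sketch of taking explicit control of transactions with
the sqlite3 module, along the lines of that documentation (the table and
column names are illustrative):

import sqlite3

# isolation_level=None puts the connection in autocommit mode, so we issue
# BEGIN/COMMIT/ROLLBACK ourselves instead of relying on implicit transactions.
conn = sqlite3.connect("gramps.db", isolation_level=None)
cur = conn.cursor()

cur.execute("BEGIN")
try:
    cur.execute("UPDATE person SET surname = ? WHERE handle = ?", ("Smith", "P1"))
    cur.execute("UPDATE family SET father_handle = ? WHERE handle = ?", ("P1", "F1"))
    cur.execute("COMMIT")
except sqlite3.Error:
    cur.execute("ROLLBACK")
    raise
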
Post by John Ralls
It can make the code easier to understand by people who haven't spent months studying the gramps codebase because SQL is fairly simple and very widely understood. It makes writing new reports and views and adding new data elements easier for the same reason. But it does all of that wonderful stuff for us only if we use it the way it's intended to be used. If we continues to use the database as a simple key-value store we've accepted the extra overhead and gained absolutely nothing for it.
Maybe we gain reliability?
Considering the problems we've had with BDB environments getting screwed up, it's hard to see how we'd lose reliability. Considering how widespread the use of BDB is, I think that's likely our fault rather than an inherent problem with BDB, so if reliability were the main goal I'd think that understanding why users' DbEnvs get screwed up and fixing it would be more profitable.

Regards,
John Ralls
Tim Lyons
2016-10-04 22:39:47 UTC
Permalink
Post by John Ralls
Post by John Ralls
It's often worthwhile--nearly always worthwhile--to accept the overhead
to gain the abstraction.
As someone pointed out to me just the other day, if it isn't, just wait a
year, till Moore's law makes the hardware catch up with the inefficient
software.
That's less true than it was a few years ago, and it presupposes that our
users are all running the latest-and-greatest. Along that line, are you
still using that PPC mac? I'm contemplating dropping PPC support for 5.0
and moving the minimum OSX version to 10.9.
I am still using PPC Mac (as well as occasionally Intel). I see that PPC Mac
downloads are still of the order of 10% of the Intel Mac downloads [1] (and
the same ratio seems to hold very roughly for GnuCash I think [2]).

Tim.


[1]
http://www.somsubhra.com/github-release-stats/?username=gramps-project&repository=gramps
[2]
https://sourceforge.net/projects/gnucash/files/gnucash%20%28stable%29/2.6.14/

http://www.somsubhra.com/github-release-stats/?username=gnucash&repository=gnucash
shows different ratios. 2.6.12 and 2.6.11 show much more PPC download usage.




John Ralls
2016-10-05 08:42:36 UTC
Permalink
Post by Tim Lyons
Post by John Ralls
Post by John Ralls
It's often worthwhile--nearly always worthwhile--to accept the overhead
to gain the abstraction.
As someone pointed out to me just the other day, if it isn't, just wait a
year, till Moore's law makes the hardware catch up with the inefficient
software.
That's less true than it was a few years ago, and it presupposes that our
users are all running the latest-and-greatest. Along that line, are you
still using that PPC mac? I'm contemplating dropping PPC support for 5.0
and moving the minimum OSX version to 10.9.
I am still using PPC Mac (as well as occasionally Intel). I see that PPC Mac
downloads are still of the order of 10%% of the Intel Mac downloads [1] (and
the same ratio seems to hold very roughly for Gnucash I think [2]).
Tim.
[1]
http://www.somsubhra.com/github-release-stats/?username=gramps-project&repository=gramps
[2]
https://sourceforge.net/projects/gnucash/files/gnucash%20%28stable%29/2.6.14/
http://www.somsubhra.com/github-release-stats/?username=gnucash&repository=gnucash
shows different ratios. 2.6.12 and 2.6.11 show much more PPC download usage.
For GnuCash in the last week, yes. Just under 7% of Intel for 2.6.14 since its release on September 17th (274 vs. 4137), and just under 8% for 2.6.13 since its release in June: 1180 vs. 15138.

Can't tell from the somsubhra page what date range they're covering, but based on the numbers for the older releases I suspect it's limited somehow.

People are still downloading Gramps from SourceForge as well, and we can set date ranges on those stats. 4.2.4 has had 31 PPC and 147 Intel downloads since the release; but all of the numbers this year for Intel are likely screwed up by the (I hope now resolved) code-signing issues. 4.2.0 from last year had 241 total PPC downloads from SF vs. 2575 Intel. But here's an interesting statistic: Only 61% of the PPC downloaders were using a Mac browser. It's possible, I suppose, that the other 100 or so sensibly downloaded from their Windows machines and then transferred the file to their PPCs, not wanting such an antique OS exposed to the net. But I doubt it. I don't think many Gramps users are that technically sophisticated. More likely better than a third of the PPC downloads are mistakes. There's no telling how many of the PPC downloads were really Intel users who clicked the wrong link.

Meanwhile for 4.2.3, released this spring, of the 194 downloads of the PPC bundle from SF only 18 used a Mac browser. Even for the 541 Intel downloads only 375 reported themselves as Macs. Same problem on GnuCash: For 2.6.13 fewer than half (469 of 1180) of the PPC downloads were from Macs, the bulk of the rest being "Unknown" (547). How many of those Mac users were really Intel users who clicked the wrong link isn't known. Contrast that with Intel, where 12110 of the 15138 downloads (79%) were from Mac browsers.

The other half of the equation is that it's getting increasingly difficult to reliably build the Gtk+ stack with the Xcode 3 toolchain. Some packages build only with the "real" gcc-4.2.1 compiler, others only with llvm-gcc. The result has serious stability issues with any Gtk after 3.14 -- 3.22 was just released. That's a lot of bug-fixes we're missing out on. Note that by building with the Xcode 6 toolchain (llvm/clang only) and targeting 10.9 all of the old 32-bit Intel boxes stuck on 10.6 and the few 64-bit ones with the old EFI stuck on 10.7 also lose out. There's another round of machines that won't be able to upgrade to 10.12 that are going to lose support in a few years. That's too bad, but it's unreasonable to expect that ancient hardware will be supported with new software forever.

Regards,
John Ralls
Tim Lyons
2016-10-05 23:04:55 UTC
Permalink
Post by John Ralls
People are still downloading Gramps from SourceForge as well, and we can
set date ranges on those stats. 4.2.4 has had 31 PPC and 147 Intel
downloads since the release; but all of the numbers this year for Intel
are likely screwed up by the (I hope now resolved) code-signing issues.
4.2.0 from last year had 241 total PPC downloads from SF vs. 2575 Intel.
But here's an interesting statistic: Only 61% of the PPC downloaders were
using a Mac browser. It's possible, I suppose, that the other 100 or so
sensibly downloaded from their Windows machines and then transferred the
file to their PPCs, not wanting such an antique OS exposed to the net. But
I doubt it. I don't think many Gramps users are that technically
sophisticated. More likely better than a third of the PPC downloads are
mistakes. There's no telling how many of the PPC downloads were really
Intel users who clicked the wrong link. Meanwhile for 4.2.3, released this
spring, of the 194 downloads of the PPC bundle from SF only 18 used a Mac
browser. Even for the 541 Intel downloads only 375 reported themselves as
Macs. Same problem on Gnucash: For 2.6.13 fewer than half (469 of 1180) of
the PPC downloads were from Macs, the bulk of the rest being "Unknown"
(547).How many of those Mac users were really Intel users who clicked the
wrong link isn't known. Contrast that with Intel, where 12110 of the 15138
downloads (79%) were from Mac browsers.
That's very strange, since usually SF detection of your OS means you are
offered Intel even if you are on PPC. I know it depends on how you get to
the download page, but you would think that mistake would compensate for PPC
downloads being mistakes. Why don't Intel users make similar mistakes as
often? I'd expect PC users to click on the Intel one instead of the Windows
one.

PPC can't use Firefox now and has to use TenFourFox. Perhaps that has a
greater propensity to lie about who it is (may explain why so many downloads
are apparently not from Macs) although my user agent string seems truthful.

I know there is no answer, just saying it's peculiar!

Tim.



John Ralls
2016-10-06 07:21:32 UTC
Permalink
Post by Tim Lyons
Post by John Ralls
People are still downloading Gramps from SourceForge as well, and we can
set date ranges on those stats. 4.2.4 has had 31 PPC and 147 Intel
downloads since the release; but all of the numbers this year for Intel
are likely screwed up by the (I hope now resolved) code-signing issues.
4.2.0 from last year had 241 total PPC downloads from SF vs. 2575 Intel.
But here's an interesting statistic: Only 61% of the PPC downloaders were
using a Mac browser. It's possible, I suppose, that the other 100 or so
sensibly downloaded from their Windows machines and then transferred the
file to their PPCs, not wanting such an antique OS exposed to the net. But
I doubt it. I don't think many Gramps users are that technically
sophisticated. More likely better than a third of the PPC downloads are
mistakes. There's no telling how many of the PPC downloads were really
Intel users who clicked the wrong link. Meanwhile for 4.2.3, released this
spring, of the 194 downloads of the PPC bundle from SF only 18 used a Mac
browser. Even for the 541 Intel downloads only 375 reported themselves as
Macs. Same problem on Gnucash: For 2.6.13 fewer than half (469 of 1180) of
the PPC downloads were from Macs, the bulk of the rest being "Unknown"
(547).How many of those Mac users were really Intel users who clicked the
wrong link isn't known. Contrast that with Intel, where 12110 of the 15138
downloads (79%) were from Mac browsers.
That's very strange, when usually SF detection of your OS means you are
offered Intel even if you are on PPC. I know it depends on how you get to
the download page, but you would think that mistake would compensate for PPC
downloads being mistakes. Why don't Intel users make similar mistakes as
often? I'd expect PC users to click on the Intel one instead of the Windows
one.
PPC can't use Firefox now and has to use TenFourFox. Perhaps that has a
greater propensity to lie about who it is (may explain why so many downloads
are apparently not from Macs) although my user agent string seems truthful.
I know there is no answer, just saying it's peculiar!
Tim,

SF offers Intel for Macs because we tell it to. There's only one checkbox for Macs on the info dialog. Regardless of how you get to the download page, that page reads and records the user-agent string for the statistics.

There may be as many PPC users accidentally downloading Intel as the other way around, but because there aren't many surviving PPC Macs the result is in the noise for the Intel statistics.

Another source of the "other" downloads is secondary download sites like CNET.com that hoover up packages to download so that they can push their own ads at users.

Regards,
John Ralls
Nick Hall
2016-10-02 14:32:18 UTC
Permalink
Post by John Ralls
A general comment to get the discussion started: Our current design is based on a key (handle) and a pickled object, stored as a blob. This makes some sense for the low-level BDB API because it has two fields, the key used to find a record and the record itself. Python provides the pickle/unpickle API to easily serialize and deserialize objects, so why not? SQL works differently, and most of its power comes from having every field defined and visible to the database engine. The database engine can't see inside the pickle so can't make any use of that data. Converting to a SQL backend calls for a substantial redesign of the way that Gramps interacts with the backend. That doesn't have to be done for 5.0, in fact it makes sense to provide the backend first and then incrementally rework the rest to make use of the new capabilities. We'll have to accept that there may be a performance cost for large datasets in the meantime.
I notice that there is a bug report related to this:

9392: Do we need to pickle DB-API blobs?

https://gramps-project.org/bugs/view.php?id=9392

Josip made the following comment:

"SQLite since version 3.9.0 have JSON support
https://www.sqlite.org/json1.html
with that it will be possible to use database outside of Gramps and
without Python."

This is an interesting option. It would allow data within an object to
be accessed and indexed directly without duplication.


Nick.
John Ralls
2016-10-02 15:34:32 UTC
Permalink
Post by Nick Hall
"SQLite since version 3.9.0 have JSON support
https://www.sqlite.org/json1.html
with that it will be possible to use database outside of Gramps and
without Python."
This is an interesting option. It would allow data within an object to
be accessed and indexed directly without duplication.
Postgres can also do JSON columns. In fact it has both json (which is text based and has to be reparsed when querying it) and jsonb which is stored in parsed form and can be indexed and queried much more efficiently.
How compatible are the APIs? IOW, would we need separate modules for SQLite3 JSON and PGSQL JSON?

Regards,
John Ralls
Tom Hughes
2016-10-02 15:49:08 UTC
Permalink
Post by John Ralls
Post by Nick Hall
"SQLite since version 3.9.0 have JSON support
https://www.sqlite.org/json1.html
with that it will be possible to use database outside of Gramps and
without Python."
This is an interesting option. It would allow data within an object to
be accessed and indexed directly without duplication.
Postgres can also do JSON columns. In fact it has both json (which is text based and has to be reparsed when querying it) and jsonb which is stored in parsed form and can be indexed and queried much more efficiently.
How compatible are the APIs? IOW, would we need separate modules for SQLite3 JSON and PGSQL JSON?
I suspect you're right about that. Details of the Postgres one are here:

https://www.postgresql.org/docs/9.6/static/datatype-json.html

Tom
--
Tom Hughes (***@compton.nu)
http://compton.nu/
John Ralls
2016-10-02 15:57:57 UTC
Permalink
Post by Tom Hughes
Post by John Ralls
Post by Nick Hall
"SQLite since version 3.9.0 have JSON support
https://www.sqlite.org/json1.html
with that it will be possible to use database outside of Gramps and
without Python."
This is an interesting option. It would allow data within an object to
be accessed and indexed directly without duplication.
Postgres can also do JSON columns. In fact it has both json (which is text based and has to be reparsed when querying it) and jsonb which is stored in parsed form and can be indexed and queried much more efficiently.
How compatible are the APIs? IOW, would we need separate modules for SQLite3 JSON and PGSQL JSON?
https://www.postgresql.org/docs/9.6/static/datatype-json.html
Yeah, completely different API from SQLite3, which uses named functions to access JSON.
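
To make the difference concrete, a sketch of the same lookup written both
ways (a hypothetical person table with the serialised object in a json/jsonb
column); an abstraction layer would have to pick one form or the other per
backend:

# SQLite json1: named functions
sqlite_query = (
    "SELECT handle FROM person"
    " WHERE json_extract(json_data, '$.gender') = 1")

# PostgreSQL: operators such as ->> on json/jsonb (and jsonb can be indexed)
postgres_query = (
    "SELECT handle FROM person"
    " WHERE (json_data ->> 'gender')::int = 1")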

Regards,
John Ralls