[ale] MySQL help

Mon Mar 23 22:38:11 EDT 2015

Ha!  Well, sometimes excellence goes unrecognized and even unnoticed.  Your
writing and explanations deserve recognition and applause!  I humbly offer
mine.  Too often we have to suffer through mediocre writing.  Not so with
you.  So thanks again!
Dave

On Fri, Mar 20, 2015 at 7:51 PM, Alex Carver <agcarver+ale at acarver.net>
wrote:

> You're quite welcome, I'm glad you got something out of it.  As for
> technical writing, I wrote too much in grad school and still have to
> write too much in my current job to want to go into voluntary technical
> writing professionally. ;)
>
> Of course that changes if I ever attempt to secure a professorship at a
> university somewhere in which case I'd be writing lesson plans and
> textbooks.  But that's a bit further off and the writing is still not
> completely voluntary but at least I'd have some advance warning. ;)
>
> On 2015-03-20 15:44, David Jackson wrote:
> > Hi Alex,
> >
> > No doubt we'll get to Normal Form at some point, but I want to thank you
> > for your most excellent explanation here.  Whatever text we read when we
> do
> > get to it would most definitely benefit from your write-up!  Have you
> ever
> > considered technical writing?  I would bet you've done a lot, because I
> > know that explanations this clear don't just happen accidentally.  I wish
> > my texts were as clear as your explanations!
> >
> > Thanks again!
> > Dave
> >
> > On Sat, Mar 14, 2015 at 7:09 PM, Alex Carver <agcarver+ale at acarver.net>
> > wrote:
> >
> >> I know it was an assignment so you're limited but it was something to
> >> keep in mind as the construction of that database is not the best
> >> example of a proper design.
> >>
> >> The rule isn't for MySQL specifically, it's for database design in
> >> general and is called Normal Form.  It applies to all relational
> >> databases independent of the engines that support them (MySQL, Postgres,
> >> etc.)  The concept isn't that the data is atomic more that it is not
> >> repeated unnecessarily which maximizes storage (this is still a critical
> >> thing for hypergiant databases used by NASA, Google, Amazon, etc.) and
> >> also allows better query structure and operation.
> >>
> >> There are multiple levels of Normal Form starting with Zero, which has
> >> no optimizations, going up to Five, which has the most (and sometimes
> >> very complicated) optimizations (there are sometimes "extra" unofficial
> >> levels referenced above Five but they dont' make much sense.)  Where you
> >> stop all depends on how easy it is to visualize the database structure
> >> at the higher NF levels (so that it can be implemented) and whether the
> >> data is a natural fit.  In most cases you end up with a hybrid database
> >> where most is at some NF level and small portions are at a lower level.
> >>
> >> The most common textbook examples for addressing NF start off with basic
> >> address books/contacts lists.  You start with an address book with a
> >> person's name and postal address:
> >>
> >> John Smith  123 Main St, Springville, IL 12345
> >> Jane Doe    456 Side St, Springhills, IL 12345
> >> Jim  Johns  789 Road St, Springvalley, IL 12345
> >> Jill Dale   123 High Rd, Fallglen, OH 23456
> >> Joe  Jims   456 Low Rd,  Fallcrest, OH 23456
> >>
> >> In a zero NF database, these would be recorded one at a time, lots of
> >> repeated information (obviously the ZIP and state).  So it's obvious
> >> that there is a minimization that can occur with the state at a minimum
> >> by breaking it out into a table and then relating the address book table
> >> with the state table.  Now when you store a state, you only have an
> >> integer between 0 and 49 (assuming only the 50 states) and integer
> >> storage is more efficient than a string of two characters in most
> >> database engines.
> >>
> >> A less obvious one is the ZIP which actually could be broken into a
> >> small set of foreign lookup tables.  The number of ZIPs is limited and
> >> only certain ZIPs exist in certain states because the US is divided into
> >> 10 regions.  So you could have at least one table that matched a ZIP
> >> with a state and use that in the address book instead.  Then you store
> >> the ZIP only once, the state is already matched to the ZIP and thus the
> >> state is only stored once.  The address book then only needs to point to
> >> the ZIP table and the ZIP table maps to the state so you can recreate
> >> the state information without having to have a "state" column in the
> >> address book.
> >>
> >> By example:
> >> ZIP_State_Table:
> >> ID  ZIP      State
> >> 1   12345    1
> >> 2   12346    1
> >> 3   23456    2
> >> 4   23457    2
> >>
> >> State_Table:
> >> ID       State
> >> 1        IL
> >> 2        OH
> >>
> >> Now the address book only needs:
> >> Name, Street Address, City, ZIP_ID
> >>
> >> Note that the state field is not required.  You get the state by
> >> chaining the three tables together.  And the ZIP doesn't have to be
> >> explicitly stored, just the ID (so that other postal codes could be used
> >> later such as Canadian).  You could extend this to a table of Cities
> >> since ZIPs encode onto cities and that would eliminate the need to store
> >> the city in the address book.  This is getting closer to Four and Five
> >> NF since you're abstracting a lot of data and really paring down the
> >> original table.  The reality is that most places like Amazon do exactly
> >> this.  You don't need to store everyone's city over and over (and waste
> >> the space doing so) because you know every city.  Just record them once
> >> and point to them.
> >>
> >> The other part of NF is where pointers should go.  This is less an
> >> atomicity of data and more of a logical structuring.  For your database,
> >> all animals need food so the animal table should really carry the food.
> >>  The food table itself is only the NF conversion to minimize data
> >> replication.  Since foods can be eaten by more than one animal, the
> >> current arrangement of your example tables (with animal ID in the food
> >> table) causes unnecessary replication of data (a food gets repeated)
> >> which goes against NF rules.  The map table I suggested goes up another
> >> level of NF because it avoids replication of data in the animal table,
> >> too, since one animal can eat more than one food.  But, of the two
> >> evils, replicating the animal was the lesser evil because it's more
> >> logical in a database sense compared to replicating food mainly because
> >> the food is more likely to be replicated with greater frequency than the
> >> animals (any individual animal might eat less than five food items but
> >> any food item can potentially be eaten by an entire zoo's population so
> >> you have many more potential repetitions of the food than the
> animal).[1]
> >>
> >> In the end there's no real limit to how far you should go in terms of NF
> >> other than whether it makes implementing the database much harder than
> >> needed for the task.  For Amazon, high level NF is crucial otherwise
> >> they'll burn through disk space in a hurry.  For a small DB, maybe not
> >> critical at all.  My personal address book might only be 50 entries so
> >> do I need to spend the time doing it?  Not really, there's no savings
> >> for me except that it's just good practice.
> >>
> >> NF is not hard and fast but it's very useful for making databases work
> >> better.  It also allows you to eventually slice and dice your data in
> >> ways that might otherwise be very difficult to achieve with flat or low
> >> NF databases.  It's harder to create an example of this since easy to
> >> type and follow database structures don't lend themselves to difficult
> >> analysis (by design).  But big research databases are a prime place
> >> where this happens.  A researcher may piece together data in various
> >> ways to extract trends but this would be difficult to do if the database
> >> was a monolithic table or otherwise used a very low order NF that didn't
> >> break out all the data as much as possible.
> >>
> >>
> >> [1] Another way to visualize this is to switch the animals to zoo
> >> visitors (humans) and the food that the visitors eat at the zoo (which
> >> is usually a limited set of park fare like hot dogs, hamburgers, chicken
> >> wings, soda, etc.).  If you had to track every visitor and what they ate
> >> for lunch, are you more likely to put the pointer to food in the visitor
> >> table or a pointer to the visitor in the food table?  If you put the
> >> food pointer in the visitor table, then each visitor might have two or
> >> three records (one visitor bought a soda, a hot dog, and a soft pretzel
> >> so three records for John Doe).  The food table here would still only
> >> have the few items you sell  at the zoo (hot dog, hamburger, chicken,
> >> soda, water, pretzel, ice cream.)
> >>
> >> But how many of those visitors at a busy zoo bought a hot dog?  If the
> >> pointer is reversed and your food table points at a visitor then your
> >> food table might suddenly have 10,000 hot dog entries.
> >>
> >> If your zoo gets 100,000 people a day (big zoo) and every person buys
> >> three food items on average then your day's table of visitors will be
> >> 300,000 records and the food table has just 7 using the first
> >> arrangement (visitor points to food).  If you reverse the pointer (food
> >> points to visitor), your visitor table is 100,000 records (every
> >> visitor) and each item in your food table might have been purchased by a
> >> third of the visitors for a total of over 230,000 records.
> >>
> >> The final tally, visitor points to food is 300,007 records and food
> >> points to visitor is 330,000 records.  Net savings of 30,000 records.
> >>
> >>
> >> On 2015-03-14 14:34, David Jackson wrote:
> >>> Thanks Alex!  Wow.  Very impressive analysis!
> >>>
> >>> This has been for an assignment, so I couldn't change the tables, but
> >> your
> >>> way of gluing the info together makes sense to me, so thanks!  I think
> >>> they're just trying to keep it simple for the class
> >>>
> >>> It looks like for MySQL you want to "atomize" the data as much as
> >> possible
> >>> to keep the structure out of the way of possible relations among the
> >>> data??  I guess it's a judgment call to not get too "atomic" though,
> huh?
> >>>
> >>> Dave
> >>>
> >>>
> >>> On Fri, Mar 13, 2015 at 5:55 PM, Alex Carver <agcarver+ale at acarver.net
> >
> >>> wrote:
> >>>
> >>>> There are four JOINs available to you, LEFT, RIGHT, INNER (this one is
> >>>> usually just the unadorned JOIN keyword) and OUTER.
> >>>>
> >>>> To see if you have an animal with no food and given that your food
> table
> >>>> is on the right of the animal table in the query you offered you need
> to
> >>>> use a LEFT JOIN.  The LEFT JOIN ensures that the left table takes
> >>>> precedence during row generation such that missing matches result in
> >>>> NULLs in the fields from the right-side tables:
> >>>>
> >>>> So change your query to:
> >>>>
> >>>> SELECT id, name, family, feed FROM animal LEFT JOIN food ON animal.id
> =
> >>>> food.animalid;
> >>>>
> >>>> That gives you everything.  Then add a WHERE clause to filter out
> >>>> animals missing food:
> >>>>
> >>>> SELECT id, name, family, feed FROM animal LEFT JOIN food ON animal.id
> =
> >>>> food.animalid WHERE feed IS NULL;
> >>>>
> >>>> The unadorned JOIN (which is a INNER or STRAIGHT JOIN) will only show
> >>>> records where there are matches in all joined tables.  The RIGHT JOIN
> >>>> variant will reverse the precedence of the tables (LEFT and RIGHT
> JOINs
> >>>> work the same, the precedence is reversed and it's mainly used when
> you
> >>>> join many tables so that you get the precedence correct for the needed
> >>>> query).
> >>>>
> >>>> On an aside, the table as constructed isn't quite proper according to
> >>>> the rules of Normal Form (how databases are laid out).  Your food
> table
> >>>> should not have an animal ID.  Instead your animal should have a food
> ID
> >>>> if it can only eat one type of food.  Alternatively there should be a
> >>>> third table which maps animal ID to food ID that way an animal can eat
> >>>> more than one food.
> >>>>
> >>>> Two table variant:
> >>>> Animal: {id, name, family, weight, foodid}
> >>>> Food: {id, feed}
> >>>>
> >>>> Three table variant:
> >>>> Animal: {id, name, family, weight}
> >>>> Food: {id, feed}
> >>>> AnimalFoodMap: {animalid, foodid}
> >>>>
> >>>> In the three table variant you join the animal and food tables using
> the
> >>>> AnimalFoodMap as "glue":
> >>>>
> >>>> SELECT name, family, weight, feed FROM Animal LEFT JOIN AnimalFoodMap
> ON
> >>>> animal.id = AnimalFoodMap.animalid LEFT JOIN Food ON
> >>>> AnimalFoodMap.foodid = Food.id;
> >>>>
> >>>> If an animal has more than one food type, the animal is listed once in
> >>>> the Animal table, the foods are each listed only once in the Food
> table
> >>>> and the AnimalFoodMap table contains the appropriate pairings.  Your
> >>>> example has more than one food per animal and also one repeated food
> so
> >>>> let's modify it:
> >>>>
> >>>> Animal (table):
> >>>> id  name    family   weight
> >>>> 1  Seymore  Snake    10
> >>>> 2  Gerard   Giraffe   120
> >>>>
> >>>> Food (table):
> >>>> id  feed
> >>>> 1   mice
> >>>> 2   leaves
> >>>> 3   grass
> >>>>
> >>>> Now the magic AnimalMapTable:
> >>>> animalid   foodid
> >>>> 1          1
> >>>> 1          2
> >>>> 2          2
> >>>> 2          3
> >>>>
> >>>> (Note that I'm skipping over some things such as keys that should be
> >>>> used to ensure there are no duplicates)
> >>>>
> >>>> After doing this there are several additional ways to write up queries
> >>>> to make an output similar to what you have (single line for an animal
> >>>> plus all the entries for the foods) which requires some more
> complicated
> >>>> query structure.  But for now, the query I wrote above would output:
> >>>>
> >>>> id name      family  feed
> >>>> 1  Seymore   Snake   mice
> >>>> 1  Seymore   Snake   leaves
> >>>> 2  Gerard    Giraffe leaves
> >>>> 2  Gerard    Giraffe grass
> >>>>
> >>>>
> >>>> On 2015-03-13 10:52, David S Jackson wrote:
> >>>>> Hi,
> >>>>>
> >>>>> I'm a MySQL newbie.  I'm just starting to use Python to talk to MySQL
> >>>>> databases, but first I need to understand the MySQL query language!
> >>>>>
> >>>>>
> >>>>> So, I have two tables:  (zoo) animal and food.
> >>>>>
> >>>>> animal has the columns ID, NAME, FAMILY, WEIGHT
> >>>>>
> >>>>> food has  ID, ANIMALID, FEED
> >>>>>
> >>>>>
> >>>>> So if I go: select id, name, family, feed from animal JOIN food ON
> >>>>> animal.id=food.animalid;
> >>>>>
> >>>>>
> >>>>> I get something like:
> >>>>>
> >>>>> ID   NAME        FAMILY      FEED
> >>>>> 1    Seymore   Snake        mice, leaves
> >>>>> 2    Gerard      Giraffe       leaves, grass
> >>>>> ...
> >>>>> etc
> >>>>>
> >>>>> So, if I have a situation where I want to see whether an animal has
> >> been
> >>>>> added the animal table but may have not been added to the food table,
> >>>>> how would I compose that query?
> >>>>>
> >>>>> That is, the animal could have been added to the animal table and
> could
> >>>>> get an id, say 10, but the food table could have 20 foods already
> >>>>> entered and the animal-id would be used several times.
> >>>>>
> >>>>> I was thinking, is there a way I can ask, "does the highest
> animal.id
> >>>>> number equal the highest food.animalid number"?  If not, what animal
> is
> >>>>> not getting fed?
>
> _______________________________________________
> Ale mailing list
> Ale at ale.org
> http://mail.ale.org/mailman/listinfo/ale
> See JOBS, ANNOUNCE and SCHOOLS lists at
> http://mail.ale.org/mailman/listinfo
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.ale.org/pipermail/ale/attachments/20150323/a625c1e9/attachment.html>