[ale] MySQL help

Alex Carver agcarver+ale at acarver.net
Sat Mar 14 19:09:09 EDT 2015


I know it was an assignment so you're limited but it was something to
keep in mind as the construction of that database is not the best
example of a proper design.

The rule isn't for MySQL specifically, it's for database design in
general and is called Normal Form.  It applies to all relational
databases independent of the engines that support them (MySQL, Postgres,
etc.)  The concept isn't that the data is atomic more that it is not
repeated unnecessarily which maximizes storage (this is still a critical
thing for hypergiant databases used by NASA, Google, Amazon, etc.) and
also allows better query structure and operation.

There are multiple levels of Normal Form starting with Zero, which has
no optimizations, going up to Five, which has the most (and sometimes
very complicated) optimizations (there are sometimes "extra" unofficial
levels referenced above Five but they dont' make much sense.)  Where you
stop all depends on how easy it is to visualize the database structure
at the higher NF levels (so that it can be implemented) and whether the
data is a natural fit.  In most cases you end up with a hybrid database
where most is at some NF level and small portions are at a lower level.

The most common textbook examples for addressing NF start off with basic
address books/contacts lists.  You start with an address book with a
person's name and postal address:

John Smith  123 Main St, Springville, IL 12345
Jane Doe    456 Side St, Springhills, IL 12345
Jim  Johns  789 Road St, Springvalley, IL 12345
Jill Dale   123 High Rd, Fallglen, OH 23456
Joe  Jims   456 Low Rd,  Fallcrest, OH 23456

In a zero NF database, these would be recorded one at a time, lots of
repeated information (obviously the ZIP and state).  So it's obvious
that there is a minimization that can occur with the state at a minimum
by breaking it out into a table and then relating the address book table
with the state table.  Now when you store a state, you only have an
integer between 0 and 49 (assuming only the 50 states) and integer
storage is more efficient than a string of two characters in most
database engines.

A less obvious one is the ZIP which actually could be broken into a
small set of foreign lookup tables.  The number of ZIPs is limited and
only certain ZIPs exist in certain states because the US is divided into
10 regions.  So you could have at least one table that matched a ZIP
with a state and use that in the address book instead.  Then you store
the ZIP only once, the state is already matched to the ZIP and thus the
state is only stored once.  The address book then only needs to point to
the ZIP table and the ZIP table maps to the state so you can recreate
the state information without having to have a "state" column in the
address book.

By example:
ZIP_State_Table:
ID  ZIP      State
1   12345    1
2   12346    1
3   23456    2
4   23457    2

State_Table:
ID       State
1        IL
2        OH

Now the address book only needs:
Name, Street Address, City, ZIP_ID

Note that the state field is not required.  You get the state by
chaining the three tables together.  And the ZIP doesn't have to be
explicitly stored, just the ID (so that other postal codes could be used
later such as Canadian).  You could extend this to a table of Cities
since ZIPs encode onto cities and that would eliminate the need to store
the city in the address book.  This is getting closer to Four and Five
NF since you're abstracting a lot of data and really paring down the
original table.  The reality is that most places like Amazon do exactly
this.  You don't need to store everyone's city over and over (and waste
the space doing so) because you know every city.  Just record them once
and point to them.

The other part of NF is where pointers should go.  This is less an
atomicity of data and more of a logical structuring.  For your database,
all animals need food so the animal table should really carry the food.
 The food table itself is only the NF conversion to minimize data
replication.  Since foods can be eaten by more than one animal, the
current arrangement of your example tables (with animal ID in the food
table) causes unnecessary replication of data (a food gets repeated)
which goes against NF rules.  The map table I suggested goes up another
level of NF because it avoids replication of data in the animal table,
too, since one animal can eat more than one food.  But, of the two
evils, replicating the animal was the lesser evil because it's more
logical in a database sense compared to replicating food mainly because
the food is more likely to be replicated with greater frequency than the
animals (any individual animal might eat less than five food items but
any food item can potentially be eaten by an entire zoo's population so
you have many more potential repetitions of the food than the animal).[1]

In the end there's no real limit to how far you should go in terms of NF
other than whether it makes implementing the database much harder than
needed for the task.  For Amazon, high level NF is crucial otherwise
they'll burn through disk space in a hurry.  For a small DB, maybe not
critical at all.  My personal address book might only be 50 entries so
do I need to spend the time doing it?  Not really, there's no savings
for me except that it's just good practice.

NF is not hard and fast but it's very useful for making databases work
better.  It also allows you to eventually slice and dice your data in
ways that might otherwise be very difficult to achieve with flat or low
NF databases.  It's harder to create an example of this since easy to
type and follow database structures don't lend themselves to difficult
analysis (by design).  But big research databases are a prime place
where this happens.  A researcher may piece together data in various
ways to extract trends but this would be difficult to do if the database
was a monolithic table or otherwise used a very low order NF that didn't
break out all the data as much as possible.


[1] Another way to visualize this is to switch the animals to zoo
visitors (humans) and the food that the visitors eat at the zoo (which
is usually a limited set of park fare like hot dogs, hamburgers, chicken
wings, soda, etc.).  If you had to track every visitor and what they ate
for lunch, are you more likely to put the pointer to food in the visitor
table or a pointer to the visitor in the food table?  If you put the
food pointer in the visitor table, then each visitor might have two or
three records (one visitor bought a soda, a hot dog, and a soft pretzel
so three records for John Doe).  The food table here would still only
have the few items you sell  at the zoo (hot dog, hamburger, chicken,
soda, water, pretzel, ice cream.)

But how many of those visitors at a busy zoo bought a hot dog?  If the
pointer is reversed and your food table points at a visitor then your
food table might suddenly have 10,000 hot dog entries.

If your zoo gets 100,000 people a day (big zoo) and every person buys
three food items on average then your day's table of visitors will be
300,000 records and the food table has just 7 using the first
arrangement (visitor points to food).  If you reverse the pointer (food
points to visitor), your visitor table is 100,000 records (every
visitor) and each item in your food table might have been purchased by a
third of the visitors for a total of over 230,000 records.

The final tally, visitor points to food is 300,007 records and food
points to visitor is 330,000 records.  Net savings of 30,000 records.


On 2015-03-14 14:34, David Jackson wrote:
> Thanks Alex!  Wow.  Very impressive analysis!
> 
> This has been for an assignment, so I couldn't change the tables, but your
> way of gluing the info together makes sense to me, so thanks!  I think
> they're just trying to keep it simple for the class
> 
> It looks like for MySQL you want to "atomize" the data as much as possible
> to keep the structure out of the way of possible relations among the
> data??  I guess it's a judgment call to not get too "atomic" though, huh?
> 
> Dave
> 
> 
> On Fri, Mar 13, 2015 at 5:55 PM, Alex Carver <agcarver+ale at acarver.net>
> wrote:
> 
>> There are four JOINs available to you, LEFT, RIGHT, INNER (this one is
>> usually just the unadorned JOIN keyword) and OUTER.
>>
>> To see if you have an animal with no food and given that your food table
>> is on the right of the animal table in the query you offered you need to
>> use a LEFT JOIN.  The LEFT JOIN ensures that the left table takes
>> precedence during row generation such that missing matches result in
>> NULLs in the fields from the right-side tables:
>>
>> So change your query to:
>>
>> SELECT id, name, family, feed FROM animal LEFT JOIN food ON animal.id =
>> food.animalid;
>>
>> That gives you everything.  Then add a WHERE clause to filter out
>> animals missing food:
>>
>> SELECT id, name, family, feed FROM animal LEFT JOIN food ON animal.id =
>> food.animalid WHERE feed IS NULL;
>>
>> The unadorned JOIN (which is a INNER or STRAIGHT JOIN) will only show
>> records where there are matches in all joined tables.  The RIGHT JOIN
>> variant will reverse the precedence of the tables (LEFT and RIGHT JOINs
>> work the same, the precedence is reversed and it's mainly used when you
>> join many tables so that you get the precedence correct for the needed
>> query).
>>
>> On an aside, the table as constructed isn't quite proper according to
>> the rules of Normal Form (how databases are laid out).  Your food table
>> should not have an animal ID.  Instead your animal should have a food ID
>> if it can only eat one type of food.  Alternatively there should be a
>> third table which maps animal ID to food ID that way an animal can eat
>> more than one food.
>>
>> Two table variant:
>> Animal: {id, name, family, weight, foodid}
>> Food: {id, feed}
>>
>> Three table variant:
>> Animal: {id, name, family, weight}
>> Food: {id, feed}
>> AnimalFoodMap: {animalid, foodid}
>>
>> In the three table variant you join the animal and food tables using the
>> AnimalFoodMap as "glue":
>>
>> SELECT name, family, weight, feed FROM Animal LEFT JOIN AnimalFoodMap ON
>> animal.id = AnimalFoodMap.animalid LEFT JOIN Food ON
>> AnimalFoodMap.foodid = Food.id;
>>
>> If an animal has more than one food type, the animal is listed once in
>> the Animal table, the foods are each listed only once in the Food table
>> and the AnimalFoodMap table contains the appropriate pairings.  Your
>> example has more than one food per animal and also one repeated food so
>> let's modify it:
>>
>> Animal (table):
>> id  name    family   weight
>> 1  Seymore  Snake    10
>> 2  Gerard   Giraffe   120
>>
>> Food (table):
>> id  feed
>> 1   mice
>> 2   leaves
>> 3   grass
>>
>> Now the magic AnimalMapTable:
>> animalid   foodid
>> 1          1
>> 1          2
>> 2          2
>> 2          3
>>
>> (Note that I'm skipping over some things such as keys that should be
>> used to ensure there are no duplicates)
>>
>> After doing this there are several additional ways to write up queries
>> to make an output similar to what you have (single line for an animal
>> plus all the entries for the foods) which requires some more complicated
>> query structure.  But for now, the query I wrote above would output:
>>
>> id name      family  feed
>> 1  Seymore   Snake   mice
>> 1  Seymore   Snake   leaves
>> 2  Gerard    Giraffe leaves
>> 2  Gerard    Giraffe grass
>>
>>
>> On 2015-03-13 10:52, David S Jackson wrote:
>>> Hi,
>>>
>>> I'm a MySQL newbie.  I'm just starting to use Python to talk to MySQL
>>> databases, but first I need to understand the MySQL query language!
>>>
>>>
>>> So, I have two tables:  (zoo) animal and food.
>>>
>>> animal has the columns ID, NAME, FAMILY, WEIGHT
>>>
>>> food has  ID, ANIMALID, FEED
>>>
>>>
>>> So if I go: select id, name, family, feed from animal JOIN food ON
>>> animal.id=food.animalid;
>>>
>>>
>>> I get something like:
>>>
>>> ID   NAME        FAMILY      FEED
>>> 1    Seymore   Snake        mice, leaves
>>> 2    Gerard      Giraffe       leaves, grass
>>> ...
>>> etc
>>>
>>> So, if I have a situation where I want to see whether an animal has been
>>> added the animal table but may have not been added to the food table,
>>> how would I compose that query?
>>>
>>> That is, the animal could have been added to the animal table and could
>>> get an id, say 10, but the food table could have 20 foods already
>>> entered and the animal-id would be used several times.
>>>
>>> I was thinking, is there a way I can ask, "does the highest animal.id
>>> number equal the highest food.animalid number"?  If not, what animal is
>>> not getting fed?
>>
>> _______________________________________________
>> Ale mailing list
>> Ale at ale.org
>> http://mail.ale.org/mailman/listinfo/ale
>> See JOBS, ANNOUNCE and SCHOOLS lists at
>> http://mail.ale.org/mailman/listinfo
>>
> 
> 
> 
> _______________________________________________
> Ale mailing list
> Ale at ale.org
> http://mail.ale.org/mailman/listinfo/ale
> See JOBS, ANNOUNCE and SCHOOLS lists at
> http://mail.ale.org/mailman/listinfo
> 



More information about the Ale mailing list