[ale] MySQL help
Charles Shapiro
hooterpincher at gmail.com
Mon Mar 16 09:51:22 EDT 2015
'Ray Alex! Your explanation is spot-on.
-- CHS
On Sat, Mar 14, 2015 at 7:09 PM, Alex Carver <agcarver+ale at acarver.net>
wrote:
> I know it was an assignment so you're limited but it was something to
> keep in mind as the construction of that database is not the best
> example of a proper design.
>
> The rule isn't for MySQL specifically, it's for database design in
> general and is called Normal Form. It applies to all relational
> databases independent of the engines that support them (MySQL, Postgres,
> etc.) The concept isn't that the data is atomic more that it is not
> repeated unnecessarily which maximizes storage (this is still a critical
> thing for hypergiant databases used by NASA, Google, Amazon, etc.) and
> also allows better query structure and operation.
>
> There are multiple levels of Normal Form starting with Zero, which has
> no optimizations, going up to Five, which has the most (and sometimes
> very complicated) optimizations (there are sometimes "extra" unofficial
> levels referenced above Five but they dont' make much sense.) Where you
> stop all depends on how easy it is to visualize the database structure
> at the higher NF levels (so that it can be implemented) and whether the
> data is a natural fit. In most cases you end up with a hybrid database
> where most is at some NF level and small portions are at a lower level.
>
> The most common textbook examples for addressing NF start off with basic
> address books/contacts lists. You start with an address book with a
> person's name and postal address:
>
> John Smith 123 Main St, Springville, IL 12345
> Jane Doe 456 Side St, Springhills, IL 12345
> Jim Johns 789 Road St, Springvalley, IL 12345
> Jill Dale 123 High Rd, Fallglen, OH 23456
> Joe Jims 456 Low Rd, Fallcrest, OH 23456
>
> In a zero NF database, these would be recorded one at a time, lots of
> repeated information (obviously the ZIP and state). So it's obvious
> that there is a minimization that can occur with the state at a minimum
> by breaking it out into a table and then relating the address book table
> with the state table. Now when you store a state, you only have an
> integer between 0 and 49 (assuming only the 50 states) and integer
> storage is more efficient than a string of two characters in most
> database engines.
>
> A less obvious one is the ZIP which actually could be broken into a
> small set of foreign lookup tables. The number of ZIPs is limited and
> only certain ZIPs exist in certain states because the US is divided into
> 10 regions. So you could have at least one table that matched a ZIP
> with a state and use that in the address book instead. Then you store
> the ZIP only once, the state is already matched to the ZIP and thus the
> state is only stored once. The address book then only needs to point to
> the ZIP table and the ZIP table maps to the state so you can recreate
> the state information without having to have a "state" column in the
> address book.
>
> By example:
> ZIP_State_Table:
> ID ZIP State
> 1 12345 1
> 2 12346 1
> 3 23456 2
> 4 23457 2
>
> State_Table:
> ID State
> 1 IL
> 2 OH
>
> Now the address book only needs:
> Name, Street Address, City, ZIP_ID
>
> Note that the state field is not required. You get the state by
> chaining the three tables together. And the ZIP doesn't have to be
> explicitly stored, just the ID (so that other postal codes could be used
> later such as Canadian). You could extend this to a table of Cities
> since ZIPs encode onto cities and that would eliminate the need to store
> the city in the address book. This is getting closer to Four and Five
> NF since you're abstracting a lot of data and really paring down the
> original table. The reality is that most places like Amazon do exactly
> this. You don't need to store everyone's city over and over (and waste
> the space doing so) because you know every city. Just record them once
> and point to them.
>
> The other part of NF is where pointers should go. This is less an
> atomicity of data and more of a logical structuring. For your database,
> all animals need food so the animal table should really carry the food.
> The food table itself is only the NF conversion to minimize data
> replication. Since foods can be eaten by more than one animal, the
> current arrangement of your example tables (with animal ID in the food
> table) causes unnecessary replication of data (a food gets repeated)
> which goes against NF rules. The map table I suggested goes up another
> level of NF because it avoids replication of data in the animal table,
> too, since one animal can eat more than one food. But, of the two
> evils, replicating the animal was the lesser evil because it's more
> logical in a database sense compared to replicating food mainly because
> the food is more likely to be replicated with greater frequency than the
> animals (any individual animal might eat less than five food items but
> any food item can potentially be eaten by an entire zoo's population so
> you have many more potential repetitions of the food than the animal).[1]
>
> In the end there's no real limit to how far you should go in terms of NF
> other than whether it makes implementing the database much harder than
> needed for the task. For Amazon, high level NF is crucial otherwise
> they'll burn through disk space in a hurry. For a small DB, maybe not
> critical at all. My personal address book might only be 50 entries so
> do I need to spend the time doing it? Not really, there's no savings
> for me except that it's just good practice.
>
> NF is not hard and fast but it's very useful for making databases work
> better. It also allows you to eventually slice and dice your data in
> ways that might otherwise be very difficult to achieve with flat or low
> NF databases. It's harder to create an example of this since easy to
> type and follow database structures don't lend themselves to difficult
> analysis (by design). But big research databases are a prime place
> where this happens. A researcher may piece together data in various
> ways to extract trends but this would be difficult to do if the database
> was a monolithic table or otherwise used a very low order NF that didn't
> break out all the data as much as possible.
>
>
> [1] Another way to visualize this is to switch the animals to zoo
> visitors (humans) and the food that the visitors eat at the zoo (which
> is usually a limited set of park fare like hot dogs, hamburgers, chicken
> wings, soda, etc.). If you had to track every visitor and what they ate
> for lunch, are you more likely to put the pointer to food in the visitor
> table or a pointer to the visitor in the food table? If you put the
> food pointer in the visitor table, then each visitor might have two or
> three records (one visitor bought a soda, a hot dog, and a soft pretzel
> so three records for John Doe). The food table here would still only
> have the few items you sell at the zoo (hot dog, hamburger, chicken,
> soda, water, pretzel, ice cream.)
>
> But how many of those visitors at a busy zoo bought a hot dog? If the
> pointer is reversed and your food table points at a visitor then your
> food table might suddenly have 10,000 hot dog entries.
>
> If your zoo gets 100,000 people a day (big zoo) and every person buys
> three food items on average then your day's table of visitors will be
> 300,000 records and the food table has just 7 using the first
> arrangement (visitor points to food). If you reverse the pointer (food
> points to visitor), your visitor table is 100,000 records (every
> visitor) and each item in your food table might have been purchased by a
> third of the visitors for a total of over 230,000 records.
>
> The final tally, visitor points to food is 300,007 records and food
> points to visitor is 330,000 records. Net savings of 30,000 records.
>
>
> On 2015-03-14 14:34, David Jackson wrote:
> > Thanks Alex! Wow. Very impressive analysis!
> >
> > This has been for an assignment, so I couldn't change the tables, but
> your
> > way of gluing the info together makes sense to me, so thanks! I think
> > they're just trying to keep it simple for the class
> >
> > It looks like for MySQL you want to "atomize" the data as much as
> possible
> > to keep the structure out of the way of possible relations among the
> > data?? I guess it's a judgment call to not get too "atomic" though, huh?
> >
> > Dave
> >
> >
> > On Fri, Mar 13, 2015 at 5:55 PM, Alex Carver <agcarver+ale at acarver.net>
> > wrote:
> >
> >> There are four JOINs available to you, LEFT, RIGHT, INNER (this one is
> >> usually just the unadorned JOIN keyword) and OUTER.
> >>
> >> To see if you have an animal with no food and given that your food table
> >> is on the right of the animal table in the query you offered you need to
> >> use a LEFT JOIN. The LEFT JOIN ensures that the left table takes
> >> precedence during row generation such that missing matches result in
> >> NULLs in the fields from the right-side tables:
> >>
> >> So change your query to:
> >>
> >> SELECT id, name, family, feed FROM animal LEFT JOIN food ON animal.id =
> >> food.animalid;
> >>
> >> That gives you everything. Then add a WHERE clause to filter out
> >> animals missing food:
> >>
> >> SELECT id, name, family, feed FROM animal LEFT JOIN food ON animal.id =
> >> food.animalid WHERE feed IS NULL;
> >>
> >> The unadorned JOIN (which is a INNER or STRAIGHT JOIN) will only show
> >> records where there are matches in all joined tables. The RIGHT JOIN
> >> variant will reverse the precedence of the tables (LEFT and RIGHT JOINs
> >> work the same, the precedence is reversed and it's mainly used when you
> >> join many tables so that you get the precedence correct for the needed
> >> query).
> >>
> >> On an aside, the table as constructed isn't quite proper according to
> >> the rules of Normal Form (how databases are laid out). Your food table
> >> should not have an animal ID. Instead your animal should have a food ID
> >> if it can only eat one type of food. Alternatively there should be a
> >> third table which maps animal ID to food ID that way an animal can eat
> >> more than one food.
> >>
> >> Two table variant:
> >> Animal: {id, name, family, weight, foodid}
> >> Food: {id, feed}
> >>
> >> Three table variant:
> >> Animal: {id, name, family, weight}
> >> Food: {id, feed}
> >> AnimalFoodMap: {animalid, foodid}
> >>
> >> In the three table variant you join the animal and food tables using the
> >> AnimalFoodMap as "glue":
> >>
> >> SELECT name, family, weight, feed FROM Animal LEFT JOIN AnimalFoodMap ON
> >> animal.id = AnimalFoodMap.animalid LEFT JOIN Food ON
> >> AnimalFoodMap.foodid = Food.id;
> >>
> >> If an animal has more than one food type, the animal is listed once in
> >> the Animal table, the foods are each listed only once in the Food table
> >> and the AnimalFoodMap table contains the appropriate pairings. Your
> >> example has more than one food per animal and also one repeated food so
> >> let's modify it:
> >>
> >> Animal (table):
> >> id name family weight
> >> 1 Seymore Snake 10
> >> 2 Gerard Giraffe 120
> >>
> >> Food (table):
> >> id feed
> >> 1 mice
> >> 2 leaves
> >> 3 grass
> >>
> >> Now the magic AnimalMapTable:
> >> animalid foodid
> >> 1 1
> >> 1 2
> >> 2 2
> >> 2 3
> >>
> >> (Note that I'm skipping over some things such as keys that should be
> >> used to ensure there are no duplicates)
> >>
> >> After doing this there are several additional ways to write up queries
> >> to make an output similar to what you have (single line for an animal
> >> plus all the entries for the foods) which requires some more complicated
> >> query structure. But for now, the query I wrote above would output:
> >>
> >> id name family feed
> >> 1 Seymore Snake mice
> >> 1 Seymore Snake leaves
> >> 2 Gerard Giraffe leaves
> >> 2 Gerard Giraffe grass
> >>
> >>
> >> On 2015-03-13 10:52, David S Jackson wrote:
> >>> Hi,
> >>>
> >>> I'm a MySQL newbie. I'm just starting to use Python to talk to MySQL
> >>> databases, but first I need to understand the MySQL query language!
> >>>
> >>>
> >>> So, I have two tables: (zoo) animal and food.
> >>>
> >>> animal has the columns ID, NAME, FAMILY, WEIGHT
> >>>
> >>> food has ID, ANIMALID, FEED
> >>>
> >>>
> >>> So if I go: select id, name, family, feed from animal JOIN food ON
> >>> animal.id=food.animalid;
> >>>
> >>>
> >>> I get something like:
> >>>
> >>> ID NAME FAMILY FEED
> >>> 1 Seymore Snake mice, leaves
> >>> 2 Gerard Giraffe leaves, grass
> >>> ...
> >>> etc
> >>>
> >>> So, if I have a situation where I want to see whether an animal has
> been
> >>> added the animal table but may have not been added to the food table,
> >>> how would I compose that query?
> >>>
> >>> That is, the animal could have been added to the animal table and could
> >>> get an id, say 10, but the food table could have 20 foods already
> >>> entered and the animal-id would be used several times.
> >>>
> >>> I was thinking, is there a way I can ask, "does the highest animal.id
> >>> number equal the highest food.animalid number"? If not, what animal is
> >>> not getting fed?
> >>
> >> _______________________________________________
> >> Ale mailing list
> >> Ale at ale.org
> >> http://mail.ale.org/mailman/listinfo/ale
> >> See JOBS, ANNOUNCE and SCHOOLS lists at
> >> http://mail.ale.org/mailman/listinfo
> >>
> >
> >
> >
> > _______________________________________________
> > Ale mailing list
> > Ale at ale.org
> > http://mail.ale.org/mailman/listinfo/ale
> > See JOBS, ANNOUNCE and SCHOOLS lists at
> > http://mail.ale.org/mailman/listinfo
> >
>
> _______________________________________________
> Ale mailing list
> Ale at ale.org
> http://mail.ale.org/mailman/listinfo/ale
> See JOBS, ANNOUNCE and SCHOOLS lists at
> http://mail.ale.org/mailman/listinfo
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.ale.org/pipermail/ale/attachments/20150316/7d3a8095/attachment.html>
More information about the Ale
mailing list