Posts Tagged ‘Shannon-Weaver Information theory’

George A. Miller, a Pioneer in Cognitive Psychology, Is Dead at 92

Larry H. Bernstein, MD, FCAP, Curator

Leaders in Pharmaceutical Intelligence

Series E. 2; 5.10

5.10 George A. Miller, a Pioneer in Cognitive Psychology, Is Dead at 92



Miller started his education focusing on speech and language and published papers on these topics, focusing on mathematical, computational and psychological aspects of the field. He started his career at a time when the reigning theory in psychology was behaviorism, which eschewed any attempt to study mental processes and focused only on observable behavior. Working mostly at Harvard University, MIT and Princeton University, Miller introduced experimental techniques to study the psychology of mental processes, by linking the new field of cognitive psychology to the broader area of cognitive science, including computation theory and linguistics. He collaborated and co-authored work with other figures in cognitive science and psycholinguistics, such as Noam Chomsky. For moving psychology into the realm of mental processes and for aligning that move with information theory, computation theory, and linguistics, Miller is considered one of the great twentieth-century psychologists. A Review of General Psychology survey, published in 2002, ranked Miller as the 20th most cited psychologist of that era.[2]

Remembering George A. Miller

The human mind works a lot like a computer: It collects, saves, modifies, and retrieves information. George A. Miller, one of the founders of cognitive psychology, was a pioneer who recognized that the human mind can be understood using an information-processing model. His insights helped move psychological research beyond behaviorist methods that dominated the field through the 1950s. In 1991, he was awarded the National Medal of Science for his significant contributions to our understanding of the human mind.


Working memory

From the days of William James, psychologists had the idea memory consisted of short-term and long-term memory. While short-term memory was expected to be limited, its exact limits were not known. In 1956, Miller would quantify its capacity limit in the paper “The magical number seven, plus or minus two”. He tested immediate memory via tasks such as asking a person to repeat a set of digits presented; absolute judgment by presenting a stimulus and a label, and asking them to recall the label later; and span of attention by asking them to count things in a group of more than a few items quickly. For all three cases, Miller found the average limit to be seven items. He had mixed feelings about the focus on his work on the exact number seven for quantifying short-term memory, and felt it had been misquoted often. He stated, introducing the paper on the research for the first time, that he was being persecuted by an integer.[1] Miller also found humans remembered chunks of information, interrelating bits using some scheme, and the limit applied to chunks. Miller himself saw no relationship among the disparate tasks of immediate memory and absolute judgment, but lumped them to fill a one-hour presentation. The results influenced the budding field of cognitive psychology.[15]


For many years starting from 1986, Miller directed the development of WordNet, a large computer-readable electronic reference usable in applications such as search engines.[12] WordNet is a dictionary of words showing their linkages by meaning. Its fundamental building block is a synset, which is a collection of synonyms representing a concept or idea. Words can be in multiple synsets. The entire class of synsets is grouped into nouns, verbs, adjectives and adverbs separately, with links existing only within these four major groups but not between them. Going beyond a thesaurus, WordNet also included inter-word relationships such as part/whole relationships and hierarchies of inclusion.[16] Miller and colleagues had planned the tool to test psycholinguistic theories on how humans use and understand words.[17] Miller also later worked closely with the developers at Simpli.com Inc. on a meaning-based keyword search engine based on WordNet.[18]
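The structure described above (synsets, words that belong to several synsets, and links that stay inside one part-of-speech group) can be sketched as a small data structure. This is only an illustration: the `Synset` class, the helper functions, and the three toy entries are hypothetical and are not taken from WordNet's actual database or API.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Synset:
    """One concept: a set of synonymous words within one part of speech."""
    name: str
    pos: str                      # "noun", "verb", "adjective" or "adverb"
    lemmas: Set[str]
    # relation name (e.g. "hypernym", "part_of") -> linked synsets
    relations: Dict[str, List["Synset"]] = field(default_factory=dict)

    def link(self, relation: str, other: "Synset") -> None:
        # Links exist only within a part-of-speech group, never between groups.
        if other.pos != self.pos:
            raise ValueError("links stay within one part-of-speech group")
        self.relations.setdefault(relation, []).append(other)


def synsets_for(word: str, lexicon: List[Synset]) -> List[Synset]:
    """Return every synset containing the word (a word can be in several)."""
    return [s for s in lexicon if word in s.lemmas]


def hypernym_chain(synset: Synset) -> List[str]:
    """Follow the first "hypernym" link upward through the inclusion hierarchy."""
    chain = []
    current = synset
    while current.relations.get("hypernym"):
        current = current.relations["hypernym"][0]
        chain.append(current.name)
    return chain


# Hypothetical toy entries, for illustration only.
bank_money = Synset("bank (financial institution)", "noun", {"bank", "depository"})
bank_river = Synset("bank (sloping land)", "noun", {"bank", "riverbank"})
institution = Synset("institution", "noun", {"institution", "establishment"})
bank_money.link("hypernym", institution)

lexicon = [bank_money, bank_river, institution]
print([s.name for s in synsets_for("bank", lexicon)])
print(hypernym_chain(bank_money))
```

Storing relations in a dictionary keyed by relation name keeps part/whole links and inclusion hierarchies in one place; a real lexical database would need indexing and cycle checks that this sketch omits.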

Language psychology and computation

Miller is considered one of the founders of psycholinguistics, which links language and cognition in psychology, to analyze how people use and create language.[1] His 1951 book Language and Communication is considered seminal in the field.[5] His later book, The Science of Words (1991) also focused on language psychology.[19] He published papers along with Noam Chomsky on the mathematics and computational aspects of language and its syntax, two new areas of study.[20][21][22] Miller also researched how people understood words and sentences, the same problem faced by artificial speech-recognition technology. The book Plans and the Structure of Behavior (1960), written with Eugene Galanter and Karl H. Pribram, explored how humans plan and act, trying to extrapolate this to how a robot could be programmed to plan and do things.[1] Miller is also known for coining Miller’s Law: “In order to understand what another person is saying, you must assume it is true and try to imagine what it could be true of”.[23]

Language and Communication, 1951

Miller’s Language and Communication was one of the first significant texts in the study of language behavior. The book was a scientific study of language, emphasizing quantitative data, and was based on the mathematical model of Claude Shannon’s information theory.[24] It used a probabilistic model imposed on a learning-by-association scheme borrowed from behaviorism, with Miller not yet attached to a pure cognitive perspective.[25] The first part of the book reviewed information theory, the physiology and acoustics of phonetics, speech recognition and comprehension, and statistical techniques to analyze language.[24] The focus was more on speech generation than recognition.[25] The second part had the psychology: idiosyncratic differences across people in language use; developmental linguistics; the structure of word associations in people; use of symbolism in language; and social aspects of language use.[24]

Reviewing the book, Charles E. Osgood classified the book as a graduate-level text based more on objective facts than on theoretical constructs. He thought the book was verbose on some topics and too brief on others not directly related to the author’s expertise area. He was also critical of Miller’s use of simple, Skinnerian single-stage stimulus-response learning to explain human language acquisition and use. This approach, per Osgood, made it impossible to analyze the concept of meaning, and the idea of language consisting of representational signs. He did find the book objective in its emphasis on facts over theory, and depicting clearly application of information theory to psychology.[24]

Plans and the Structure of Behavior, 1960

In Plans and the Structure of Behavior, Miller and his co-authors tried to explain through an artificial-intelligence computational perspective how animals plan and act.[26] This was a radical break from behaviorism, which explained behavior as a set or sequence of stimulus-response actions. The authors introduced a planning element controlling such actions.[27] They saw all plans as being executed based on input, using stored or inherited information about the environment (called the image) and a strategy called test-operate-test-exit (TOTE). The image was essentially a stored memory of all past context, akin to Tolman’s cognitive map. The TOTE strategy, in its initial test phase, compared the input against the image; if there was incongruity, the operate function attempted to reduce it. This cycle would be repeated till the incongruity vanished, and then the exit function would be invoked, passing control to another TOTE unit in a hierarchically arranged scheme.[26]
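The test-operate-test-exit cycle described above can be sketched as a short loop. This is a minimal sketch under stated assumptions, not the authors' formalism: the function `tote`, its parameters, the `max_cycles` guard, and the numeric example (a height reduced by one unit per operation until it matches the image) are all hypothetical.

```python
from typing import Callable


def tote(state: float,
         image: float,
         test: Callable[[float, float], bool],
         operate: Callable[[float, float], float],
         max_cycles: int = 1000) -> float:
    """Run one TOTE unit and return the state at exit.

    test(state, image)    -> True when the input is congruent with the image
    operate(state, image) -> a new state intended to reduce the incongruity
    """
    for _ in range(max_cycles):
        if test(state, image):           # Test: no incongruity left
            return state                 # Exit: hand control to the next unit
        state = operate(state, image)    # Operate, then test again
    raise RuntimeError("incongruity was not resolved within max_cycles")


# Hypothetical example: lower a height by one unit per operation.
def is_congruent(height: float, target: float) -> bool:
    return height <= target


def lower_by_one(height: float, target: float) -> float:
    return height - 1


final_height = tote(3, 0, is_congruent, lower_by_one)
print(final_height)
```

A hierarchy of plans could be approximated by letting `operate` itself call `tote` on a sub-goal, which is one way to read "passing control to another TOTE unit".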

Peter Milner, in a review in the Canadian Journal of Psychology, noted the book was short on concrete details on implementing the TOTE strategy. He also critically viewed the book as not being able to tie its model to details from neurophysiology at a molecular level. Per him, the book covered only the brain at the gross level of lesion studies, showing that some of its regions could possibly implement some TOTE strategies, without giving a reader an indication as to how the region could implement the strategy.[26]

The Psychology of Communication, 1967

Miller’s 1967 work, The Psychology of Communication, was a collection of seven previously published articles. The first “Information and Memory” dealt with chunking, presenting the idea of separating physical length (the number of items presented to be learned) and psychological length (the number of ideas the recipient manages to categorize and summarize the items with). Capacity of short-term memory was measured in units of psychological length, arguing against a pure behaviorist interpretation since meaning of items, beyond reinforcement and punishment, was central to psychological length.[28]

The second essay was the paper on the magical number seven. The third, “The human link in communication systems,” used information theory and its idea of channel capacity to analyze human perception bandwidth. The essay concluded that how much of what impinges on us we can absorb as knowledge is limited, for each property of the stimulus, to a handful of items.[28] The paper on “Psycholinguists” described how the effort of speaking or understanding a sentence depended on how much self-embedding (structures nested inside similar structures) appeared when the sentence was broken down into clauses and phrases.[29] The book, in general, used the Chomskian view of seeing language rules of grammar as having a biological basis, disproving the simple behaviorist idea that language performance improved with reinforcement, and used the tools of information and computation to place hypotheses on a sound theoretical framework and to analyze data practically and efficiently. Miller specifically addressed experimental data refuting the behaviorist framework at concept level in the field of language and cognition. He noted this only qualified behaviorism at the level of cognition, and did not overthrow it in other spheres of psychology.[28]


typical changes in CK-MB and cardiac troponin in Acute Myocardial Infarction (Photo credit: Wikipedia)

Reporter and curator:

Larry H Bernstein, MD, FCAP

This posting is a follow-up on two previous posts covering the design and handling of health information technology (HIT) to improve healthcare outcomes as well as lower costs through better workflow and diagnostics, which is self-correcting over time.

The first example is a non-technology method designed by Lee Goldman (the Goldman algorithm) that was later implemented at Cook County Hospital in Chicago with great success. It has been known that there is overtriage of patients to intensive care beds, adding to the costs of medical care. If the differentiation between acute myocardial infarction (AMI) and other causes of chest pain could be made more accurate, the quantity of scarce resources used on unnecessary admissions could be reduced. The Goldman algorithm was introduced in 1982 during a training phase at Yale-New Haven Hospital based on 482 patients, and later validated at the BWH (in Boston) on 468 patients. They demonstrated improvement in sensitivity as well as specificity (67% to 77%) and positive predictive value (34% to 42%). They modified the computer-derived algorithm in 1988 to achieve better results in triage to the ICU of patients with chest pain, based on a study group of 1379 patients. The process was tested prospectively on 4770 patients at two university and four community hospitals. Specificity in recognizing the absence of AMI was 74% for the algorithm vs 71% for physician judgement. The sensitivity was not different for admission (88%). Decisions based solely on the protocol would have decreased admissions of patients without AMI by 11.5% without adverse effects. The study was repeated by Qamar et al. with equal success.
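The three measures quoted above (sensitivity, specificity, positive predictive value) follow from a two-by-two table of test result against final diagnosis. The sketch below shows the standard definitions; the function name and the counts passed to it are hypothetical and are not data from the Goldman or Qamar studies.

```python
from typing import Tuple


def diagnostic_metrics(tp: int, fp: int, fn: int, tn: int) -> Tuple[float, float, float]:
    """Return (sensitivity, specificity, positive predictive value).

    tp: disease present, test positive    fn: disease present, test negative
    fp: disease absent, test positive     tn: disease absent, test negative
    """
    sensitivity = tp / (tp + fn)   # share of diseased patients flagged
    specificity = tn / (tn + fp)   # share of non-diseased patients cleared
    ppv = tp / (tp + fp)           # share of positive calls that are correct
    return sensitivity, specificity, ppv


# Hypothetical counts chosen only to illustrate the arithmetic.
sens, spec, ppv = diagnostic_metrics(tp=40, fp=30, fn=10, tn=120)
print(f"sensitivity={sens:.2f} specificity={spec:.2f} PPV={ppv:.2f}")
```

With a fixed sensitivity, raising specificity is what reduces unnecessary admissions, which is the comparison the paragraph draws between the protocol and physician judgement.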

Pain in acute myocardial infarction (front) (Photo credit: Wikipedia)

An ECG showing pardee waves indicating acute myocardial infarction in the inferior leads II, III and aVF with reciprocal changes in the anterolateral leads. (Photo credit: Wikipedia)

Acute myocardial infarction with coagulative necrosis (4) (Photo credit: Wikipedia)

Goldman L, Cook EF, Brand DA, Lee TH, Rouan GW, Weisberg MC, et al. A computer protocol to predict myocardial infarction in emergency department patients with chest pain. N Engl J Med. 1988;318:797-803.

A Qamar, C McPherson, J Babb, L Bernstein, M Werdmann, D Yasick, S Zarich. The Goldman algorithm revisited: prospective evaluation of a computer-derived algorithm versus unaided physician judgment in suspected acute myocardial infarction. Am Heart J 1999; 138(4 Pt 1):705-709. ICID: 825629

The usual accepted method for determining the decision value of a predictive variable is the receiver operating characteristic (ROC) curve, which requires a mapping of each value of the variable against the percent with disease on the Y-axis. This requires a review of every case entered into the study. An ROC curve was used to validate a study classifying data on leukemia markers for research purposes, as shown by Jay Magidson in his demonstration of Correlated Component Regression (2012) (Statistical Innovations, Inc.). The contribution of each predictor is measured by the Akaike information criterion (AIC) and the Bayesian information criterion (BIC), which have proved to be essential tests over the last 20 years.

I go back 20 years and revisit the application of these principles in clinical diagnostics, but the ROC was introduced to medicine in radiology earlier. A full rendering of this matter can be found in the following:
R A Rudolph, L H Bernstein, J Babb. Information induction for predicting acute myocardial infarction. Clin Chem 1988; 34(10):2031-2038. ICID: 825568.

Rypka EW. Methods to evaluate and develop the decision process in the selection of tests. Clin Lab Med 1992; 12:355

Rypka EW. Syndromic Classification: A process for amplifying information using S-Clustering. Nutrition 1996;12(11/12):827-9.

Christianson R. Foundations of inductive reasoning. 1964.  Entropy Publications. Lincoln, MA.

Inability to classify information is a major problem in deriving and validating hypotheses from PRIMARY data sets necessary to establish a measure of outcome effectiveness. When using quantitative data, decision limits have to be determined that best distinguish the populations investigated. We are concerned with accurate assignment into uniquely verifiable groups by information in test relationships. Uncertainty in assigning to a supervisory classification can only be relieved by providing sufficient data.

A method for examining the endogenous information in the data is used to determine decision points. The reference or null set is defined as a class having no information. When information is present in the data, the entropy (uncertainty in the data set) is reduced by the amount of information provided. This is measurable and may be referred to as the Kullback-Leibler distance, which was extended by Akaike to include statistical theory. An approach using E. W. Rypka’s S-clustering has been devised to find optimal decision values that separate the groups being classified. Further, it is possible to obtain PRIMARY data on-line and continually create primary classifications (learning matrices). From the primary classifications, test-minimized sets of features are determined with optimal useful and sufficient information for accurately distinguishing elements (patients). Primary classifications can be continually created from PRIMARY data. More recent and complex work in classifying hematology data, using a 30,000-patient data set and 16 variables to identify the anemias, moderate SIRS, sepsis, and lymphocytic and platelet disorders, has been published and recently presented. Another classification for malnutrition and stress hypermetabolism is now validated and in press in the journal Nutrition (2012), Elsevier.
G David, LH Bernstein, RR Coifman. Generating Evidence Based Interpretation of Hematology Screens via Anomaly Characterization. Open Clinical Chemistry Journal 2011; 4 (1):10-16. 1874-2416/11 2011 Bentham Open.  ICID: 939928

G David; LH Bernstein; RR Coifman. The Automated Malnutrition Assessment. Accepted 29 April 2012.
http://www.nutritionjrnl.com. Nutrition (2012), doi:10.1016/j.nut.2012.04.017.

Keywords: Network Algorithm; unsupervised classification; malnutrition screening; protein energy malnutrition (PEM); malnutrition risk; characteristic metric; characteristic profile; data characterization; non-linear differential diagnosis

Summary: We propose an automated nutritional assessment (ANA) algorithm that provides a method for malnutrition risk prediction with high accuracy and reliability. The problem of rapidly identifying risk and severity of malnutrition is crucial for minimizing medical and surgical complications. We characterized for each patient a unique profile and mapped similar patients into a classification. We also found that the laboratory parameters were sufficient for the automated risk prediction.
We here propose a simple, workable algorithm that provides assistance for interpreting any set of data from the screen of a blood analysis with high accuracy, reliability, and inter-operability with an electronic medical record. This has been made possible only recently as a result of advances in mathematics, low computational costs, and rapid transmission of the necessary data for computation. In this example, acute myocardial infarction (AMI) is classified using isoenzyme CK-MB activity, total LD, and isoenzyme LD-1, and repeated studies have shown the high power of laboratory features for diagnosis of AMI, especially with NSTEMI. A later study includes the scale values for chest pain and for ECG changes to create the model.

LH Bernstein, A Qamar, C McPherson, S Zarich.  Evaluating a new graphical ordinal logit method (GOLDminer) in the diagnosis of myocardial infarction utilizing clinical features and laboratory data. Yale J Biol Med 1999; 72(4):259-268. ICID: 825617

The quantitative measure of information, Shannon entropy, treats data as a message transmission. We are interested in classifying data with near errorless discrimination. The method assigns upper limits of normal to tests computed from Rudolph’s maximum entropy definitions of group-based normal reference. Using the Bernoulli trial to determine maximum entropy reference, we determine from the entropy in the data a probability of a positive result that is the same for each test and conditionally independent of other results by setting the binary decision level for each test. The entropy of the discrete distribution is calculated from the probabilities of the distribution. The probability distribution of the binary patterns is not flat and the entropy decreases when there is information in the data. The decrease in entropy is the Kullback-Leibler distance.
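The relation between the entropy decrease and the Kullback-Leibler distance can be written out explicitly. The block below is a short derivation under one assumption taken from the text: the reference is the flat (maximum entropy) distribution over the binary patterns. The symbols N (number of possible patterns), p_i (observed pattern probabilities), and u (the flat reference) are notation introduced here, with terms for p_i = 0 taken as zero.

```latex
H(p) = -\sum_{i=1}^{N} p_i \log_2 p_i ,
\qquad
u_i = \frac{1}{N} \;\Rightarrow\; H(u) = \log_2 N

D_{\mathrm{KL}}(p \,\|\, u)
  = \sum_{i=1}^{N} p_i \log_2 \frac{p_i}{1/N}
  = \log_2 N - H(p)
  = H(u) - H(p)
```

For n binary tests there are N = 2^n patterns, so the maximum entropy is n bits and the distance is n minus the observed entropy.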

The basic principle of separatory clustering is extracting features from endogenous data that amplify or maximize structural information into disjoint or separable classes. This differs from other methods because it finds in a database a theoretic (or greater) number of variables with the required VARIETY that map closest to an ideal, theoretic, or structural information standard. Scaling allows using variables with different numbers of message choices (number bases) in the same matrix: binary, ternary, etc. (representing yes-no; small-modest, large, largest). The ideal number of classes is defined by x^n. In viewing a variable value we think of it as low, normal, high, high high, etc. A system works with related parts in harmony. This frame of reference improves the applicability of S-clustering. By definition, a unit of information is log_r r = 1.

The method of creating a syndromic classification to control variety in the system also performs a semantic function by attributing a term to a Port Royal Class. If any of the attributes are removed, the class loses its meaning. Any significant overlap between the groups would be improved by adding requisite variety. S-clustering is an objective and most desirable way to find the shortest route to diagnosis, and is an objective way of determining practice parameters.

Multiple-test binary decision patterns, with decision values CK-MB = 18 U/L, LD-1 = 36 U/L, %LD1 = 32%.

No.   Pattern   Freq   Probability   Self-information (bits)   Weighted information (bits)
0     000         26   0.1831        2.4493                    0.4485
1     001          3   0.0211        5.5648                    0.1176
2     010          4   0.0282        5.1497                    0.1451
3     011          2   0.0141        6.1497                    0.0866
4     100          6   0.0423        4.5648                    0.1929
6     110          8   0.0563        4.1497                    0.2338
7     111         93   0.6549        0.6106                    0.3999

Entropy (sum of weighted information, i.e. the average): 1.6243 bits

The effective information values are the least-error points. Non-AMI patients exhibit patterns 0, 1, 2, 3, and 4; AMI patients exhibit patterns 6 and 7. There is 1 false positive (fp) in pattern 4 and 1 false negative (fn) in pattern 6. The error rate is 1.4% (2 of 142 patients).
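The table's entropy can be recomputed directly from the pattern frequencies. The sketch below does that and also reports the error rate from the two misclassified patients. The variable names are illustrative, and the 3-bit maximum used for the entropy reduction is an assumption: a flat reference over all eight possible patterns of three binary tests.

```python
import math

# Pattern frequencies copied from the table (pattern 101 does not appear there).
frequencies = {
    "000": 26, "001": 3, "010": 4, "011": 2,
    "100": 6, "110": 8, "111": 93,
}
total = sum(frequencies.values())

rows = {}
for pattern, count in frequencies.items():
    p = count / total                   # probability of the pattern
    self_info = -math.log2(p)           # self-information in bits
    rows[pattern] = (p, self_info, p * self_info)

# Entropy is the probability-weighted average of the self-information.
entropy = sum(weighted for _, _, weighted in rows.values())

# Assumption: the reference is flat over 2**3 = 8 patterns, i.e. 3 bits.
max_entropy = math.log2(2 ** 3)
entropy_reduction = max_entropy - entropy

# One false positive plus one false negative, as stated in the text.
error_rate = 2 / total

for pattern, (p, self_info, weighted) in rows.items():
    print(f"{pattern}  {p:.4f}  {self_info:.4f}  {weighted:.4f}")
print(f"entropy = {entropy:.4f} bits, reduction = {entropy_reduction:.4f} bits")
print(f"error rate = {error_rate:.1%}")
```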


A major problem in using quantitative data is lack of a justifiable definition of reference (normal).  Our information model consists of a population group, a set of attributes derived from observations, and basic definitions using Shannon’s information measure entropy. In this model, the population set and its values for its variables are considered to be the only information available.  The finding of a flat distribution with the Bernoulli test defines the reference population that has no information.  The complementary syndromic group, treated in the same way, produces a distribution that is not flat and has a less than maximum information uncertainty.

The vector of probabilities (1/2, 1/2, …, 1/2) can be related to the path calculated from the Rypka-Fletcher equation,

$$C_t = \frac{1 - 2^{-k}}{1 - 2^{-n}}$$

which determines the theoretical maximum comprehension from the test of n attributes. We constructed a ROC curve from the original iris data of R. Fisher, from four measurements of sepal and petal, with a result obtained using information-based induction principles to determine discriminant points without the classification that had to be used for the discriminant analysis. The principle of maximum entropy, as formulated by Jaynes and Tribus, proposes that for problems of statistical inference (which, as defined, are problems of induction) the probabilities should be assigned so that the entropy function is maximized. Good proposed that maximum entropy be used to define the null hypothesis, and Rudolph proposed that medical reference be defined as at maximum entropy.
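The Rypka-Fletcher equation quoted above can be evaluated numerically. The sketch below reads it as C_t = (1 - 2^-k) / (1 - 2^-n); treating k as the number of binary tests applied so far out of n available attributes is an assumption made here for illustration, since the text defines only n. The function name is hypothetical.

```python
def rypka_fletcher(k: int, n: int) -> float:
    """Evaluate C_t = (1 - 2**-k) / (1 - 2**-n).

    Assumption: k counts the binary tests applied (1 <= k <= n) and
    n is the total number of attributes.
    """
    if not 1 <= k <= n:
        raise ValueError("expected 1 <= k <= n")
    return (1 - 2.0 ** -k) / (1 - 2.0 ** -n)


# Under this reading, C_t rises toward 1 as k approaches n.
for k in range(1, 4):
    print(k, round(rypka_fletcher(k, 3), 4))
```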

Rudolph RA. A general purpose information processing automation: generating Port Royal Classes with probabilistic information. Intl Proc Soc Gen systems Res 1985;2:624-30.

Jaynes ET. Information theory and statistical mechanics. Phys Rev 1957;106:620-30.

Tribus M. Where do we stand after 30 years of maximum entropy? In: Levine RD, Tribus M, eds. The maximum entropy formalism. Cambridge, Ma: MIT Press, 1978.

Good IJ. Maximum entropy for hypothesis formulation, especially for multidimensional contingency tables. Ann Math Stat 1963;34:911-34.

The most important reason for using as many tests as is practicable is derived from the prominent role of redundancy in transmitting information (Noisy Channel Theorem).  The proof of this theorem does not tell how to accomplish nearly errorless discrimination, but redundancy is essential.

In conclusion, we have been using the effective information (derived from the Kullback-Leibler distance) provided by more than one test to determine normal reference and locate decision values. Syndromes and patterns that are extracted are empirically verifiable.
