U.S. Appeals and Supreme Court dataset for: Judicial hierarchy and discursive influence
Data files
Oct 03, 2023 version files 8.08 GB
- gerow_acl_processed.zip (19.88 MB)
- harv_final_aut_corr_cits.csv (40.53 MB)
- harv_final.csv (4.89 GB)
- harv_processed.zip (3.13 GB)
- README.md (6.39 KB)
Abstract
This dataset contains written opinions from the 11 numbered Courts of Appeals and the DC Circuit Court of Appeals (not including the Federal Circuit), as well as the Supreme Court of the United States (SCOTUS). It also contains metadata for each opinion, such as author and year.
In addition, it contains the processed outputs of the rDIM model (Gerow et al. 2018) for the experiments performed in our paper. These results contain the assigned influence and topic distribution for each case.
README: United States Federal Appeals and SCOTUS precedential opinions from 1970-2010 and rDIM outputs
Description of the data and file structure
Input data files
There are two main data files. The first is called harv_final.csv. It contains the written opinions and some basic metadata, enough to run the rDIM model. The columns are:
- id: the unique id attributed to that opinion by the Harvard Caselaw Access Project (Caselaw) (for example, this case https://api.case.law/v1/cases/11301409/ has id 11301409)
- Name: the not-necessarily-unique name of the case from Caselaw
- Year: the year in which the opinion was published
- Text: the raw (uncleaned) text from Caselaw
- Author: the raw (noisy) name of the author(s) of the opinion
- Citation: the theoretically unique identifier of each opinion, composed of a triple: (volume number of the published opinion, reporter abbreviation, first page of the opinion in the volume). For example, in 174 F.3d 599, 174 is the volume, F.3d is the reporter, and 599 is the page.
- riddelName: the Citation rewritten in the format used in The Supreme Court and the Judicial Genre, Livermore et al. 2017 (redundant)
- url: the url of the opinion on Caselaw's website
- Circuit: the name of the court in which the opinion was published
- was_selected: boolean for whether the opinion belongs to an appeals court case selected for review
- link: for a Supreme Court case, the riddelName of the appeals court case that was selected for review and gave rise to it. Not all Supreme Court cases have links, as not all arose from the Federal Appeals Courts. Furthermore, some linked cases are not present in our dataset (outside the date range, or from the Federal Circuit).
- jurisdiction: the "Manner in which the Court takes Jurisdiction", according to the Supreme Court Database (http://scdb.wustl.edu/documentation.php?var=jurisdiction). This is used to determine which cases are in the mandatory jurisdiction of the SCOTUS. Note that some SCOTUS cases (roughly 10%) are missing from the SCDB; these cases have "n/a" in the jurisdiction field. The same holds for all non-SCOTUS cases, for which jurisdiction is inherently inapplicable.
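Because harv_final.csv is several gigabytes, it can help to stream it row by row rather than load it at once. The following is a minimal sketch using only the Python standard library. It assumes the file is UTF-8 encoded with a header row matching the column names above (neither is stated in this README), and the helper names parse_citation and iter_opinions are hypothetical.

```python
import csv
import re

# Pattern for "<volume> <reporter> <first page>", e.g. "174 F.3d 599".
CITATION_RE = re.compile(r"^\s*(\d+)\s+(.+?)\s+(\d+)\s*$")

def parse_citation(citation):
    """Split a Citation string into (volume, reporter, first_page), or None."""
    match = CITATION_RE.match(citation)
    if match is None:
        return None
    volume, reporter, page = match.groups()
    return int(volume), reporter, int(page)

def iter_opinions(path):
    """Yield one dict per opinion without loading the whole file into memory."""
    # Opinion texts can exceed the csv module's default field size limit.
    csv.field_size_limit(2**31 - 1)
    # Assumption: UTF-8 encoding and a header row with the documented columns.
    with open(path, newline="", encoding="utf-8") as handle:
        yield from csv.DictReader(handle)

print(parse_citation("174 F.3d 599"))  # (174, 'F.3d', 599)
```

A typical use would be `for row in iter_opinions("harv_final.csv"): ...`, reading `row["Citation"]` or `row["Text"]` as needed.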
The second file is called harv_final_aut_corr_cits.csv. It contains metadata about the case; in particular, citations and a de-noised author field. The columns are:
- id: same id as in the previous file. (Name, Year, Citation, riddelName, url, Circuit, was_selected, link are duplicated from the previous file)
- cit_count: the number of times each opinion was cited by any other opinion in the Caselaw database
- cit_count_prop: the number of times each opinion was cited by another Appeals Court or SCOTUS case
- Author_orig: the original Author name from Caselaw, which is a noisy signal
- Author_final: the de-noised Author string, based on Author_orig and a list of all judges from the Federal Judicial Center (FJC). If we were unable to determine the judge, we assign "unk*": some variation on unk that denotes where in the de-noising process we established that we could not find the proper author. For example, unk_init means Caselaw didn't know the author; ambig_multi means several judges could have matched; unk_no_similar means we could find no matching judge; unk_regex means that, after filtering out noise tokens such as "judge" or "circuit", we were left with an empty judge string; prqrm_plus means that there is a per curiam opinion as well as another judge's opinion.
- Author_final_conf: a score for our de-noising algorithm. Any score >= 0 means we are fairly confident in our prediction. A score of 0 is extremely confident; a score from 1-2 means there were a few tokens we had to swap (scolia becomes scalia with a score of 1, meaning a Levenshtein distance of 1); a score from 10-12 means the author was from a different appeals court than the one in which the case was argued (the ones place again represents the Levenshtein distance); a score from 20-22 means the judge comes from a different (not necessarily appeals) court than the one in which the case was argued.
- Judge_id: the unique judge ID taken from the FJC
- Party: the party of the appointing president of each judge, taken from FJC
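The Author_final_conf encoding described above can be unpacked mechanically. The sketch below is one reading of that description, not the de-noising code itself: it treats the tens place as the court-mismatch tier and the ones place as the Levenshtein distance, and it includes a textbook edit-distance function to reproduce the scolia/scalia example. The function names and tier labels are hypothetical, and the treatment of negative scores is an inference from "any score >= 0 means we are fairly confident".

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def decode_conf(score):
    """Split a numeric Author_final_conf into (court tier label, edit distance)."""
    if score < 0:
        # Assumption: negative scores fall below the confidence threshold.
        return ("not confident", None)
    # Labels are our own wording for the three documented score ranges.
    tiers = {0: "no court mismatch noted",
             1: "different appeals court",
             2: "different court"}
    tens, ones = divmod(int(score), 10)
    return (tiers.get(tens, "undocumented"), ones)

print(levenshtein("scolia", "scalia"))  # 1
print(decode_conf(11))                  # ('different appeals court', 1)
```

Values read from the CSV arrive as strings, so convert them first, for example `decode_conf(float(row["Author_final_conf"]))`.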
Model output files
There are again two types of model output files, located in the harv_processed.zip file (which expands to the harv_processed directory). The first is in the format harvard_merge_k_{K}.csv for some K in [2,5,10,20,25,30,35,40,45,50]. These files contain some of the same columns as the data files, as well as influence and topic columns:
- topic_{k}_{control/disup}: the proportion of topic {k} in each document, for each run type: control, or disup (in which the lower court's opinion is copied into the place of its linked SCOTUS opinion). Some values of K have no disup run. For each document, the sum of topic_{1..K}_{control/disup} is 1, and each value is always greater than 0.
- Influence_{control/disup}: the influence of each document, as estimated by rDIM, for each run type. This value is generally positive, but can be negative too.
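The stated properties of the topic columns (each proportion is positive, and the K proportions sum to 1 per document and run) make a convenient sanity check after loading a harvard_merge_k_{K}.csv file. This is a minimal sketch: it assumes k runs from 1 to K, as the topic_{1..K} notation suggests, and the helper names and tolerance are hypothetical.

```python
import math

def topic_columns(K, run):
    """Column names topic_1_<run> ... topic_K_<run> (1-based k is assumed)."""
    return [f"topic_{k}_{run}" for k in range(1, K + 1)]

def topics_sum_to_one(row, K, run, tol=1e-6):
    """Return True if the row's topic proportions for `run` are positive and sum to about 1."""
    values = [float(row[name]) for name in topic_columns(K, run)]
    return all(v > 0 for v in values) and math.isclose(sum(values), 1.0, abs_tol=tol)

# Synthetic row for illustration only; real rows come from csv.DictReader.
example_row = {"topic_1_control": "0.25", "topic_2_control": "0.75"}
print(topics_sum_to_one(example_row, K=2, run="control"))  # True
```

The tolerance allows for floating-point rounding in the stored proportions and may need loosening depending on how many digits the files keep.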
The second type of file is the vocab file, called vocab_k_{K}.csv. The control vocab files are located in the "control" sub-directory, and the disup vocab files are located in the "disup" sub-directory. For example, the vocab file for the control run with K=2 is located at control/vocab_k_2.csv. It has columns:
- vocab: the name of each vocab word in the vocabulary.
- wordProb_topic{k}Epoch{e}: the probability of that word for topic k in epoch e
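As an illustration of the vocab file layout, the sketch below builds the relative path for a run and K, then ranks words by probability for one topic and epoch. It reads from any open file handle, so the demonstration uses a small in-memory CSV with made-up words and probabilities. Whether k and e start at 0 or 1 is not stated here, so check the header of the actual file; the function names are hypothetical.

```python
import csv
import io

def vocab_path(run, K):
    """Relative path of a vocab file inside the harv_processed directory."""
    return f"{run}/vocab_k_{K}.csv"

def top_words(handle, topic, epoch, n=10):
    """Return the n most probable words for one topic and epoch."""
    column = f"wordProb_topic{topic}Epoch{epoch}"
    rows = [(row["vocab"], float(row[column])) for row in csv.DictReader(handle)]
    rows.sort(key=lambda pair: pair[1], reverse=True)
    return [word for word, _ in rows[:n]]

# Made-up data for illustration; the index values 0 are an assumption.
sample = io.StringIO(
    "vocab,wordProb_topic0Epoch0\ncourt,0.5\nappeal,0.3\njury,0.2\n"
)
print(vocab_path("control", 2))        # control/vocab_k_2.csv
print(top_words(sample, 0, 0, n=2))    # ['court', 'appeal']
```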
Sharing/Access information
Data was derived from the following sources:
- Caselaw's bulk restricted case text file https://case.law/download/bulk_exports/latest/by_jurisdiction/case_text_restricted/us/
- FJC's list of judges https://www.fjc.gov/history/judges
Code/Software
The zipped code includes its own README.md; please refer to it for instructions.
Methods
The data was curated from four main sources:
- Harvard Caselaw Access Project case.law
- Federal Judicial Center (FJC) list of judges
- A list of federal appeals court cases selected for review, as well as their corresponding SCOTUS opinions, from Livermore et al., "The Supreme Court and the Judicial Genre"
- The Supreme Court Database (SCDB)
The opinions were cleaned using standard text cleaning techniques. The authors were deduced by performing regular expression matches between the noisy Caselaw author field and a list of judges from the FJC.
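To make the author-matching idea concrete, here is a deliberately simplified sketch of matching a noisy author string against a judge list. It is not the algorithm used to build Author_final: the noise-token set contains only the two examples named in this README, matching is by exact surname rather than by regular expression or edit distance, and the judge IDs in the example are invented.

```python
import re

# Only the two noise tokens named in the README; the full list is unknown.
NOISE_TOKENS = {"judge", "circuit"}

def clean_author(raw):
    """Lowercase the string and drop punctuation and noise tokens."""
    tokens = re.findall(r"[a-z']+", raw.lower())
    return [t for t in tokens if t not in NOISE_TOKENS]

def match_author(raw, judges):
    """Return a judge id, or one of the README's labels when no unique match exists.

    `judges` maps a (hypothetical) judge id to a surname.
    """
    tokens = clean_author(raw)
    if not tokens:
        return "unk_regex"       # nothing left after noise filtering
    hits = [jid for jid, surname in judges.items() if surname.lower() in tokens]
    if not hits:
        return "unk_no_similar"  # no judge matches
    if len(hits) > 1:
        return "ambig_multi"     # several judges could have matched
    return hits[0]

# Invented ids for illustration only.
judges = {"j1": "Scalia", "j2": "Posner", "j3": "Posner"}
print(match_author("SCALIA, Circuit Judge.", judges))  # j1
print(match_author("POSNER, Judge", judges))           # ambig_multi
```

A fuller version would add fuzzy matching and court information, which is what the Author_final_conf score summarizes.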