Friday, December 04, 2020

Text and Data Mining: copyright issues

As information professionals, we focus on providing accurate and timely information to users and on finding techniques to extract and analyze information from structured and unstructured data across various sources. Researchers constantly need data and up-to-date information on which to build their future research and support their findings. Text and Data Mining (TDM) lets them work with far more research content, because large amounts of information can be analyzed electronically. TDM has become an important tool in scientific research and many other domains. From the social sciences, arts, and literature to the scientific fields, TDM is now essential for extracting structured and unstructured data and analyzing it to identify patterns of knowledge. Knowledge discovery through TDM has the potential to lead to major findings in many fields.


Text and Data Mining (TDM) is a computational process of generating information by extracting and analyzing structured and unstructured data.


Article 2(2) of the DSM (Digital Single Market) Directive defines text and data mining (‘TDM’) as:

“any automated analytical technique aimed at analyzing text and data in digital form in order to generate information which includes but is not limited to patterns, trends, and correlations”1.

“Text and data mining (TDM) is the process of deriving information from machine-read material. It works by copying large quantities of material, extracting the data, and recombining it to identify patterns.” (UK Government)

Difference between Text Mining and Data Mining:

Text Mining is the computational process of extracting and analyzing unstructured data, such as the free text of articles, to identify patterns of information.

Data Mining is the computational process of extracting and analyzing structured data, such as database records, to identify patterns of information.

Four Stages of the TDM Process in the JISC Model

1. Documents relevant to the research are identified for processing.
2. These documents are converted into a machine-readable format so that structured data can be extracted. The machine-readable versions of the documents are called "normalized documents".
3. Useful information is extracted from the normalized documents into what are called "derived datasets".
4. The extracted information is mined to discover new knowledge.
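
The four stages above can be sketched in a few lines of Python. This is only an illustrative toy, not the JISC model's own implementation: the sample documents, the function names (identify, normalize, extract, mine), the keyword filter, and the choice of "words shared by every document" as the mined pattern are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical sample corpus; in practice these would be lawfully accessed works.
RAW_DOCUMENTS = {
    "doc1": "<p>Text mining finds PATTERNS in unstructured text.</p>",
    "doc2": "<p>Data mining finds patterns in structured data.</p>",
    "doc3": "<p>Recipes for sourdough bread.</p>",
}

def identify(documents, keyword):
    """Stage 1: keep only documents relevant to the research question."""
    return {k: v for k, v in documents.items() if keyword in v.lower()}

def normalize(documents):
    """Stage 2: strip markup and lowercase to get normalized documents."""
    return {k: re.sub(r"<[^>]+>", " ", v).lower().strip()
            for k, v in documents.items()}

def extract(normalized):
    """Stage 3: build a derived dataset (here, word counts per document)."""
    return {k: Counter(re.findall(r"[a-z]+", v)) for k, v in normalized.items()}

def mine(derived):
    """Stage 4: look for a pattern (here, words shared by every document)."""
    word_sets = [set(counts) for counts in derived.values()]
    return sorted(set.intersection(*word_sets)) if word_sets else []

relevant = identify(RAW_DOCUMENTS, "mining")
normalized = normalize(relevant)
derived = extract(normalized)
shared_words = mine(derived)
print(shared_words)
```

Real TDM projects replace each stage with much heavier tooling (format conversion, entity extraction, statistical analysis), but the copy, extract, and analyze sequence stays the same, which is why the copyright questions below arise.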

Copyright and legal issues involved in TDM activity

Text and data mining involves copying works, extracting data from them, and analyzing that data to generate useful information. Copying a work without the permission of the author or other owner of the work generally infringes copyright. For TDM, however, an exception to copyright exists that allows works to be copied for non-commercial research. The exception covers only works to which the researchers have lawful access, for example through a subscription.

Under the DSM Directive, there are two exceptions to the restrictions on copying for TDM.

1.     Under Article 3, TDM is permitted on copyrighted works to which the user has lawful access. Lawful access, meaning the right to read the works, is described as “access to content based on an open access policy or through contractual arrangements between rights holders and research organizations or cultural heritage institutions, such as subscriptions, or through other lawful means.”2 Research organizations and cultural heritage institutions, including universities, whose primary goal is to conduct scientific research or carry out educational activities are permitted to copy and extract data from copyrighted works if they have “lawful access” to those works. In addition to permitting mining activities, Art. 3(2) allows copies of mined works and other subject matter to be stored securely and retained “for the purposes of scientific research, including for the verification of research results”.

2.     Article 4 permits reproductions of, and extractions from, “lawfully accessible works” for TDM for any purpose. Art. 4 applies only on the condition that rights holders have not expressly reserved their rights “in an appropriate manner, such as machine-readable means in the case of content made publicly available online”.
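
As a rough illustration of what honoring a machine-readable reservation could look like, the sketch below checks a robots.txt file before mining a URL. This rests on assumptions: the Directive does not name a specific mechanism, robots.txt is only one convention a rights holder might use, and the file content, the bot name "example-tdm-bot", the URLs, and the helper may_mine are all hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that a rights holder might publish to opt out.
ROBOTS_TXT = """\
User-agent: example-tdm-bot
Disallow: /articles/

User-agent: *
Allow: /
"""

def may_mine(robots_txt, user_agent, url):
    """Return True if robots.txt does not disallow this agent for this URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse in-memory text, no network call
    return parser.can_fetch(user_agent, url)

can_mine_article = may_mine(ROBOTS_TXT, "example-tdm-bot",
                            "https://example.org/articles/1")
can_mine_about = may_mine(ROBOTS_TXT, "example-tdm-bot",
                          "https://example.org/about")
print(can_mine_article, can_mine_about)
```

Here the article path is reserved for the hypothetical bot while the rest of the site is not. Whether a given signal counts as a reservation made “in an appropriate manner” is a legal question that a technical check like this cannot settle.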

According to Prof. Matthew Sag, “copying expressive works for non-expressive purposes should not be counted as infringement and must be recognized as fair use.”3

TDM is a good example of non-expressive use of copyrighted works: its purpose is not to read the articles but to identify patterns, trends, and correlations through automated analysis of the data they contain. Contract terms from authors and publishers that, without any reason, restrict researchers from carrying out text and data mining on their protected works are unenforceable.
