
Category: Computer technologies

Author: Shakhovska N.

The problems arising when working with separate sources, using data from repositories and databases, are analyzed. The notion of information product quality is defined. The main types of uncertainty are described. A method for evaluating the quality of an information product is constructed. This makes it possible to predict the quality of data in the Big data catalogue and to obtain certain effects from its implementation. In particular, the efficiency of searching for dependencies in Big data can be increased.

Keywords: Big data, uncertainty reduction, risk factor, F-dependency, usefulness of an information product

DOI: 10.15587/1729-4061.2018.123064

UNCERTAINTY REDUCTION IN BIG DATA CATALOGUE FOR INFORMATION PRODUCT QUALITY EVALUATION

N. Shakhovska
Doctor of Technical Sciences, Professor, Head of Department*
E-mail: nataliy.b.shakhovska@gmail.com

O. Vovk
PhD, Associate Professor*
E-mail: olenavovk@gmail.com

Yu. Kryvenchuk
PhD, Assistant*
E-mail: yurkokryvenchuk@gmail.com

*Department of Artificial Intelligence, Lviv Polytechnic National University, S. Bandery str., 12, Lviv, Ukraine, 79013

1. Introduction

Big data information technology is a set of methods and means for processing large, dynamic amounts of structured and unstructured data of different types in order to analyze them and use them for decision support. It is an alternative to traditional database management systems and Business Intelligence solutions. This class includes means of parallel data processing (NoSQL, MapReduce algorithms, Hadoop) [1].

Big Data features are:

- working with unstructured and structured information;

- orientation on the fast data processing;

- ineffectiveness of traditional query languages when working with such data.

Information objects that describe a certain subject area, consolidated data, and relationships between objects constitute the Big data catalogue. One of the problems arising in the consolidation process is the indeterminacy of data caused by duplicated, inexact, missing, and contradictory data. Indeterminacy also arises when incorrect links are established between objects. Therefore, indeterminacy has to be reduced in order to improve data quality.

Since the data come from various sources, some data may be missing in one source, while other data may overlap across information products. This leads to duplication, absence, incompleteness, and vagueness of data.

Indeterminacy can arise at the level of an attribute, a tuple, and a relation (indeterminacy in the schema description).

Indeterminacy that appears in an attribute or a tuple spreads, because of the multidimensional representation of data, to all copies of a particular concept.

Since the Big data catalogue contains millions of data items of the subject area, the traditional means of handling indeterminacy (interval mathematics, multi-valued logic) become ineffective because of the large number of operands.

Thus, the specificity of the Big data catalogue (a diverse set of sources, data duplication, ambiguous descriptions of data sources) means that indeterminacy, which in traditional relational databases is considered within a single relation and can occur at the attribute and tuple levels, in this case extends, through the user's perception of information, to the entire Big data catalogue. Therefore, processing indeterminacy in the Big data catalogue requires a different approach, one that was unnecessary in relational databases and data warehouses.

First, information is collected from various sources, and it may be duplicated, contradictory, etc. Then this information is analyzed (search for dependencies, classification, clustering, etc.). Inexact information yields only inexact dependencies, so it cannot be used in decision support systems. As a result, data availability is reduced.

That is why uncertainty reduction in the Big data catalogue is a topical problem.

2. Literature review and problem statement

Let us classify the types of indeterminacy by the nature of their manifestation in the Big data catalogue. One of the first works in this direction is [2].

In [3], it is emphasized that indeterminacy, as an objective form of existence of the real world, is conditioned, on the one hand, by the objective existence of randomness as a form of manifestation of necessity and, on the other hand, by the imperfection of each act of reflecting a real phenomenon in human consciousness. This imperfection cannot be eliminated because of the universal interconnection of all objects of the real world and the infinity of their development. Indeterminacy is expressed in the variety of ways in which possibility is converted into reality and in the existence of a set (as a rule, an infinite one) of states in which a dynamically changing object may be at a future time.

In [4], the following types of indeterminacy are defined according to their nature:

1) the value is unknown (missing);

2) incompleteness of the information;

3) illegibility (use of a distribution to represent a variety of knowledge);

4) inaccuracy (concerns numerical data);

5) non-determination of the procedures for deriving solutions;

6) unreliability of the data;

7) multivalence of interpretations;

8) linguistic indeterminacy.

Let us consider the indicated types of indeterminacy in more detail and find out where they occur in a relation.

Uncertainties of types 3-8 are categorized in [5] as ambiguity of the data and predominantly occur at the level of a tuple or a subset of attribute values.

Null (missing) information most often occurs at the level of an attribute value.

Incompleteness is a condition of a tuple in which some values are missing. Illegibility, inaccuracy, and contingency can be attributed to physical uncertainty, one source of which is the limited precision of numeric data types or the loss of accuracy during mathematical operations (here attribute uncertainty arises from working with intervals).

Unreliability and multivalence of interpretations arise from inexact analysis or ambiguous mapping of objects into a relation. In a relation, this type of uncertainty is modeled using an additional attribute. The values of this attribute express the confidence in a tuple or in a subset of attribute values in a tuple.

The multivalence of interpretations is one of the sources of inconsistencies.

Linguistic uncertainty is connected with the use of natural language for knowledge representation, which has a qualitative nature, and can be caused by misunderstanding of a word or of the meaning of a sentence.

This type of uncertainty occurs in text processing systems (machine translation systems, self-learning systems, etc.).

The reviewed types of indeterminacy can overlap or be a source of one another.

Methods for eliminating missing, inexact, and fuzzy data have already been designed [1-3]. However, it is necessary to elaborate methods that can work with all types of uncertainty [6].

Uncertainty of these types may occur in a database, a data warehouse, and a Big data catalogue (Fig. 2) [7].

[Figure: Types of indeterminacy in the consolidated data in the Big data catalogue and levels of their withdrawal]

Incompleteness at the level of the data warehouse arises from attacks that block a data source, as well as from hiding of information. Indeterminacy at the level of the dictionary and the catalogue of data arises primarily from software failures and from attacks on the data sources.

Incompleteness in the data warehouse is a source of several problems: NULL values in the data cube, a sparse cube, and high computational complexity. Uncertainty in the Big data catalogue makes data integration impossible.

Let us consider the types of indeterminacy in more detail and show how they appear in the data warehouse and in simple data. In [6], it is shown that indeterminacy results from the consolidation of data into a single source (local or virtual); therefore, one has to deal with structured data. As a single source, we will use the relational model.

Missing data occur due to the lack of a description of the required properties in the data catalogue and dictionary. Absence can occur either because the required characteristics are not found in the information products of the Big data catalogue, or because they are not included in the catalogue or dictionary owing to a lack of confidence. To remove this type of indeterminacy, the agent has to be run again, possibly with a reduced level of trust in the data.

The inaccuracy of data occurs at the level of characteristics (attributes in relational databases) and means that the object has a value of the characteristic, but this value is unknown:

$$s = \{A, unk\},$$

where $s$ is the object that describes the characteristics of the consolidated data being processed, $unk$ is the missing value, and $A$ is the subset of the remaining attributes of the tuple of consolidated data:

$$unk \cup A = s, \quad unk \cap A = \emptyset.$$

This type of indeterminacy is represented in the same way as in the data warehouse. Indeterminacy in the data directory is a source of noise in all the information obtained from the source data with an unknown attribute.

Imperfection is a condition of an object in which a subset of characteristic values is missing. If this subset is empty and the relational view of data is considered, we get a traditional tuple. A missing value is also a special case of incomplete information in which the number of unknown attribute values of the tuple is equal to 1. Imperfection may appear both in the data being integrated and in the data dictionary, as a result of failures of the intelligent agent that determines the structure of the source:

$$s = \{A, \{unk\}\}, \quad |unk| < |A|.$$

$$s = \{A, \{unk\}\}, \quad \{unk\} \not\subset A, \quad Design(A) \in \{unk\}.$$

Non-determination of conclusion procedures occurs when intermediate or final results of the decision support procedure have to be saved. It also occurs in the fact table and in aggregated attributes. It is modeled by extending the data scheme and occurs exclusively in the consolidated database:

$$s = s \cup \{unk\}, \quad \{unk\} \not\subset A, \quad Design(s) \in \{unk\}.$$

Unreliability is a type of indeterminacy that is treated as one of the characteristics of the object. Although the nature of this characteristic is uncertain, traditional numerical values are used as the domain of this attribute, and traditional mathematical operations can be applied to its values. It arises as a result of defining the trust in the data source. Unreliability is modeled by an additional attribute in the data directory scheme. The value of this attribute changes as the Big data catalogue operates. It appears as a characteristic inverse to the value of trust in the data source:

$$s = s \cup \{unk_j\}, \quad unk_j \notin A, \quad unk_j = \frac{1}{\ldots}$$

This type of uncertainty is modeled in the same way as in the data warehouse but, unlike in the data warehouse, it arises in the relation (catalogue of data).

Indeterminacies of types 3-8, classified as ambiguity of data, mostly occur at the level of an object or of a subset of the characteristic values that form a tuple. They arise as a result of attacks on the data sources (information products).

Lack of precision occurs due to incomplete study or ambiguous representation of characteristic values. It can be modeled using an additional attribute (or attributes) in the relation scheme. Its values contain the level of confidence in the validity of a subset of the values of non-key attributes.

$$s = \{A, unk_1, unk_2, \ldots, unk_n\}, \quad A \in K, A', \quad 1 \le n \le |A'|,$$

$$unk_{attr} = P_{attr}(i, j), \quad A' \to \{unk_1, unk_2, \ldots, unk_n\}, \quad (4)$$

where $K$ is the set of key values and $A'$ is the subset of the values of non-key attributes.

The level of confidence can be expressed using a numerical scale, linguistic assessments, or fuzzy values.

Inexactitude is a result of mathematical operations and of processing interval values. This type of uncertainty is modeled by an additional attribute and can occur due to the lack of precision in the data dictionary.

Unlike in the data warehouse, this type of uncertainty occurs in the Big data catalogue quite often, because the data being processed are stored on different platforms used to solve different classes of problems.

Multivalence of interpretations is a source of contradictions. This type of indeterminacy arises most often in the data directory when information is obtained from various sources and the validity of the data cannot be determined. To represent this type of indeterminacy, an additional attribute is added to the relation scheme. It contains the degree of confidence in the validity of the tuple. Multivalence of interpretations occurs only in the relation.

Linguistic indeterminacy is connected with the use of natural language in information resources (text files and web resources), which have a qualitative character. It can be caused by misunderstanding (lack of knowledge) of the meaning of a word or by misunderstanding of the meaning of a sentence. This type of indeterminacy occurs in systems for processing textual information (machine translation systems, self-learning systems, etc.). In the context of Big data catalogues, linguistic indeterminacy arises when semi-structured information (texts, web pages, etc.) is processed.

The types of indeterminacy can overlap or be a source of one another. To reduce indeterminacy, we improve the method used for indeterminacy reduction in conventional data warehouses, namely indeterminacy elimination based on knowledge extraction.

An unknown value of the attribute is considered as a class mark, and the problem of eliminating indeterminacy is transformed into a classification problem. This method makes it possible to eliminate indeterminacy of the "unknown" and "imperfection" types at the level of an attribute value and of a subset of attributes. However, for Big data, unlike for data warehouses, the trust level of the data source also has to be considered, that is, indeterminacy has to be handled at the level of the relation.

One of the methods of modeling inexact, imprecise, and partial data is to insert into the catalogue of sources an additional attribute whose value specifies the degree of trust in the indeterminate data.

In [8, 9], the decision tree method was used for uncertainty reduction in Big data. However, this method works well only with structured sources.

In [10], a fuzzy self-organizing map and an algorithm using fuzzy c-means (FCM) were used to model uncertainties based on a centralized batch processing framework. The authors integrated the fuzzy self-organizing map algorithm with the MapReduce framework in order to execute parallel computing on Big data. However, this method can be used only in data processing, not in data preprocessing. In particular, it cannot determine the importance of a data source in the case of duplicated data.

In [11], the types of Big data uncertainty are described. However, the author analyzed only unscalable computation ability, ubiquitous uncertainty, and weak relations, whereas all types of uncertainty should be processed in the Big data catalogue.

In [12], one aspect of uncertainty is addressed by developing a new methodology to establish the reliability of user-generated data based upon causal links with recurring patterns. The authors associate a large data set of geo-tagged Twitter messages in San Francisco with points of interest, such as bars, restaurants, or museums, within the city. This model is validated by causal relationships between a point of interest and the number of messages in its vicinity. However, this model cannot be used for the analysis of multiple data sources.

3. The aim and objectives of the study

The aim of the study was to create a method for reducing each type of uncertainty in order to increase the quality of Big data analysis. The definition of information product (InP) quality was also given. The model of consolidated data makes it possible to find the probability that a data source is exact. This allows evaluating the usefulness of the information product for the Big data analysis process.

To achieve this aim, the following objectives had to be accomplished:

1. Development of a new model of consolidated data for Big data catalogue creation.

2. Improvement of the method of reducing the indeterminacy of consolidated data.

3. Development of the method for determining the viability of an InP based on the method of indeterminacy reduction.

4. Development of the model of consolidated data

The model of consolidated data is a finite set of attributes $\{A_1, A_2, \ldots, A_n\}$, a set of attributes $\{A\_unk_1, A\_unk_2, \ldots, A\_unk_p\}$ with fuzzy or non-deterministic values, and a set of attributes $\{Unk_1, Unk_2, \ldots, Unk_m\}$ whose domains are numerical data, probabilistic data, values of membership functions of fuzzy sets, truth degrees of multi-valued logic, percentages, coefficients, various scales, or linguistic estimates. The scheme of consolidated data also includes the scheme of the synonym dictionary $Dic$ and the model of the Big data catalogue $Cg$ [13].

The tuple of consolidated data $dc$ is a set of values that describes an object:

$$dc = \langle C, C\_unk, Unk, \{dic\}, \{cg\} \rangle,$$

where $C$ is the subset of attribute values with exact values, $C\_unk$ is the subset of attribute values with fuzzy and non-deterministic values, $Unk$ is the subset of attribute values with the truth degrees of the attributes $C\_unk$, $\{dic\}$ is the set of values of the data dictionary, and $\{cg\}$ is the set of values from the data directory.

The data warehouse of consolidated data is a set of relations with the scheme $Cg$ and a set of tuples of consolidated data $dc$.

The model of consolidated data contains data from all types of sources of Big data catalogue.

5. Development of operations on the model of consolidated data

Because the data warehouse of consolidated data is an extension of a data warehouse built on the relational model, we improve its operations.

To process and analyze indeterminacy with relational operators in a query, the selection operator over the values of the set of attributes $Unk$ should be used. In the data warehouse, the slice (cut) operation is its analogue. Let $r$ and $s$ be relations with the scheme $R$, and let $r'$ and $s'$ be relations with the scheme $R \cup Unk \cup Dic \cup Cg$. Then $r \cap s$, $r \cup s$ and $r - s$ are relations with the scheme $R$, and $r' \cap s'$, $r' \cup s'$ and $r' - s'$ are relations with the scheme $R \cup Unk \cup Dic \cup Cg$.

Considering the probability of attacks (indeterminacy of the "multivalence" type), we choose those data sources whose level of trust is higher than that of similar ones:

$$r = r \cup \sigma_{\max(P(\pi(Cg)))}(Dic) \cup Cg.$$

The scheme of consolidated data is

$$Cg' = \langle \{C_1, C_2, \ldots, C_n\}, \{C\_unk_1, C\_unk_2, \ldots, C\_unk_p\}, \{Unk_1, Unk_2, \ldots, Unk_m\}, Dic, Cg \rangle.$$

The tuple of consolidated data $dc$ is the information description of the object $t$ of the data source $S$, presented as a tuple of values of characteristics (attributes). The subset of attributes contains data on the object, the data source, and synonymous names of the object; these data can be incomplete, fuzzy, or non-deterministic. The object presented in this tuple exists, but part of the information on it is absent, incomplete, fuzzy, non-deterministic, etc.

The values of the consolidated data attributes are divided into groups.

1. Exact (known): the values of the primary key and the foreign key (which may be absent). We denote them by $C$.

2. Absent: the information is physically missing. We use $\perp$ for this group.

3. Indeterminate: the set of attributes $Unk$ is used for subsets of attributes; $Unk$ indicates the truth degree of these attributes. By default, the attribute $Unk$ is assigned the value that means the highest degree of truth.

Note that, in the case of absolute trust in each value of a tuple, we obtain a traditional relational tuple and can apply traditional operations to it.

The extension of the relation works correctly if the $Unk$ attribute assigns the lowest degree of trust to all values (a priori, the information entered into the relation is considered truthful and complete, and nothing is known about the remaining information). This way of representing the degree of validity is chosen by default, based on the principle of isolation.

The slice (cut) operator involves analysis of the fuzzy value set for the values of the attributes $Unk$.

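$$\sigma_{(Unk\,\Theta\,unk) \cup (C\_unk\,\Theta\,c\_unk) \cup \sigma_C(Dic) \cup \sigma_C(Cg)}(cg') = \{\, t \in dc \mid t(Unk)\,\Theta\,unk,\ t(C\_unk)\,\Theta\,c\_unk,\ \sigma_C(Dic)\ \text{Is Not NULL},\ \sigma_C(Cg)\ \text{Is Not NULL},\ unk = P(cg') \,\},$$

where $\Theta$ is the set of binary relation symbols (marks) on pairs of domain values. For each attribute $C\_unk$, comparison operations are used. As a rule, only $=, \ne, <, \le, >, \ge$ are used.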
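The slice over consolidated data can be illustrated with a minimal Python sketch. It is an assumption-laden illustration, not the authors' implementation: the class `ConsolidatedTuple`, the function `slice_consolidated`, and the threshold parameter `unk_level` are hypothetical names, the truth degree is assumed to be a number in [0, 1], and `None` stands in for NULL dictionary and catalogue entries.

```python
import operator
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional


@dataclass
class ConsolidatedTuple:
    values: Dict[str, Any]     # values of the C and C_unk attributes
    unk: float                 # truth degree (Unk attribute), assumed in [0, 1]
    dic: Optional[str] = None  # synonym dictionary entry; None models NULL
    cg: Optional[str] = None   # catalogue (source) entry; None models NULL


def slice_consolidated(
    tuples: List[ConsolidatedTuple],
    attr: str,
    value: Any,
    theta: Callable[[Any, Any], bool] = operator.eq,
    unk_theta: Callable[[float, float], bool] = operator.ge,
    unk_level: float = 0.5,
) -> List[ConsolidatedTuple]:
    """Keep tuples whose attribute and truth degree both satisfy a comparison."""
    result = []
    for t in tuples:
        # Dictionary and catalogue entries must be present (Is Not NULL).
        if t.dic is None or t.cg is None:
            continue
        if attr not in t.values:
            continue
        # t(Unk) theta unk  and  t(C_unk) theta c_unk
        if unk_theta(t.unk, unk_level) and theta(t.values[attr], value):
            result.append(t)
    return result


rows = [
    ConsolidatedTuple({"city": "Lviv"}, 0.9, "Lviv", "src1"),
    ConsolidatedTuple({"city": "Lviv"}, 0.3, "Lviv", "src2"),
    ConsolidatedTuple({"city": "Lviv"}, 0.8, None, "src3"),
    ConsolidatedTuple({"city": "Kyiv"}, 0.95, "Kyiv", "src1"),
]
trusted = slice_consolidated(rows, "city", "Lviv", unk_level=0.5)
```

Passing the comparison as a function (`operator.eq`, `operator.ge`, and so on) mirrors the role of the comparison symbols in the operator; with `unk_level=0.0` the sketch degenerates to an ordinary selection over tuples that have dictionary and catalogue entries.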

The extended slice operator is distributive with respect to binary set operations:

$$\sigma_a(r'\,\gamma\,s') = \sigma_a(r')\,\gamma\,\sigma_a(s'),$$

where $\gamma = \cap$, $\cup$ or $-$, and $r'$ and $s'$ are relations over the same scheme.

The drill-down operation of the data warehouse is analogous to the projection operation in the relational model. To perform projection on consolidated data, one should find the connection between the subset of attributes $Unk$ and the subset of attributes $C\_unk$ and check the dictionary $Dic$ for synonyms of the attribute name $C\_unk$. Therefore, the improved drill-down operation is presented as follows:

$$\text{drill-down}: \pi^{cons}_{X}(cg') = \mathrm{IIF}\big(\neg\,\mathrm{ISNULL}(\sigma_{Cg=R \wedge C\_unk=X}(c\_unk));\ \pi_{X \cup Unk}(\sigma_{Cg=meta(C\_unk_i, Unk_i)=1}(c\_unk));\ \mathrm{IIF}(\sigma_{C \cup C\_Unk=X}(Dic);\ \pi_{\sigma_{C \cup C\_Unk=X}(Dic)}(r);\ \pi_X(dc))\big),$$

where IIF(condition; operation1; operation2) is the conditional operation (analogous to the CASE expression of the SQL-92 standard): if the condition holds, operation 1 is performed, otherwise operation 2; ISNULL(r) is the logical operator that returns true if the relation $r$ (its operand) contains no tuples and false otherwise. The synonym of the attribute also has to be searched for in the dictionary of synonyms $Dic$ ($\sigma_{C \cup C\_Unk=X}(Dic)$) and substituted ($\pi_{\sigma_{C \cup C\_Unk=X}(Dic)}$).

The connection (join) operator is used to link related fact relations and dimension relations in consolidated data, since it is based on the relational model.

The traditional connection operator cannot be used for the Big data catalogue and for a data warehouse with consolidated data, because statistical analysis requires connecting related fact and dimension relations. If the subset of attributes $Unk$ is non-empty for the facts and dimensions, such a connection is incorrect. The connection operator is also affected by the need not only to connect by the attributes specified as input parameters, but also to check for synonyms in the dictionary of synonyms $Dic$. To improve the connection operator, one should consider the cases in which the relations connect completely or incompletely. For completely connecting relations, the set of input attributes $Unk$ does not affect the connection operation. If the set of attributes $Unk$ contains indeterminacy for a foreign key of the relation being connected, then this measure of indeterminacy is transferred to all the other attribute values of this relation. In the case of an incomplete connection, for tuples from subordinate tables that do not occur in the relation, the value of the attribute $Unk$ will be equal to the highest degree of confidence.

$$\text{across}: r \bowtie cg' = \mathrm{IIF}\big(\sigma_{C \cup C\_Unk = X}(Dic);\ \pi_{(R, B, NVL(Unk, min))}(r \bowtie cg');\ \pi_{\sigma_{C \cup C\_Unk = X}(Dic)}(r \bowtie cg')\big);$$

Otherwise, the left connection by the common attributes is performed, followed by a projection over the resulting relation. The result of this operation is a connection in which the subset of the $Unk$ attributes is empty, and the min value is saved in $Unk$.

It should be noted that when the dictionary of synonyms is empty ($Dic = \emptyset$) and the probability associated with the data sources as a whole and with their characteristics is equal to 1 ($Unk = 1$), we obtain the traditional relational connection.
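The left-connection branch with NVL(Unk, min) can be sketched in Python as follows. This is a simplified illustration under stated assumptions: relations are lists of dictionaries, the function name `across` and the parameter `min_conf` (standing for the lowest level of trust) are chosen here for illustration, and the synonym lookup in Dic is deliberately left out.

```python
from typing import Any, Dict, List, Optional

Row = Dict[str, Any]


def across(r: List[Row], cg: List[Row], on: str, min_conf: float = 0.0) -> List[Row]:
    """Left join of r with cg on attribute `on`, carrying the Unk confidence."""
    index: Dict[Any, List[Row]] = {}
    for right in cg:
        index.setdefault(right[on], []).append(right)

    result: List[Row] = []
    for left in r:
        matches = index.get(left[on], [])
        if not matches:
            # No counterpart in cg: keep the left tuple with the lowest confidence.
            result.append({**left, "Unk": min_conf})
            continue
        for right in matches:
            unk: Optional[float] = right.get("Unk")
            merged = {**right, **left}
            # NVL(Unk, min): a missing confidence is replaced by min_conf.
            merged["Unk"] = min_conf if unk is None else unk
            result.append(merged)
    return result


facts = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
catalogue = [
    {"id": 1, "src": "s1", "Unk": 0.9},
    {"id": 1, "src": "s2", "Unk": None},
]
joined = across(facts, catalogue, on="id", min_conf=0.1)
```

When every catalogue tuple carries `Unk = 1` and every fact has a counterpart, the sketch behaves like an ordinary relational join, which matches the special case noted in the text.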

6. Reduction of indeterminacy of consolidated data

The analysis of large amounts of data requires identifying groups of attributes that form a functional dependency (FD). However, real-world data sets much more often contain important dependencies that hold only on a subset of the values of key attributes; we call such dependencies partial functional dependencies. That is, a partial functional dependency is an FD defined on some fixed selection of the relation:

$$F_p: K = \{a_i\},\ a_i \in A,\ D = \{a_j\},\ a_j \in A,\ R' \subset R:\ K \to D|_{R'}. \quad (15)$$

Many dependencies are not strictly determined; we call them probabilistic productive relationships.

A probabilistic productive relationship is a production rule on a selection of the base relation that holds for a significant number of objects of this selection. The significance threshold should be determined by experts or calculated from the probability of selecting this relationship falsely.

$$F: K = \{a_i\},\ a_i \in A,\ D = \{a_j\},\ a_j \in A:\ P(k \in K \to d \in D) = p, \quad (16)$$

where $k$ and $d$ are the tuples of values of the groups of attributes $K$ and $D$, respectively.

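The main indicator of the reliability of such a dependence is the ratio of the number of objects for which the probabilistic productive relationship holds to the number of objects in the selection:

$$P(k \in K \to d \in D) = \frac{|\sigma_{k \in K \wedge d \in D}(R)|}{|\sigma_{k \in K}(R)|}.$$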
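As a minimal illustration of this ratio, the following Python sketch counts the tuples of a selection that also satisfy the dependent part. The function name `reliability` and the toy relation are assumptions made for the example; the premise and the conclusion are passed as predicates over a tuple.

```python
from typing import Any, Callable, Dict, List, Optional

Row = Dict[str, Any]


def reliability(
    rows: List[Row],
    premise: Callable[[Row], bool],
    conclusion: Callable[[Row], bool],
) -> Optional[float]:
    """Share of tuples in the selection (premise) that also satisfy the conclusion."""
    selection = [row for row in rows if premise(row)]
    if not selection:
        return None  # the ratio is undefined on an empty selection
    satisfied = sum(1 for row in selection if conclusion(row))
    return satisfied / len(selection)


relation = [
    {"dept": "AI", "building": "A"},
    {"dept": "AI", "building": "A"},
    {"dept": "AI", "building": "A"},
    {"dept": "AI", "building": "B"},
    {"dept": "CS", "building": "C"},
]
p = reliability(
    relation,
    premise=lambda row: row["dept"] == "AI",
    conclusion=lambda row: row["building"] == "A",
)
```

Here three of the four tuples in the selection `dept = "AI"` satisfy `building = "A"`, so the rule would be accepted only if the expert-defined significance threshold does not exceed 0.75.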

In the connection (across) operator of Section 5, $r$ is the traditional relation, $cg'$ is the relation with the consolidated data, $R$ is the set of attributes of the relation $r$, $S$ is the set of attributes of the relation $cg'$ not including the subset of attributes $Unk$ ($Cg' = Cg \cup Unk$), $B$ is the set of attributes from $S$ that are not covered by the relation $r$ ($B \subset Cg$, $B \not\subset Cg \cap R$), min is the value that means the lowest level of trust, $NVL(Unk, min)$ is the operation that assigns min to all values of $Unk$ for the connected tuples of $cg'$, and $\bowtie$ is the left connection. It is necessary to check the connections of synonyms ($\sigma_{C \cup C\_Unk = X}(Dic)$). If there are none, the left connection of the relations with the schemes $S'$ and $R$ and the projection onto the synonymous attributes are performed.

A classification rule is a probabilistic productive relationship between the subsets of attributes $X$ and $Y$ in the data warehouse with consolidated data $cg'$ that holds in the test set $cg'$ with a degree of conformity (trust) $s$, where $(X = x) \to (Y = y)$.

The classification rule is constructed from a training data set $cg'$ in which the class mark (the value of the attribute subset $Y$) is known. The classification rule is generally built for the scheme $cg'$ and therefore is not affected by new tuples arriving in the relation of the consolidated data repository (independence from the test set).

A class mark is a linguistic variable or a traditional object characteristic that is the value of the subset of attributes $Y$ and marks objects with similar (similar with degree $s$) values of the subset of attributes $X$. The domains of the attributes that belong to the subset $Y$, $y \in dom(Y) = \pi_Y(Cg')$, must contain a finite and previously known set of values.

Class marks are selected from a predefined set of values (they are known in the test data set), and objects whose information has just arrived in the data warehouse with the consolidated data are assigned to a class on the basis of classification rules. The marks are added automatically, since the flow of new data into the data space is also dynamic.

The reliability of such a relationship is calculated by decomposing it into the components of the probabilistic productive relationship:

$$P(s \in S \to t \in T) = \sum_i P(s \in S \to t = t_i) = \frac{\sum_i |\sigma_{s \in S \wedge t = t_i}(R)|}{|\sigma_{s \in S}(R)|}.$$

As in the case of F-dependencies (functional dependencies), the set of classification rules that hold in a given relation can be represented by a subset from which all the classification rules of the relation can be obtained by inference rules. Since classification rules are an extension of F-dependencies, the axioms of functional dependencies should be transformed for classification rules.

Reflexivity. $P(s \in S \to s \in S) = 1$ for any relation $r(R)$:

$$P(s \in S \to s \in S) = \frac{|\sigma_{s \in S}(R)|}{|\sigma_{s \in S}(R)|} = 1.$$

Replenishment (augmentation). If $P(s \in S \to t \in T) = p$, then $P(s \in S \wedge w \in D(W) \to t \in T) = p$. Proof:

$$P(s \in S \wedge w \in D(W) \to t \in T) = \frac{|\sigma_{s \in S \wedge w \in D(W) \wedge t \in T}(R)|}{|\sigma_{s \in S \wedge w \in D(W)}(R)|} = \left[\forall x \in r:\ \pi_W(x) = w \in D(W)\right] = \frac{|\sigma_{s \in S \wedge t \in T}(R)|}{|\sigma_{s \in S}(R)|} = P(s \in S \to t \in T) = p.$$

Additivity. If $P(s \in S \to t \in T) = p$ and $P(s \in S \to w \in W) = 1$, then $P(s \in S \to t \in T \wedge w \in W) = p$. Proof:

$$P(s \in S \to t \in T \wedge w \in W) = \frac{|\sigma_{s \in S \wedge t \in T \wedge w \in W}(R)|}{|\sigma_{s \in S}(R)|} = \frac{|\sigma_{s \in S \wedge t \in T}(R)|}{|\sigma_{s \in S}(R)|} = P(s \in S \to t \in T) = p. \quad (21)$$
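A small worked example may help to check the additivity property. The counts below are invented purely for illustration: suppose the selection $s \in S$ contains 4 tuples, 2 of which also satisfy $t \in T$, and all 4 of which satisfy $w \in W$.

```latex
\begin{align*}
P(s \in S \to t \in T) &= \frac{|\sigma_{s \in S \wedge t \in T}(R)|}{|\sigma_{s \in S}(R)|} = \frac{2}{4} = 0.5, \\
P(s \in S \to w \in W) &= \frac{|\sigma_{s \in S \wedge w \in W}(R)|}{|\sigma_{s \in S}(R)|} = \frac{4}{4} = 1, \\
P(s \in S \to t \in T \wedge w \in W) &= \frac{|\sigma_{s \in S \wedge t \in T \wedge w \in W}(R)|}{|\sigma_{s \in S}(R)|} = \frac{2}{4} = 0.5.
\end{align*}
```

Because every tuple of the selection satisfies $w \in W$, adding this condition to the conclusion removes no tuples from the numerator, so the probability stays equal to $p = 0.5$.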

Uncertainties among the values of the attribute $Y$ in the relation $r$ are eliminated by classification using a modified chase algorithm.

The essence of the method:

1) search for tuples with the same values in the set of attributes X;

2) search for tuples with the same values in the set of synonym attributes of X;

3) calculation of the level of confidence in the source of tuple obtained in steps 1) and 2);

4) calculation of confidence to attribute sources of tuple obtained in steps 1) and 2);

5) determining the tuples with the highest level of confidence.
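The five steps above can be sketched in Python as follows. The sketch makes several assumptions that are not taken from the paper: tuples are dictionaries, synonyms are given as a mapping from an attribute name to its canonical name, the source confidence and the attribute confidence (`Unk`) are combined by multiplication, and the names `resolve_class`, `source_trust`, and `y_attr` are hypothetical.

```python
from typing import Any, Dict, List, Optional, Tuple

Row = Dict[str, Any]


def resolve_class(
    target: Row,
    tuples: List[Row],
    x_attrs: List[str],
    synonyms: Dict[str, str],
    source_trust: Dict[str, float],
    y_attr: str = "Y",
) -> Optional[Tuple[Any, float]]:
    """Return (class value, confidence) from the most trusted matching tuple."""

    def canonical(row: Row) -> Row:
        # Step 2: map synonym attribute names to their canonical names.
        return {synonyms.get(name, name): value for name, value in row.items()}

    wanted = canonical(target)
    best: Optional[Tuple[Any, float]] = None
    for row in map(canonical, tuples):
        if row.get(y_attr) is None:
            continue  # this tuple cannot supply a class mark
        # Steps 1-2: the tuple must agree with the target on all X attributes.
        if any(row.get(a) != wanted.get(a) for a in x_attrs):
            continue
        # Steps 3-4: confidence in the source times confidence in the attributes
        # (the product is an assumption of this sketch).
        confidence = source_trust.get(row.get("source"), 0.0) * row.get("Unk", 1.0)
        # Step 5: keep the tuple with the highest confidence.
        if best is None or confidence > best[1]:
            best = (row[y_attr], confidence)
    return best


known = [
    {"name": "Ivan", "city": "Lviv", "Y": "student", "source": "s1", "Unk": 0.9},
    {"full_name": "Ivan", "city": "Lviv", "Y": "teacher", "source": "s2", "Unk": 1.0},
    {"name": "Olena", "city": "Lviv", "Y": "student", "source": "s2", "Unk": 1.0},
]
answer = resolve_class(
    {"name": "Ivan", "city": "Lviv", "Y": None},
    known,
    x_attrs=["name", "city"],
    synonyms={"full_name": "name"},
    source_trust={"s1": 0.9, "s2": 0.5},
)
```

In this toy data the second tuple matches only through the synonym `full_name`, and the first tuple wins because its combined confidence is higher.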

To be able to classify objects, it is necessary to build classification functions. In general, the data space can store information about several types of classes, and each class type has its own subset of features. One and the same function can be used to specify multiple types of classes.

Classification functions are modified functional dependencies that hold for a specific subset of tuples in the consolidated data repository.

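The classification algorithm ($\downarrow$ marks a defined value, $\perp$ an absent one):

1) If $\sigma(cg') = \{dc_1(X_1)\downarrow, \ldots, dc_1(X_n)\downarrow\}$ and $\{dc_2(X_1)\downarrow, \ldots, dc_2(X_n)\downarrow\}$
And $\{dc_1(X_1)\downarrow, \ldots, dc_1(X_n)\downarrow = dc_2(X_1)\downarrow, \ldots, dc_2(X_n)\downarrow\}$
And $\{dc_1(Y)\downarrow\}$ and $\{dc_2(Y) = \perp\}$ and if $\sigma_{X_i}(Dic) = \emptyset$
Then replace $\perp$ by $dc_1(Y)$ and
$dc_2(P) = dc_2(P) / (\sum m_1 / n)$.

2) If $\{dc_1(X_1)\downarrow, \ldots, dc_1(X_n)\downarrow\}$
And {in $dc_2$, $m$ of the $n$ attribute values are $\downarrow$ and $n - m$ attribute values are $\perp$, $m < n$}
And $\{P > 1 - m/n\}$ and {on the defined values $dc_1(X_m)\downarrow = dc_2(X_m)\downarrow$}
And $\{dc_1(Y)\downarrow\}$ and $\{dc_2(Y) = \perp\}$,
Then replace $\perp$ by $dc_1(Y)$ and
$dc_2(P) = dc_2(P) / (\sum \ldots)$.

3) If {in $dc_i$, $m_i$ of the $n$ attribute values are $\downarrow$, $m_i < n$}
And {in $dc_j$, $m_j$ of the $n$ attribute values are $\downarrow$, $m_j < n$}
And {on the defined values $dc_i(X_m)\downarrow = dc_2(X_m)\downarrow$}
And {on the defined values $dc_j(X_m)\downarrow = dc_2(X_m)\downarrow$}
And $\{m/n < 1\}$ and $\{P > 1 - m/n\}$
And $\{dc_i(Y)\downarrow\}$ and $\{dc_j(Y)\downarrow\}$ and $\{dc_2(Y) = \perp\}$,
Then replace $\perp$ by $dc_j(Y)$ and
$dc_2(P) = dc_2(P) / (m_2'/n)$.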

7. The method for determining the viability of an InP

Viability is the degree to which an InP is used in a specific subject area to achieve a specific goal with appropriate efficiency, productivity, and satisfaction of needs over finite time intervals.

The viability of an InP is a function of the time of creation and of the metadata (number of InPs, technical solutions, etc.).

The method for determining the viability of the InP consists of the following steps:

1. Calculation of InP characteristics.

2. Expert definition of the weight characteristics.

3. Calculation of viability [14].
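The three steps can be sketched in Python as shown below. The aggregation formula itself is not given in this section (it is referenced as [14]), so the normalized weighted sum used here is only an assumption made for illustration, and the function name `viability` and the sample numbers are hypothetical.

```python
from typing import Dict


def viability(characteristics: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted mean of InP characteristics (an assumed aggregation, not from [14])."""
    missing = set(characteristics) - set(weights)
    if missing:
        raise KeyError(f"no expert weight for: {sorted(missing)}")
    for name, value in characteristics.items():
        # Step 1 yields dimensionless characteristics in [0, 1].
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"characteristic {name} is outside [0, 1]")
    # Step 2: expert weights, normalized so that the result stays in [0, 1].
    total_weight = sum(weights[name] for name in characteristics)
    if total_weight <= 0:
        raise ValueError("the sum of weights must be positive")
    # Step 3: aggregate the weighted characteristics.
    weighted = sum(weights[name] * value for name, value in characteristics.items())
    return weighted / total_weight


score = viability({"V": 1.0, "O": 0.5}, {"V": 3.0, "O": 1.0})
```

With these sample values the importance of information dominates the score because its expert weight is three times larger than that of the service parameter.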

Step 1. The set of significant characteristics of a site as an information product (dimensionless), obtained on the basis of theoretical and experimental studies, is given as:

$$Y = \{V, K, A, Km, O, Ac, N, Ms, C, Pr\}, \quad Y \to [0 \ldots 1]. \quad (22)$$

Importance of information $V$ ($y_1$) is a parameter that has a dynamic character and exists only at the moment of interaction of data and methods in the information process for a particular social group ($\varsigma_i$ is the social group of type $i$):

$$V_{\varsigma_i} = \frac{N_{\varsigma_i}(t)}{\sum N_{\varsigma_i}}, \quad t_1 < t_i < t_2,$$

$$K(Ip_j) = \sum_i P(Ip_j.IR \mid t_i)\, V(D_i \mid Ip_j.IR),$$

where $P(Ip_j.IR)$ is the probability of receiving information messages from $Ip_j.IR$ at the moment of time $t_i$, $t_1 < t_i < t_2$, $D_i$ is the current value of the decision for $Ip.IR$, and $Ip.IR$ is the information resource of the InP.

Adaptability A (y3) (compliance with user requirements) is formed on the basis of an assessment of the ratio of information and intellectual resources in relation to K which makes it possible to determine the number of components (modules) in the InP, that is:

K(|ot( (IR)\\ + |ot( (HR)) |IR| + \\HR\\ &

where |ot(IR)| is the amount of information resource per time ti, | ot (HR) | is the amount of intellectual resource per time ti, |/Ri, |HR| is the total amount of information and intellectual resources, respectively.

Convenience of communication with users Km (y4) is the parameter describing the appearance or ease of use according to the expert&s assessment, in which the InP is available for the maximum number of users. Their weight is estimated by the hierarchy analysis method (the parameter is determined according to the evaluation of the expert - Q):

Km = I Km, x Q

& I Km, &

Km, = 1 Q = 0...1

InP service O (ys) is the depth of linking (the number of transitions from the main link to the required one), etc. This parameter depends on the degree to which the IP meets modern requirements:

O = —, m is the depth of linking. m

InP availability Ac (y6) determines how freely users can use the InP (which was evaluated by the expert, the values are given in Table 1).

Scale of ranking of "Availability"

Interpretation of influence Value range

Complex access 0.0-0.2

Average access 0.3-0.6

Easy access 0.7-1.0

Prevalence of IP N (y7) is the parameter that determines the number of IPs of this type:

where V is the importance of information for the type of social group çs ; t is the time of its using; N (t) is the number of information messages for a social group over a period of time t; ^ Nç is the total number of information messages in this social group.

Usefulness of information messages K (y2) is the parameter that characterizes compliance with the needs of the user, that is, assessing the relevance of information messages in Ip.IR :

COUNT (o

mimeAÏp _ type

COUNT (o Ip_type (M ))

name Alp _type

(M ) is the operation of sampling from

metadata by the type of information product type and its name name, COUNT is the quantity determination function,

(M ) is the sampling operator of metadata by type.

Ip _type
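The COUNT-based prevalence ratio can be illustrated with a short Python sketch. This is a minimal sketch and not the authors' implementation: the `MetadataRecord` class, its `name` and `ip_type` fields, and the sample catalogue are hypothetical names introduced for illustration, and the selection operator over the metadata M is modelled as a plain list filter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetadataRecord:
    name: str     # name of the information product (hypothetical field)
    ip_type: str  # type of the information product (hypothetical field)


def prevalence(metadata, name, ip_type):
    """Ratio of records matching both name and type to records matching the type."""
    same_type = [r for r in metadata if r.ip_type == ip_type]
    if not same_type:
        # No product of this type in the metadata: treat prevalence as zero.
        return 0.0
    matching = [r for r in same_type if r.name == name]
    return len(matching) / len(same_type)


# Hypothetical metadata of a small catalogue.
catalogue = [
    MetadataRecord("weather-feed", "site"),
    MetadataRecord("weather-feed", "site"),
    MetadataRecord("news-portal", "site"),
    MetadataRecord("sales-dump", "dataset"),
]
n_value = prevalence(catalogue, "weather-feed", "site")
```

Returning zero for an unknown type is a design choice of this sketch; the source does not say how an empty selection should be handled.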

Attendance of the InP Ms (y8) is the parameter that determines the number of users:

$$Ms = \frac{COUNT(\sigma_{user}(Ip))}{COUNT(\sigma(Ip))},$$

where $COUNT(\sigma_{user}(Ip))$ is the number of users of the information product Ip.

Social affiliation C (y9) is the parameter that defines the circle of users of the given InP (fuzzy ratio of InP positioning):

$$C = \frac{COUNT(\sigma_{\varsigma = \varsigma_i}(Ip))}{COUNT(\sigma_{\varsigma}(Ip))}, \quad i = 1 \ldots n,$$

where $\varsigma_i$ is the type i social group.
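The share of each social group among the users of one information product can be sketched as follows. This is only an illustration under stated assumptions: the function name `social_affiliation`, the representation of users as a list of group labels, and the sample labels are hypothetical and do not come from the source.

```python
from collections import Counter


def social_affiliation(user_groups):
    """Return, for each social group, its share among the users of one InP.

    user_groups is a list with one social-group label per user (assumed input format).
    """
    counts = Counter(user_groups)
    total = sum(counts.values())
    if total == 0:
        # No users recorded: no shares can be computed.
        return {}
    return {group: count / total for group, count in counts.items()}


# Hypothetical users of one information product, labelled by social group.
shares = social_affiliation(["student", "student", "researcher", "student"])
```

Each share lies in [0, 1] and the shares sum to 1, which matches the dimensionless range required of the characteristics.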

Value Pr (y10) is the cost of operating the InP. The ranking scale of the "Cost" characteristic is shown in Table 2.

Table 2. Ranking scale of the "Cost" characteristic

| Interpretation of influence | Value range |
|---|---|
| Lack of value/low price | 0.7-1.0 |
| Average cost | 0.4-0.6 |
| High price | 0.0-0.3 |
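Step 1 yields ten dimensionless values, one per characteristic. A minimal Python sketch of collecting them into an ordered vector and checking the [0, 1] range is given here; the function name `build_characteristic_vector` and the dictionary input format are assumptions made for illustration, not part of the described method.

```python
# Characteristic names in the order used for the set Y.
CHARACTERISTICS = ("V", "K", "A", "Km", "O", "Ac", "N", "Ms", "C", "Pr")


def build_characteristic_vector(values):
    """Return the characteristic values as a list ordered like CHARACTERISTICS.

    Raises ValueError if a characteristic is missing or lies outside [0, 1].
    """
    missing = [name for name in CHARACTERISTICS if name not in values]
    if missing:
        raise ValueError(f"missing characteristics: {missing}")
    vector = []
    for name in CHARACTERISTICS:
        value = float(values[name])
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name}={value} is outside [0, 1]")
        vector.append(value)
    return vector


# Hypothetical values for one information product.
example_values = {name: 0.5 for name in CHARACTERISTICS}
example_values["Ac"] = 0.8  # "easy access" on the availability scale
example_vector = build_characteristic_vector(example_values)
```

Rejecting out-of-range values early keeps expert-assigned scores (such as the availability and cost scales) consistent with the computed ratios.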

Step 2. The weights of the InP characteristics are determined on the basis of expert evaluation. To do this, we use $X = \{x_1, x_2, \ldots, x_n\}$, a set of experts; $Y = \{y_1, y_2, \ldots, y_p\}$, a set of characteristics; and $Ip = \{Ip_1, Ip_2, \ldots, Ip_m\}$, a set of information products. The membership function of the fuzzy binary relation R is defined as $F_R: X \times Y \to [0,1]$.

Then for all $x \in X$, $y \in Y$, the function $F_R(x, y)$ is the degree of importance of the characteristic y according to the evaluation of the expert x when determining the advantage of a particular InP. The membership function of the fuzzy binary relation H is defined as $d_H: Y \times Ip \to [0,1]$. For all $y \in Y$, $Ip_i \in Ip$, $d_H(y, Ip_i)$ is equal to the degree of influence of the characteristic y on the information product $Ip_i$. We then form the matrix of characteristics:

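$$\begin{pmatrix}
d_H(y_1, Ip_1) & d_H(y_1, Ip_2) & \cdots & d_H(y_1, Ip_m) \\
d_H(y_2, Ip_1) & d_H(y_2, Ip_2) & \cdots & d_H(y_2, Ip_m) \\
\vdots & \vdots & \ddots & \vdots \\
d_H(y_n, Ip_1) & d_H(y_n, Ip_2) & \cdots & d_H(y_n, Ip_m)
\end{pmatrix},$$

the elements of which are determined by the membership function for a certain sphere of use:

$$\mu(y, Ip_i) = \frac{\sum_{x} F_R(x, y)\, d_H(y, Ip_i)}{\sum_{x} F_R(x, y)}$$

for all $x \in X$, $y \in Y$ and $Ip_i \in Ip$,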

that is, the basis for constructing a classification rule.

Then the importance of the characteristics Y is determined by the vector R:

$$R = \begin{pmatrix}
\mu_1(y_1, Ip_1) \wedge \mu_2(y_1, Ip_2) \ldots \mu_{m-1}(y_1, Ip_{m-1}) \wedge \mu_m(y_1, Ip_m) \\
\mu_1(y_2, Ip_1) \wedge \mu_2(y_2, Ip_2) \ldots \mu_{m-1}(y_2, Ip_{m-1}) \wedge \mu_m(y_2, Ip_m) \\
\vdots \\
\mu_1(y_{10}, Ip_1) \wedge \mu_2(y_{10}, Ip_2) \ldots \mu_{m-1}(y_{10}, Ip_{m-1}) \wedge \mu_m(y_{10}, Ip_m)
\end{pmatrix}. \quad (32)$$

By taking into account the rank of each characteristic, the Kendall concordance coefficient was modified to match the cardinal weights of the experts:

$$W = \frac{12}{m^2 (n^3 - n)} \sum_{j=1}^{n} \left( \sum_{i=1}^{m} R_{ij} - \frac{m(n+1)}{2} \right)^2, \quad (33)$$

where n is the number of analyzed InPs, m is the number of experts, and $R_{ij}$ is the rank of the j-th characteristic of the InP assigned to it by the i-th expert.

Step 3. The viability of the InP is defined as an integral measure:

$$G = \sum_i w_i y_i, \quad \sum_i w_i = 1, \quad G \to [0 \ldots 1].$$

As a result of the analysis, we combine the decisions S of the experts. $E_{k1}$ is the set of experts who judged the information product Ip to have low viability, and $E_{k2}$ is the set of experts who recognized the InP as necessary and judged it to have high viability.

Definition. If $E_{k1} \cap E_{k2} = \emptyset$ and $E_{k1} \cup E_{k2} = \{1, \ldots, N\}$, then the decision S is defined as:

$$S = \begin{cases}
\text{unviable}, & \text{if } \prod_{i \in I_1} G_i \prod_{i \in I_2} (1 - G_i) > \prod_{i \in I_2} (1 - W_i) \prod_{i \in I_1} W_i, \\
\text{viable}, & \text{if } \prod_{i \in I_1} G_i \prod_{i \in I_2} (1 - G_i) < \prod_{i \in I_2} (1 - W_i) \prod_{i \in I_1} W_i.
\end{cases}$$

Also, in this article, the risk factors of InPs and their effect on the viability of InPs at various stages of the life cycle are determined.

Definition. The InP risk factor is a situational characteristic of the InP that leads to an uncertain outcome and to adverse consequences due to distortion of information or non-relevant search results (Table 3).

Table 3. Risk factors of InP

| Type of risk factor | Example of risk factor |
|---|---|
| Risk factor of creation | Incorrect relationships and relations; selection of means of implementation; determination of ownership/tenure |
| Risk factor of use | Unauthorized familiarization and use (in particular, copying); unauthorized linking; unauthorized modification; deliberate destruction of information |
| Risk factor of spreading | Correctness of InP application; choice of format |

The use of these risk factors allows us to predict the availability of InP information for the whole Big data catalogue.

7. Discussion and future work with uncertainty reduction in Big data

Thus, a new model of consolidated data was developed for the Big data catalogue. This approach allows us to collect data from duplicated sources. The method of uncertainty reduction can be used for different tasks in decision support systems. For example, it should be the first step in the data cleaning and classification process. The model of consolidated data can be used for system documentation and automated metadata creation. The advantage of the method for determining the viability of the information product is the possibility to find useful information products in the Big data catalogue in the presence of duplicated data and to find the value of the risk factor. In contrast to the method of Learning from Uncertainty for Big Data, which handles only large-scale missing values in big data, the proposed method also works with indeterminacy. However, it is very difficult to reduce linguistic uncertainty. Future work will focus on finding the correlation between the value of the risk factor and the type of uncertainty.

8. Conclusions

1. A model of consolidated data was proposed that extends the model of relations with indeterminacy. It makes it possible to process data with different types of uncertainty. The operations over relations with indeterminacy were improved for use in a data warehouse with consolidated data, which made it possible to implement the unary operations of the Big data catalogue. This allows all types of uncertainty in Big data and in the Big data catalogue to be preprocessed.

2. A method was created for reducing the indeterminacy of data in the repository of consolidated data, as a basis for further evaluation of the quality of consolidated data. The method is also useful for decision making: it searches for hidden dependencies between the characteristics of the consolidated data repository, and such dependencies should be considered when making decisions based on consolidated data. As a result, uncertainty is reduced when assessing the viability of the information product.

3. A method for determining the viability of the information product was created. It allows us to find useful information products in the Big data catalogue in the presence of duplicated data and to find the value of the risk factor.

References

1. Shakhovska N. B., Bolubash Y. J., Veres O. M. Big data federated repository model // The Experience of Designing and Application of CAD Systems in Microelectronics. 2015. doi: 10.1109/cadsm.2015.7230882

2. Zadeh L. The concept of a linguistic variable and its application to the adoption of approximate solutions. New York, 1976. 166 p.

3. Tselmer G. Risk consideration in management decisions // Problems of ICSTI. 1980. Issue 3. P. 94-105.

4. Knight F. K. Risk, uncertainty and profit. Moscow: Business, 2003. 358 p.

5. Moiseyev N. N. Elements of the theory of optimum systems. Moscow: Science, 1975. 528 p.

6. Trukhachev R. I. Decision-making models in the conditions of uncertainty. Moscow: Science, 1981. 151 p.

7. Shakhovska N., Medykovsky M., Stakhiv P. Application of algorithms of classification for uncertainty reduction // Przeglad Elektrotechniczny. 2013. Vol. 4, Issue 89. P. 284-286.

8. Learning ELM-Tree from big data based on uncertainty reduction / Wang R., He Y.-L., Chow C.-Y., Ou F.-F., Zhang J. // Fuzzy Sets and Systems. 2015. Vol. 258. P. 79-100. doi: 10.1016/j.fss.2014.04.028

9. MRPR: A MapReduce solution for prototype reduction in big data classification / Triguero I., Peralta D., Bacardit J., Garcia S., Herrera F. // Neurocomputing. 2015. Vol. 150. P. 331-345. doi: 10.1016/j.neucom.2014.04.078

10. Karami A. A Framework for Uncertainty-Aware Visual Analytics in Big Data // In AIC. 2015. P. 146-155.

11. Wang X., He Y. Learning from Uncertainty for Big Data: Future Analytical Challenges and Strategies // IEEE Systems, Man, and Cybernetics Magazine. 2016. Vol. 2, Issue 2. P. 26-31. doi: 10.1109/msmc.2016.2557479

12. Taming Uncertainty in Big Data / Bendler J., Wagner S., Brandt T., Neumann D. // Business Information Systems Engineering. 2014. Vol. 6, Issue 5. P. 279-288. doi: 10.1007/s12599-014-0342-4

13. Veres O., Shakhovska N. Elements of the formal model big date // In Perspective Technologies and Methods in MEMS Design (MEMSTECH). 2015 XI International Conference. 2015. P. 81-83.

14. Vovk O. B., Shakhovska N. B. Formation of the factors influencing the behavior of the information product // Radioelectronics, informatics, management. 2015. Issue 2. P. 43-53.
