Time Travel: 2014

Chapter 348 Huge invisible wealth (continued)

There are two reasons why I say it is not an exaggeration:

1. Even in his previous life, the cost of data annotation only dropped significantly after special learning techniques such as dual learning appeared.

Before this, the word "cheap" had nothing to do with data annotation.

Take the example Lin Hui cited earlier as a reference:

In 2017 of his previous life, 10 million bilingual translation annotations cost approximately US$22 million.

Note that this covers bilingual translation annotation only.

"Bilingual translation annotation" simply means annotation for translation between two languages.

Annotation for translation between just two languages costs more than US$20 million?

How much would it cost, then, to cover translation among a hundred or more languages?

This is not complicated; it is a simple combinatorics problem:

C(100, 2) = 4,950 language pairs; 4,950 × US$22 million = US$108.9 billion.

It is not hard to see that supporting translation among a hundred languages would push the cost of manually annotating the training sets past US$100 billion.
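The arithmetic above can be checked with a few lines of Python. This is only a sketch of the estimate in the text: the figure of US$22 million per language pair and the count of 100 languages come from the passage, the assumption that every pair costs the same is the passage's own simplification, and the names `annotation_cost`, `COST_PER_PAIR_USD` and `NUM_LANGUAGES` are made up for illustration.

```python
from math import comb

# Figures taken from the text; a uniform price per pair is the text's simplification.
COST_PER_PAIR_USD = 22_000_000  # 10 million bilingual annotations, 2017 estimate
NUM_LANGUAGES = 100             # number of languages used in the estimate


def annotation_cost(num_languages: int, cost_per_pair: int) -> tuple[int, int]:
    """Return (number of unordered language pairs, total cost in USD)."""
    pairs = comb(num_languages, 2)  # C(n, 2): each pair of languages counted once
    return pairs, pairs * cost_per_pair


pairs, total = annotation_cost(NUM_LANGUAGES, COST_PER_PAIR_USD)
print(f"{pairs} language pairs, about US${total / 1e9:.1f} billion")
```

Because the pair count grows roughly with the square of the number of languages, doubling the languages covered would roughly quadruple the bill under the same assumption.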

And this is only an estimate under ideal conditions. If the annotation were actually carried out pair by pair, the real cost would be far higher.

After all, translation between many minor languages obviously costs more than translation between mainstream languages.

Of course, in practice nobody would really be foolish enough to annotate every pair among hundreds of languages one by one.

Even so, this estimate is enough to show that data annotation will remain expensive for a long time.

By the same token, data annotation in this timeline is still expensive today.

And because machine learning research in this timeline lags behind, data annotation now costs even more than it did in the same period of his previous life.

2. The times are developing rapidly. A scientific calculator that can be bought today in any stationery and sporting-goods store beats, in real-world efficiency, reliability and ease of use, the computers of the 1950s and 1960s that cost tens of millions of dollars to build and covered hundreds or even thousands of square meters.

By that logic, the dirt-cheap calculators of later generations would still have found a market a few decades earlier even at a price of millions of dollars, and they might well have been quite competitive.

This example does not mean that Lin Hui intends to go back a few decades and sell calculators.

Lin Hui just wanted to use this to show that the wheel of the times is moving forward, and technology is also developing rapidly.

Especially in the post-Internet era, it is no exaggeration to say that the development of science and technology is changing with each passing day.

In this case, it is normal for a technology that will not be especially valued a few years from now to be exchanged for a great deal of wealth a few years earlier.

All the more so when what is being exchanged is annotated data, something that for a long time only deep-pocketed companies could afford to play with.

In short, Lin Hui didn’t see anything wrong with the estimate that “10 million pieces of bilingual annotated data would cost two to three billion US dollars now.”

Even though a "price of two to three billion US dollars" may strike others as wildly exaggerated, Lin Hui's estimate may in fact be a bit conservative.

The artificial intelligence industry is mainly made up of three layers: the application layer, the technology layer and the foundation layer.

The application layer contains solutions and product services.

The technology layer includes application technology, algorithm theory and platform frameworks.

The foundation layer contains infrastructure and data.

From this perspective, data can even be considered the cornerstone of artificial intelligence to some extent.

And that is indeed the case.

Artificial intelligence is pulled along by a troika: algorithms, computing power, and data.

Algorithms look all-important, but without high-quality data it is often hard to train a high-quality model.

Although data is usually invisible and intangible, no one can ignore the importance of data.

Labeled data is especially important.

At present, supervised machine learning is still the main way of learning and training neural networks.

Supervised machine learning is inseparable from labeled data.

Supervised machine learning requires labeled data as prior experience.

In supervised machine learning, the labeled data is divided into a training set and a test set in proportion.

The machine obtains a model by learning from the training set, and then makes predictions on the test set to measure the model's accuracy.

The algorithm engineers look for the model's shortcomings in the test results and feed the data problems back to the annotators, and the process repeats until the model's metrics meet the requirements for going live...
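The workflow described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not anything Lin Hui or his company actually used: the data is synthetic, the "model" is a simple nearest-centroid classifier, and every name here (`make_labeled_data`, `split`, `train`, `predict`, `accuracy`, `REQUIRED_ACCURACY`) is hypothetical.

```python
import random

REQUIRED_ACCURACY = 0.95  # hypothetical bar a model must clear before going live


def make_labeled_data(n, rng):
    """Synthetic labeled data: two noisy 2-D clusters, one per label."""
    data = []
    for _ in range(n):
        label = rng.choice([0, 1])
        center = 0.0 if label == 0 else 3.0
        point = (rng.gauss(center, 1.0), rng.gauss(center, 1.0))
        data.append((point, label))
    return data


def split(data, train_ratio, rng):
    """Shuffle the labeled data and split it into training and test sets."""
    shuffled = data[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


def train(train_set):
    """Learn one centroid per label from the training set."""
    sums = {}
    for (x, y), label in train_set:
        sx, sy, n = sums.get(label, (0.0, 0.0, 0))
        sums[label] = (sx + x, sy + y, n + 1)
    return {label: (sx / n, sy / n) for label, (sx, sy, n) in sums.items()}


def predict(model, point):
    """Return the label whose centroid is closest to the point."""
    def sq_dist(label):
        cx, cy = model[label]
        return (point[0] - cx) ** 2 + (point[1] - cy) ** 2
    return min(model, key=sq_dist)


def accuracy(model, test_set):
    """Fraction of test examples whose predicted label matches the annotation."""
    correct = sum(1 for point, label in test_set if predict(model, point) == label)
    return correct / len(test_set)


rng = random.Random(0)  # fixed seed so the run is repeatable
labeled = make_labeled_data(1000, rng)
train_set, test_set = split(labeled, 0.8, rng)
model = train(train_set)
acc = accuracy(model, test_set)
print(f"train={len(train_set)} test={len(test_set)} accuracy={acc:.2%}")
print("ready to go live" if acc >= REQUIRED_ACCURACY else "send errors back to annotators")
```

In a real project the last step is where the loop closes: if the accuracy falls short of the bar, the misclassified examples go back to the annotators, the data is corrected or extended, and training runs again.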

In the current situation where there are almost no applications of unsupervised learning, large-scale, high-quality manually annotated data sets can even be said to be an urgent need for the development of the current machine learning industry.

In this case, the importance of data and annotated data cannot be overemphasized.

That is why Lin Hui suspects he has underestimated the price.

Still, the exact valuation hardly matters for now. If it really comes to selling annotated data, the specific price can be negotiated slowly.

Lin Hui needs a lot of money, but if he ends up negotiating with some of the super giants in the future, he may not insist on cash.

Exchanging the data for resources that Lin Hui is interested in is not out of the question.

To be honest, some of the resources of these top giants are quite tempting to Lin Hui.

As for the annotated data that Lin Hui actually has on hand:

When it came to online text translation, Lin Hui almost immediately thought of the SimpleT software on the mobile phone in his previous life.

SimpleT is a software developed and tested by Lin Hui's company in his previous life.

This software is not well known because it is still in alpha (internal) testing.

The purpose of alpha testing is to evaluate the functionality, localization, usability, reliability, performance and support of a software product.

It pays particular attention to the product's interface and features.

Alpha testing can begin once coding of the software product is finished.

It can also begin after module (subsystem) testing is completed.

Or it can begin once the product is confirmed, during testing, to have reached a certain level of stability and reliability.

The alpha internal test of SimpleT software was started after it was confirmed that SimpleT reached a certain level of stability and reliability.

So although SimpleT is still in internal testing, the software is technically quite mature, and it is almost only one round of public beta away from its official launch.

Lin Hui had originally planned to replicate such a piece of software when the time was right and use it to enter the translation software market.

Now that he has noticed the special value of annotated data, Lin Hui again thinks of SimpleT almost immediately.

After all, as a software that focuses on AI translation, it naturally uses a large amount of bilingual translation annotation data during its training process.

And SimpleT, although never officially launched, was one of the products the company had high hopes for in his previous life.

Lin Hui believes that the annotated data used in developing this software can definitely be found in the company data from his previous life.

In that case, he could simply take the annotated data the company used to train SimpleT and exchange it for money. Wouldn't that achieve twice the result with half the effort?
